Confusion about the Confusion Matrix

- Understanding machine learning accuracy for the benefit of collection system maintenance

- How to adopt machine learning in a crawl/walk/run/sprint fashion

A few years ago, Keanu Reeves was spotted in Alameda, CA eating some ice cream.  Why was he here?  What kind of ice cream was it?  What was he wearing?  Why did the internet (or at least the internet around me) care so much??  And why do women think he’s so good looking??  I was confused.  Ultimately it appears he was here because another version of “The Matrix” was in the works.  Was this just my “confusion matrix”?

In traditional regression analysis, the R-squared value gives a measure of how accurate our chosen formula is at predicting the dependent variable.  The most straightforward regression is linear, but more complex nonlinear analysis can be done as well.  The closer R-squared is to 1, the better our chosen formula is at predicting outcomes of the dependent variable using one or more independent variables.
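To make that concrete, here is a minimal R-squared sketch in Python with scikit-learn.  The pipe ages, diameters, and cleaning counts are invented for illustration only:

```python
# A minimal R-squared sketch with scikit-learn.  All numbers are invented.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical independent variables: pipe age (years) and diameter (inches).
X = np.array([[12, 8], [35, 6], [50, 8], [22, 10], [61, 6], [8, 12]])
# Hypothetical dependent variable: cleanings needed per decade.
y = np.array([1, 3, 4, 2, 5, 1])

model = LinearRegression().fit(X, y)

# R-squared near 1 means the formula explains most of the variance;
# near 0 means it explains almost none.
print(f"R-squared: {r2_score(y, model.predict(X)):.3f}")
```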

Gravity main cleaning is a significant expense for collection systems, with the goal of avoiding or minimizing sanitary sewer overflows (SSOs).  There’s evidence that many gravity mains are cleaned unnecessarily.  To make an efficient model, we need to understand why some gravity mains require more or less cleaning than others.

The first-level analysis is straightforward:  if a gravity main has failed before (had an SSO), it’s one of the most likely gravity mains to fail again.  Thus, gravity mains that have had SSOs are (rightfully) put on hot spot cleaning lists.

But if a gravity main has not had an SSO before, how do you distinguish one from another?  What are the geospatial (location-based) theories of causation?  Why do some mains need more cleaning than others?

A regression model can be built using age, size, and material data – those are important factors.  And for grease, collection systems do a good job managing FSEs (food service establishments) and their grease traps.  But SSOs on gravity mains are caused by many things beyond age, size, material, and proximity to FSEs.  Residences contribute FOG (fats, oils, grease) to mains for a variety of potential reasons (density, demographics, etc.).  Various tree and weather conditions generate roots differently.  Debris sources can vary.  Slope of gravity mains must matter to some extent.

Traditional regression models struggle as independent variables multiply and interact.  If data sources can be found for these geospatial theories of causation, then a sophisticated machine learning model can be developed for a collection system of virtually any size.  There is no practical limit to the number of independent variables that can be fed into a machine learning model.
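As a sketch of what that might look like in practice, here is one way to train such a model in Python with scikit-learn.  The feature names and the segments.csv file are hypothetical stand-ins for whatever geospatial data sources a collection system can assemble:

```python
# A sketch of a many-variable model.  Feature names and "segments.csv"
# are hypothetical stand-ins for real geospatial data sources.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

feature_columns = [
    "age_years", "diameter_in", "material_code", "slope_pct",
    "fse_within_500ft", "residential_density", "tree_canopy_pct",
    "annual_rainfall_in", "debris_history_score",  # ...and as many more as you can find
]

segments = pd.read_csv("segments.csv")   # one row per gravity main segment
X = segments[feature_columns]            # independent variables
y = segments["needs_cleaning"]           # dependent variable (0/1)

# Hold out a quarter of the segments to check the model honestly.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```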

So how would one judge the accuracy of such a machine learning model?  This is where it can get “confusing,” but a crawl/walk/run/sprint adoption approach can get you through it.

A confusion matrix looks like this:

Whew, now that IS confusing, and that’s not even “him.”

Here’s the real machine learning confusion matrix:

|                          | Predicted: needs cleaning | Predicted: no cleaning needed |
|--------------------------|---------------------------|-------------------------------|
| Actually needed cleaning | True-positive             | False-negative                |
| Actually did not         | False-positive            | True-negative                 |

We build a machine learning model with many independent variables.  Our goal is to predict the dependent variable:  in this case, whether each gravity main segment needs to be cleaned.  Each segment’s prediction then lands in one of the four cells of the confusion matrix, defined below (with a code sketch after the definitions).

True-positive:  our model predicted that the gravity main segment needed to be cleaned, and when we cleaned it, we found that it did need cleaning.

True-negative:  our model predicted that the gravity main segment did NOT need to be cleaned, and the evidence showed that this was indeed the case.

Maximizing true-positives and true-negatives is the goal.

False-negative:  our model predicted that the gravity main segment did NOT need to be cleaned, and evidence showed that it DID.  The worst outcome.

False-positive:  our model predicted that the gravity main segment did need to be cleaned, but when we cleaned it, we found it was already clear.  An inefficient outcome.
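In code, those four counts fall straight out of scikit-learn’s confusion_matrix.  The labels below are tiny invented examples (1 = needed cleaning, 0 = did not):

```python
# Counting the four outcomes with scikit-learn.  Labels are invented.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # what the crews actually found
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # what the model said beforehand

# scikit-learn lays the binary matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"True-positives:  {tp} (cleaned it, and it needed it)")
print(f"True-negatives:  {tn} (skipped it, rightly)")
print(f"False-negatives: {fn} (skipped it, but it needed cleaning -- the worst)")
print(f"False-positives: {fp} (cleaned it, but it was already clear)")
```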

In a machine learning model, our goal is to find the true-positives as efficiently as possible while balancing the other predicted outcomes, and that balance across classification thresholds is measured through an “area under the ROC curve” (AUC) analysis.  Google has an excellent reference for AUC:  https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
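For a quick illustration, scikit-learn’s roc_auc_score computes AUC from the model’s predicted probabilities rather than hard yes/no calls; the scores below are invented for demonstration:

```python
# AUC from predicted probabilities.  Scores are invented for illustration.
from sklearn.metrics import roc_auc_score

y_actual = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground truth
y_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # model's predicted probabilities

# 1.0 means the model always ranks mains that needed cleaning above those
# that didn't; 0.5 means it's no better than a coin flip.
print(f"AUC: {roc_auc_score(y_actual, y_scores):.2f}")
```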

But all the mathematical explanation and Silicon Valley gobbledygook in the world isn’t going to get a collection system to just adopt this wholesale overnight.  What is the crawl/walk/run/sprint adoption cycle for this?  How does a collection system gain the confidence in this output and then slowly adopt the data science analysis into the operations of the cleaning crew?

Step 1: the crawl scenario.  Change nothing in your cleaning plans, but have the machine learning analysis in hand as you go about your normal operational plans.  Get a feel for how well the model is performing in the “confusion matrix” and compare that to how cleaning plans are done currently.  No model is perfect, but is the machine learning model an improvement on the current approach?  In hindsight, are there gravity mains we could have skipped this time around to focus on other high priority mains elsewhere?

Step 2:  the walk scenario.  Use the machine learning model as a “tie breaker” for unknown or contentious gravity mains where there is uncertainty from current models (or in easements that go through the home of the ferocious German shepherd!).  If there is disagreement about whether a gravity main should be cleaned so aggressively, what does the machine model say?

Step 3:  the run scenario.  As confidence is gained in the machine learning model, adjust new plans based on the machine learning outputs.  Could you cut the cleaning of that basin by 10-20% with high confidence that those crews would deliver more benefit doing some other activity with that time?  (For example, more CCTV in exchange for 10-20% less cleaning.)

Step 4:  the sprint scenario.  Machine learning models themselves, and the confidence in them, have a flywheel-type effect.  The more data that goes in, the more accurate the model becomes.  The more management and crews gain confidence in the model and work their own way through the confusion matrix, the more the model can be relied on.  Ultimately, a machine learning model can generate temporal (time-based) degradation curves for each gravity main segment.  This can enable a risk-based cleaning planning process – the Maytag repairman scenario, where maintenance is done just before a risk threshold is surpassed.  On a scale of 1-5 (5 worst), if the risk profile of the collection system is to clean all pipes BEFORE they become a “3”, how do you set cleaning plans to accomplish that?
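As a rough sketch of what sprint-stage planning could look like, assume (hypothetically) that the model gives each segment a current risk score on that 1-5 scale plus a linear degradation rate per month.  Scheduling then becomes “clean before the score hits 3”:

```python
# A sprint-stage scheduling sketch under hypothetical assumptions: each
# segment has a current risk score (1-5 scale, 5 worst) and a linear
# degradation rate per month, and we clean before it reaches 3.
def months_until_threshold(risk_now: float, risk_per_month: float,
                           threshold: float = 3.0) -> float:
    """Months left before a linearly degrading segment reaches the threshold."""
    if risk_now >= threshold:
        return 0.0
    return (threshold - risk_now) / risk_per_month

# Hypothetical segments: (segment id, current risk, risk added per month)
segments = [("GM-101", 2.6, 0.10), ("GM-102", 1.2, 0.05), ("GM-103", 2.9, 0.20)]

# Clean the soonest-to-threshold segments first.
for seg_id, risk, rate in sorted(segments, key=lambda s: months_until_threshold(s[1], s[2])):
    print(f"{seg_id}: clean within {months_until_threshold(risk, rate):.1f} months")
```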

Anecdotal evidence suggests that a completely optimized cleaning approach planned in this fashion could cut 80% off current cleaning plans.  Big number, exciting.  Maybe.  We need to start crawling first, though!