Wrangle that data to enable data science-driven benefits!
How do you leverage your GIS, CMMS, SmartCover, and CCTV data to drive OPEX/CAPEX efficiencies?
Collection system agencies have invested millions of dollars over the past decade to digitize geographic assets, work order processes, and condition assessments. You have invested in GIS, CMMS, and CCTV – so what's the starting point for maximizing the return on those investments?
While machine learning and artificial intelligence get the hype and headlines, as much as 80% of any data science opportunity falls within the scope of "data wrangling."
Catching digital cows? Putting your data in a new pair of jeans? Sure!
Rope me up some data! It’s as easy as 1-2-3
The term "wrangling" is generally connected to the idea of herding and caring for cattle. Data needs similar "herding" and "caring" to make it useful. Wikipedia defines data wrangling as "the process of transforming and mapping data from one 'raw' data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics." This is more than just "cleansing" your data. A quick Google search for data wrangling will turn up a number of good definitions, but a good summary is this article written by Tim Stobierski for Harvard Business School Online: https://online.hbs.edu/blog/post/data-wrangling
What does this mean for collection systems and optimizing processes using GIS, CMMS, and CCTV data? Let's take the example of using data to optimize the dollars spent avoiding sanitary sewer overflows (SSOs).
The basic data wrangling steps for such a process are: 1) improving and validating the quality of the data, 2) joining all available internal data, and then 3) bringing in and joining relevant external data to enrich the dataset.
1. Finding and solving internal data Garbage In, Garbage Out (GIGO) issues – "cleaning" the data
"GIGO" has been around as a computing term since the late 1950s, and it's as applicable in today's data science field as ever. An agency's GIS database is where it all starts. This is the geographic representation of an agency's assets: the location, size, material, and age of assets such as underground pipes. Are all the ages (installation years), materials, and sizes correct? Are some of those fields missing or blank? Are there mismatches between age and material that don't make sense (for example, PVC pipe installed in 1940 – one of those data points must be wrong)? Are like entries consistent ("null," "0", and blank are three different things digitally)?
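For illustration, here is a minimal sketch of what that kind of attribute audit might look like in Python with pandas, assuming a hypothetical GIS export with install_year, material, and diameter_in fields:

```python
# A sketch of a GIS attribute audit with pandas. The file name and the
# column names (install_year, material, diameter_in) are hypothetical.
import pandas as pd

mains = pd.read_csv("gis_gravity_mains.csv", dtype=str)

# "null," "0", and blank are three different things digitally, so
# normalize all of them to one missing-value marker first.
placeholders = {"null": None, "NULL": None, "0": None, "": None}
for col in ["install_year", "material", "diameter_in"]:
    mains[col] = mains[col].str.strip().replace(placeholders)

# How many segments are missing each attribute?
print(mains[["install_year", "material", "diameter_in"]].isna().sum())

# Flag age/material mismatches, e.g. PVC pipe "installed" in 1940.
mains["install_year"] = pd.to_numeric(mains["install_year"], errors="coerce")
suspect = mains[(mains["material"].str.upper() == "PVC") &
                (mains["install_year"] < 1950)]
print(f"{len(suspect)} PVC segments with implausible installation years")
```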
GIS data issues need to be identified and remedied by the best means available – for example, if installation years are missing, estimates can be made from parcel, building, or adjacent-segment ("neighbor") data. Most collection agencies believe their GIS is high quality, but a data wrangling process starts with a thorough audit of GIS quality and completeness – which often produces surprising results.
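One estimation strategy, sketched below with hypothetical up_mh/dn_mh columns naming each segment's upstream and downstream manholes, is to fill a missing installation year from the known years of segments sharing a manhole:

```python
# A sketch of one remediation strategy: estimate a missing installation
# year from the segments that share a manhole. The up_mh/dn_mh columns
# are hypothetical.
import pandas as pd

def estimate_missing_years(mains: pd.DataFrame) -> pd.DataFrame:
    known = mains.dropna(subset=["install_year"])

    # Average install year of known segments touching each manhole.
    mh_years = pd.concat([
        known[["up_mh", "install_year"]].rename(columns={"up_mh": "mh"}),
        known[["dn_mh", "install_year"]].rename(columns={"dn_mh": "mh"}),
    ]).groupby("mh")["install_year"].mean()

    # Prefer the upstream-manhole estimate, fall back to downstream.
    est = mains["up_mh"].map(mh_years).combine_first(mains["dn_mh"].map(mh_years))
    mains["install_year_est"] = mains["install_year"].fillna(est.round())
    # Flag estimated values so the assumption stays auditable downstream.
    mains["year_is_estimated"] = mains["install_year"].isna() & est.notna()
    return mains
```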
CMMS platforms are built on top of GIS to generate and track work orders and manage basic asset records. Leading collection agencies also use their CMMS to collect data and observations from preventative maintenance activities, such as the routine cleaning of gravity sewer mains. Is this data being brought back into the CMMS in a consistent, high-quality fashion? Are your field maintenance activities consistent with your SSMP commitments?
For CCTV data, standards such as NASSCO's PACP substantially remedy the quality issue, but ensuring consistency across crews, vendors, and software packages is still important. Many agencies have historical CCTV data collected with different software (WinCan, Granite XP, etc.) – these all need to be normalized into a common format to be useful as a collective data set.
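As a sketch of what that normalization could look like, with hypothetical column layouts and units for each package's export (real exports vary by version and configuration):

```python
# A sketch of normalizing historical CCTV exports into one common schema.
# The per-package column names and units below are hypothetical.
import pandas as pd

COLUMN_MAPS = {
    "wincan":    {"Section_ID": "segment_id", "Obs_Code": "pacp_code",
                  "Distance_m": "chainage_ft"},
    "granitexp": {"PipeID": "segment_id", "PACP": "pacp_code",
                  "Footage": "chainage_ft"},
}

def normalize_cctv(path: str, source: str) -> pd.DataFrame:
    df = pd.read_csv(path).rename(columns=COLUMN_MAPS[source])
    df["source"] = source
    if source == "wincan":  # this hypothetical export records meters
        df["chainage_ft"] = df["chainage_ft"] * 3.28084
    return df[["segment_id", "pacp_code", "chainage_ft", "source"]]

cctv = pd.concat([
    normalize_cctv("wincan_export.csv", "wincan"),
    normalize_cctv("granitexp_export.csv", "granitexp"),
], ignore_index=True)
```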
2. Breaking down the digital silos – joining all your data
GIS, CMMS, and CCTV software systems come from different vendors and are managed on different platforms and in different departments. To enable cross-platform analysis among these systems, an asset ID must be identified and verified as common, consistent, and accurate across all data sets. Sewer gravity main segments are typically identified by manhole-to-manhole numbers. Are all those segment IDs common, consistent, and accurate in GIS, CMMS, and all CCTV data? This "joining" process enables cross-platform analytics and visualization of maintenance trends (mains cleaned/not cleaned in the last two years, mains cleaned under an asset ID not found in GIS, etc.).
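Here is a minimal sketch of that join in pandas, using hypothetical file and column names, and surfacing the two mismatch examples above:

```python
# A sketch of the join across GIS, CMMS, and CCTV on a common segment ID.
# File and column names are hypothetical.
import pandas as pd

gis  = pd.read_csv("gis_segments.csv")       # one row per gravity main
cmms = pd.read_csv("cmms_cleanings.csv")     # cleaning work orders
cctv = pd.read_csv("cctv_observations.csv")  # normalized CCTV data

# Work orders whose asset ID is not found in GIS.
orphans = cmms[~cmms["segment_id"].isin(gis["segment_id"])]
print(f"{len(orphans)} cleaning records reference segments missing from GIS")

# Mains cleaned / not cleaned in the last two years.
cmms["completed"] = pd.to_datetime(cmms["completed"])
cutoff = pd.Timestamp.now() - pd.DateOffset(years=2)
recent = cmms.loc[cmms["completed"] >= cutoff, "segment_id"].unique()
gis["cleaned_last_2yr"] = gis["segment_id"].isin(recent)

# One analytics-ready row per segment, with maintenance and CCTV counts.
joined = (
    gis.merge(cmms.groupby("segment_id").size().reset_index(name="n_cleanings"),
              on="segment_id", how="left")
       .merge(cctv.groupby("segment_id").size().reset_index(name="n_cctv_obs"),
              on="segment_id", how="left")
)
```

With everything keyed to one segment ID, each of those maintenance-trend questions becomes a simple filter or aggregation on a single table.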
3. Enriching the dataset for each asset with external data
Sophisticated data science leverages large data sets to find and weigh correlations with underlying causal variables. To move from reactive maintenance to proactive, targeted, efficient maintenance, it's critical to understand some of the underlying causes behind the differing maintenance needs of gravity sewer mains.
Why do some gravity main pipe segments need to be cleaned more often than others? Existing systems only track basic relationships – fats, oils, and grease (FOG) in gravity mains near food service establishments (FSEs), for example. What about residential contribution to FOG? Does the slope of a pipe segment have something to do with its need for cleaning? Does population density along a pipe segment matter? What kinds of trees create the worst root-intrusion issues?
Relevant external data sets are readily available and can be geospatially joined with the internal data sets to create a "deep, rich" dataset for each gravity main pipe segment – which in turn enables sophisticated analysis such as machine learning, where each asset's unique characteristics are analyzed and weighed.
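Here is a minimal sketch of one such geospatial enrichment using GeoPandas, with hypothetical external layers (parcels carrying land-use codes and a street-tree inventory) and the assumption of a projected CRS measured in feet:

```python
# A sketch of geospatial enrichment with GeoPandas. The external layers
# and the feet-based CRS assumption are hypothetical stand-ins for
# whatever is locally available.
import geopandas as gpd

mains   = gpd.read_file("gravity_mains.geojson")
parcels = gpd.read_file("parcels_landuse.geojson").to_crs(mains.crs)
trees   = gpd.read_file("street_trees.geojson").to_crs(mains.crs)

# Count food-service parcels within 100 ft of each segment (FOG proxy).
fse = parcels[parcels["landuse"] == "food_service"]
buffered = mains.copy()
buffered["geometry"] = mains.geometry.buffer(100)  # CRS assumed in feet
hits = gpd.sjoin(buffered, fse, how="left", predicate="intersects")
mains["n_fse_nearby"] = hits.groupby(level=0)["index_right"].count()

# Nearest street tree's species, a rough proxy for root-intrusion risk.
nearest = gpd.sjoin_nearest(mains, trees[["species", "geometry"]], how="left")
mains["nearest_tree_species"] = nearest.groupby(level=0)["species"].first()
```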
This type of complete 1-2-3 data wrangling process can be done in as little as two to three months. At the end of such a process, the result is a deep, rich, analytics-ready data set for each gravity main segment. What's the optimum maintenance frequency for each of my gravity mains? If I must cut back on maintenance, how do I do that without driving up my SSO rate? Are some pipe segments with specific characteristics effectively self-cleaning? Answering these questions and more starts with a well-wrangled data set for each gravity main segment.
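As a final illustration of where a well-wrangled data set can take you, here is a hedged sketch of a simple scikit-learn model ranking which (hypothetical) segment characteristics best predict cleaning need; it's illustrative, not a production model:

```python
# A sketch of the kind of analysis the wrangled dataset enables: ranking
# which segment characteristics best predict cleaning need. Feature names
# are hypothetical and the model is illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("wrangled_segments.csv")
features = ["slope_pct", "diameter_in", "install_year_est",
            "n_fse_nearby", "pop_density", "n_cctv_obs"]
X, y = df[features].fillna(0), df["needed_cleaning_last_2yr"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("holdout accuracy:", model.score(X_test, y_test))
print(pd.Series(model.feature_importances_, index=features)
        .sort_values(ascending=False))
```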
Let the data wrangling rodeo begin!