I recently attended a workshop in St. Louis—Big Data, Predictive Analytics and the Industrial Internet of Things—sponsored by our partner OSIsoft, Rockwell Automation and Anheuser-Busch (the makers of Budweiser beer, for the uninitiated). The event focused on how industrial companies can deploy new technologies like cloud, machine learning and mobile to turn raw data into analytical information that improves business outcomes.
The presentations were informative, but mostly missing from the discussion was a deeper exploration of how poor data quality makes effective industrial analytics hard to achieve, especially at an enterprise scale.
“Poor data quality makes effective industrial analytics hard to achieve”
We’ve all heard that the Industrial Internet of Things will hasten “an explosion of sensor-generated data across billions of industrial devices and assets.” That’s no secret. What seems to be a secret, though, is the explosion of poor quality data that will ultimately flow into control, manufacturing and enterprise IT systems. Why? Because today’s industrial data, particularly time-series historian data, has accumulated inconsistencies from different and evolving systems over the years. As a result, it has two major challenges: 1) It mostly lives as it has always lived—in varying formats, unlabeled, with inconsistent naming, ending up in cold storage, unavailable for broad use; and 2) You don’t know if you can trust that your sensors are sending you the correct data values—i.e. “is my asset failing, or is my sensor failing?”
This poor data quality makes even the most basic analytics hard to do, let alone data science-driven, machine learning-enabled diagnostic and predictive analytics. For example, without well-organized, trusted data, operators can't easily view the operating history of a fleet of assets and are instead stuck with ad-hoc analysis on a tag-by-tag basis, or with specific tags hardcoded into their dashboards. Fleet-wide views are essential for modern industrial organizations, but getting there requires significant upfront work and data structuring that can be prohibitively time and cost intensive. And if your tags or implementation change, all the analytics and data you've set up must be rebuilt.
“Significant upfront work and data structuring… can be prohibitively time and cost intensive”
Traditional ETL (Extract, Transform, Load) data practices don't solve the data quality problem because they don't address the inherent problems of organizing, normalizing, labeling and cleaning the data at the source. These ETL approaches have also grown too fragmented, are labor-intensive, and require too much customization. They cannot scale to meet the needs of a modern, data-driven industrial organization seeking to combine time-series operating data with data from asset management systems, manufacturing systems, accounting systems and other ERP modules. Data can no longer be a simple one-way trip to a data warehouse or data lake.
Compounding the data quality challenge is the shortage of data science resources across every organization, especially industrial organizations. These scarce data scientists and data analysts are made inefficient when they have to spend 80% of their time wrangling poor quality data to prepare it for modeling and analysis.
So how do we address the industrial data quality problem?
Here are four critical steps to data readiness that overcome these data quality issues:
1. Standardize

Industrial analytics requires knowing which time-series data streams are associated with which piece of equipment and which process in the plant. Because this relational information is not stored alongside the time-series data itself, performing analysis forces operators into "hunter-gatherer" mode: query the data to find the single data stream they want to analyze, then "rinse and repeat" the same onerous process for each subsequent stream. This makes comparing data at the equipment or asset level all but impossible, and advanced analytics harder still. Furthermore, similar assets may not be similarly instrumented—one asset may emit 10 data streams while another, similar asset emits only 5, making analysis difficult unless the data is cleaned up to account for the difference. This lack of well-defined relational information wasn't an issue when a single industrial site contained only a few thousand data streams, but it creates a painful journey for today's operators.
Standardizing time-series data associates each data stream with an asset template, enabling apples-to-apples comparisons between like assets and the ability to combine relational information. Standardization also organizes assets into hierarchical or process-oriented relationships, creating the "graph" of your industrial operations.
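A minimal sketch of this idea in Python. The class names, tag strings and attribute names here are hypothetical illustrations, not any vendor's API: an asset template declares the canonical attributes every asset of a type should expose, and each asset maps those attributes onto its raw historian tags, which also makes under-instrumented assets easy to spot.

```python
from dataclasses import dataclass, field

@dataclass
class AssetTemplate:
    """Canonical attributes every asset of this type should expose."""
    name: str
    attributes: tuple  # e.g. ("temperature", "pressure", "vibration")

@dataclass
class Asset:
    """A concrete asset whose raw historian tags map onto a template."""
    asset_id: str
    template: AssetTemplate
    tag_map: dict = field(default_factory=dict)  # canonical attribute -> raw tag

    def missing_attributes(self):
        # Attributes the template expects but this asset's instrumentation lacks
        return [a for a in self.template.attributes if a not in self.tag_map]

pump = AssetTemplate("Pump", ("temperature", "pressure", "vibration"))
p101 = Asset("P-101", pump, {"temperature": "SITE1.PMP101.TT01",
                             "pressure": "SITE1.PMP101.PT02"})
print(p101.missing_attributes())  # -> ['vibration']
```

Because every pump maps to the same template, a fleet-wide query becomes "get `temperature` for all assets of type Pump" rather than hunting for raw tags one by one.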
2. Contextualize

Event information (i.e., when a machine turned on or off, experienced a surge, or overheated) is essential to understanding the changing state of assets and processes. When things go wrong, event information must be used to help debug the issues at hand. Unfortunately, there is no event log in the way time-series data is currently stored, so engineers must resort to time-series graphs and trend lines to identify events and debug issues. Without the context events provide, it's difficult to identify why problems are occurring; imagine debugging software without event logs, and how difficult root cause analysis would be. For diagnostic and predictive analytics, the events taking place in the data streams need to be surfaced and appropriately labeled.
Contextualization uses several practices, including machine learning, to surface events in time-series data that can then be labeled by a subject matter expert, so the event signatures can be used for diagnostic and predictive analysis. Machine learning-driven contextualization creates the richest repository of events by surfacing like events across years and years of data, and across hundreds, even thousands, of similar equipment types.
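To make the idea concrete, here is an illustrative sketch (not a production event detector): a simple rule surfaces candidate events as contiguous runs above a threshold, which a subject matter expert would then label (e.g. "overheat") so the signatures can feed diagnostic models. The threshold rule stands in for whatever ML technique actually does the surfacing.

```python
def surface_events(samples, threshold):
    """Return (start_index, end_index) pairs for contiguous runs above threshold.

    Each pair is a candidate event for an SME to inspect and label.
    """
    events, start = [], None
    for i, value in enumerate(samples):
        if value > threshold and start is None:
            start = i                      # a candidate event begins
        elif value <= threshold and start is not None:
            events.append((start, i - 1))  # the event ends at the prior sample
            start = None
    if start is not None:                  # event still open at end of data
        events.append((start, len(samples) - 1))
    return events

readings = [20, 21, 95, 97, 96, 22, 20, 99, 21]
print(surface_events(readings, 90))  # -> [(2, 4), (7, 7)]
```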
3. Establish Sensor Trust

Data quality issues related to sensor noise, null or static values, and calibration are a leading contributor to poor outcomes in industrial analytics. Why? Sensor data is inherently noisy, often contains large gaps, and can experience calibration drift over time. When dealing with millions of sensors, these quality issues become significant: some customers report that sensor-level quality issues are the root cause of 40% of the issues they experience in their highly automated operations. Establishing sensor trust requires effectively identifying bad-actor sensors through fleet-wide reporting on null values, data gaps, flatlined sensors, calibration and drift issues, and noise.
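A minimal sketch of what such a sensor health check might compute, covering two of the issues named above (null values and flatlined sensors); gap, drift and noise detection would follow the same pattern. The function and field names are hypothetical.

```python
def sensor_health(values):
    """Flag common sensor-trust issues in one stream of readings.

    Returns the fraction of null readings and whether the sensor appears
    flatlined (every non-null reading identical, suggesting a stuck sensor).
    """
    non_null = [v for v in values if v is not None]
    return {
        "null_fraction": (len(values) - len(non_null)) / len(values),
        "flatlined": len(set(non_null)) <= 1,
    }

report = sensor_health([5.0, 5.0, None, 5.0, 5.0])
print(report)  # -> {'null_fraction': 0.2, 'flatlined': True}
```

Run across a fleet, a report like this surfaces the bad-actor sensors so engineers can answer "is my asset failing, or is my sensor failing?" before the data reaches an analytic.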
4. Manage Data Logistics

Time-series data alone is not sufficient for advanced industrial analytics. Cleansed time-series data must therefore be joined with essential related data from other enterprise systems, including the enterprise work and asset management (or CMMS) system, the laboratory information management system, the supply chain system and the manufacturing execution system. The joined data must be put into a form that allows efficient querying across relevant dimensions, such as extracting one sensor on similar assets at different sites, or multiple sensors from different but interconnected assets.
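As a small illustration (with made-up asset IDs and a hypothetical work-order record), joining cleansed readings with CMMS data on a shared asset ID is what makes cross-dimension queries, like one sensor across similar assets at different sites, straightforward:

```python
# Cleansed, standardized time-series readings (illustrative records).
readings = [
    {"asset_id": "P-101", "site": "STL", "sensor": "temperature", "value": 88.0},
    {"asset_id": "P-202", "site": "NYC", "sensor": "temperature", "value": 74.5},
]
# Open work orders from a CMMS, keyed by the same asset ID (hypothetical data).
work_orders = {"P-101": "WO-5513 (bearing replacement)"}

# Join on asset_id so each reading carries its maintenance context.
joined = [{**r, "open_work_order": work_orders.get(r["asset_id"])} for r in readings]

# Query one sensor on similar assets at different sites.
temps = [(r["site"], r["value"]) for r in joined if r["sensor"] == "temperature"]
print(temps)  # -> [('STL', 88.0), ('NYC', 74.5)]
```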
Managing the logistics of all this data, whether at rest or in motion, requires a Data Supply Chain approach in which data flows between systems in a continuous, high-quality process (Standardize, Contextualize, Establish Sensor Trust, Manage Data Logistics), being cleaned, transformed, combined with other data and given context so it can support high-impact analytics for the business problems being addressed. This is not a one-time process like ETL, because data, and data quality, is constantly changing: new assets and processes come online and replace others. Continuous cleansing and transformation, tailored to the time-series format of sensor-based data, is required. This is an operating principle, and a mental shift, as much as it is a set of tools or processes to adopt.
These data readiness steps will help industrial companies resolve their data quality issues, enabling them to get to self-serve descriptive, diagnostic and predictive analytics quickly and intuitively.