Metadata is simply data about data. Being data itself, metadata can, of course, have its own metadata, and so on. While this is a mathematically elegant construction, it leads straight to the “turtles all the way down” problem.
We are not philosophers (except at some of our happy hour gatherings!), so we need a more practical definition of metadata. One that we can reason about without running into the problem of infinite regress.
To understand what metadata is, let’s talk about 3 categories of data that map naturally onto the world of industrial operations (and almost any other field, but here we will focus on industrial applications). First, we have “fact data”; on top of that, we have “context data”; and finally, we have “ontology” at the pinnacle.
Figure 1: Visualization of the 3 categories of data that make up our worldview of data and metadata
Together these 3 categories of data make up the data landscape you must work with in solving your business problems like predicting failures in equipment, charting overall equipment effectiveness (OEE), analyzing power usage at a chemical plant, or understanding which shifts at your factory are the most effective.
In this blog, we will explore the definition of the first two terms: Fact Data and Context Data (and save the third, Ontology Data, for later). Once all the terms are defined, then we’ll talk about how they relate to each other (and themselves), and in a future post what this means in the context of a digital transformation effort.
At the bottom of the data pyramid is the fact data. It describes things both physical and non-physical (people, virtual machines, quality control processes, etc.). Generally, this data comes from observing those “things” at a point in time and recording the observations. It could be a list of access logs from a firewall, a PI Tag and its associated measurements of a continuous variable (like the temperature of a pump), or the maintenance logs at a solar farm.
Fact data is what you’re storing in a data lake, data warehouse, HDFS store, or the like. It’s typically append-only - last week's temperature data for a pump doesn’t change, but tomorrow you will have more or new temperature data to include. This leads to interesting storage semantics and allows for highly compressed representations for cost savings.
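The append-only semantics described above can be sketched in a few lines of code. This is a minimal, hypothetical `FactLog` (the class and tag names are ours, not from any particular product): observations are only ever appended, never updated in place.

```python
import time

class FactLog:
    """A minimal sketch of an append-only store of timestamped observations."""

    def __init__(self):
        self._records = []  # history only grows; past facts are never rewritten

    def append(self, tag, value, timestamp=None):
        # New observations extend the log; last week's data stays untouched.
        self._records.append({
            "tag": tag,
            "value": value,
            "ts": timestamp if timestamp is not None else time.time(),
        })

    def scan(self, tag):
        """Return all recorded facts for a tag, in arrival order."""
        return [r for r in self._records if r["tag"] == tag]

log = FactLog()
log.append("PUMP_101.TEMP", 74.2, timestamp=1700000000)
log.append("PUMP_101.TEMP", 74.9, timestamp=1700000060)
print(len(log.scan("PUMP_101.TEMP")))  # → 2
```

Because records are immutable once written, a real implementation can sort and compress them aggressively, which is exactly where the cost savings come from.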
Fact data is BIG. According to IDC, there were 33ZB (that’s Zettabytes, or 10^21 bytes) of data in 2018, and by 2025, they predict an astonishing 175ZB of data across the world. To put a ZB into some perspective, it would take one billion 1TB hard drives to hold a single ZB. If a single ZB were stored on DVDs, it would take about 212 billion discs and create a stack roughly 255,000 km tall (about two-thirds of the distance between the earth and the moon). Wow.
Figure 2: The IDC projection for the size of the data across the world by 2025
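The back-of-the-envelope arithmetic above is easy to check, assuming single-layer 4.7GB DVDs that are 1.2 mm thick:

```python
ZB = 10**21  # bytes in a zettabyte
TB = 10**12  # bytes in a terabyte

# How many 1TB hard drives hold one zettabyte?
drives_per_zb = ZB // TB
print(drives_per_zb)  # → 1000000000 (one billion)

dvd_bytes = 4.7 * 10**9   # single-layer DVD capacity
dvd_thickness_m = 0.0012  # 1.2 mm per disc

dvds_per_zb = ZB / dvd_bytes
stack_km = dvds_per_zb * dvd_thickness_m / 1000
print(int(dvds_per_zb / 1e9))  # → 212 (billion DVDs)
print(round(stack_km))         # → 255319 (km, roughly 2/3 of the way to the moon)
```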
Fact data is so big that you can’t reasonably analyze it by just taking a look. You need to be able to find the right facts for the problem at hand. For most interesting use cases, you need facts from different sources, sometimes many different sources, and that poses its own set of challenges: not all facts share the same context, and when they do, that context frequently doesn’t line up cleanly. To make sense of it, you are going to need some more context!
Context is queen in the world of zettabytes. Context tells you which facts are connected to which other facts, where the facts reside, what their schema is, and any number of other attributes about the facts. Let’s walk through a simple example by way of explanation. Imagine we have a site that has a pump with fact data (e.g., voltage and amperage) being stored in a PI system, and a vibration sensing system storing the vibration data. The whole thing is tied together with an EAM system.
Figure 3: Simple example of pump data and vibration data and their connections
Again, this is a very simplistic example, but as you can see, there is important information in the connections. This is particularly evident in the vibration tags, each of which has a completely impenetrable tag name and the same unit of measure - the key is in how each tag is related to the equipment. Keeping track of the specific streams of data and where they live is only a small subset of the uses for context data. For example, the example above also includes the units of measure for each type, allowing consuming systems to transform the units appropriately (to kilovolts, perhaps), ensuring consistent units when computing against the fact data streams. You could also store format information for the data from a source system, information about authentication or communication protocols, or summary data about the quality of the underlying fact data streams.
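A context layer like the one just described can be sketched as a small registry. All of the tag names, system names, and helper functions here are hypothetical, invented for illustration - the point is only the shape: context maps impenetrable tag names to equipment and carries the unit information consumers need.

```python
# Hypothetical context registry: each fact stream's connections and attributes.
CONTEXT = {
    "SINUSOID_47XZ": {            # an impenetrable vibration tag name
        "equipment": "PUMP-101",
        "measurement": "vibration_x",
        "unit": "mm/s",
        "source_system": "bently_nevada",
    },
    "P101.VOLT": {
        "equipment": "PUMP-101",
        "measurement": "voltage",
        "unit": "V",
        "source_system": "pi",
    },
}

# Unit conversions a consuming system can apply for consistent computation.
CONVERSIONS = {("V", "kV"): lambda v: v / 1000}

def normalize(tag, value, target_unit):
    """Use a tag's context to convert its raw value to the target unit."""
    unit = CONTEXT[tag]["unit"]
    if unit == target_unit:
        return value
    return CONVERSIONS[(unit, target_unit)](value)

def tags_for_equipment(equipment):
    """The connection that matters: which fact streams belong to which asset."""
    return sorted(t for t, c in CONTEXT.items() if c["equipment"] == equipment)

print(tags_for_equipment("PUMP-101"))        # → ['P101.VOLT', 'SINUSOID_47XZ']
print(normalize("P101.VOLT", 4160.0, "kV"))  # → 4.16
```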
Another key observation is that the contextual model changes over time as the physical things that the context represents are changed. Imagine you replaced a pump with a new model from another vendor or added another piece of equipment. That means the context has changed and should be updated. Simple, right? It’s a little more complicated when you realize that just because the old version is no longer correct doesn’t mean it wasn’t correct before. For this reason, the contextual model is itself a time series.
In the context of a (near) real-time application, this isn’t much of an issue: you care about the context and associated facts as they exist right now. However, when you have a use case where the history is important, this becomes a critical issue that people tend to deal with by making simplifying assumptions. Most often that simplification takes the form of storing only the most recent version of the context - you know the truth now, but you don’t know the true context from last week, let alone last year. While that may work, it requires very careful attention to validate the simplifying assumptions for your application - if you get those wrong, your analysis is going to produce faulty insights. The chart below represents the state of the model over time (m0 through m4) and how the specific sensor data associated with the model changes as the contextual model evolves from m0 to m4.
Figure 4: Model state evolution
Figure 5: Further detail of the relevant facts within the contextual model
As you can see in this example, as the context changes over time, the relevant facts also change. To look at this system over time, the facts in question must also change, which means you need a mechanism to keep track of the model’s history and find the correct model for any given point in time. From the correct model, you can fetch the exact facts referenced by the model, and from there, you have the required facts and context to perform useful analytics.
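The mechanism described above - keeping the model’s history and finding the correct model for any point in time - can be sketched with a small versioned store. The class name, timestamps, and model contents are all illustrative assumptions:

```python
import bisect

class VersionedContext:
    """A sketch of contextual-model history: each change appends a new version."""

    def __init__(self):
        self._times = []   # version effective-from timestamps, ascending
        self._models = []  # the model in effect from the matching timestamp

    def update(self, timestamp, model):
        # Context changes are appended, never overwritten, preserving history.
        self._times.append(timestamp)
        self._models.append(model)

    def as_of(self, timestamp):
        """Return the model version that was in effect at the given time."""
        i = bisect.bisect_right(self._times, timestamp) - 1
        if i < 0:
            raise KeyError("no context model exists that early")
        return self._models[i]

ctx = VersionedContext()
ctx.update(100, {"PUMP-101": ["TAG_A"]})           # m0
ctx.update(200, {"PUMP-101": ["TAG_A", "TAG_B"]})  # m1: a sensor was added
print(ctx.as_of(150))  # → {'PUMP-101': ['TAG_A']}
print(ctx.as_of(250))  # → {'PUMP-101': ['TAG_A', 'TAG_B']}
```

With the correct model in hand for a given time, you know exactly which fact streams to fetch for a historical analysis.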
The final key to this context puzzle is the proliferation of systems of record. In the example above, we have two (the PI system and the Bently Nevada vibration system), but that is almost never how the real world works. In a large industrial setting, you’ll have many plants, each with several systems of record, and several more at the enterprise level. Even if these are all synchronized into a data lake or some other centralized data store, the data isn’t in one format, and you still need context to join the facts together in a useful way. Consolidating everything into a single system of record isn’t practical either - we’re always going to have multiple systems of record, and we’ll always have uses for the data outside those systems, so we’re going to need a context layer for the foreseeable future to deal with the panoply of data sources and use cases.
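To make the “join the facts together” step concrete, here is a sketch of facts arriving from two systems of record in different shapes, unified through the context layer. The record formats, tags, and system names are invented for illustration:

```python
# Facts from two hypothetical systems of record, each in its own native shape.
pi_facts = [("P101.VOLT", 1700000000, 4160.0)]                  # (tag, ts, value)
bn_facts = [{"point": "SINUSOID_47XZ", "t": 1700000000, "v": 2.4}]

# The context layer maps each system's native identifiers to shared equipment.
context = {
    ("pi", "P101.VOLT"): ("PUMP-101", "voltage"),
    ("bently_nevada", "SINUSOID_47XZ"): ("PUMP-101", "vibration_x"),
}

def to_common(system, tag, ts, value):
    """Translate a source-native record into one common row shape."""
    equipment, measurement = context[(system, tag)]
    return {"equipment": equipment, "measurement": measurement,
            "ts": ts, "value": value}

rows = [to_common("pi", t, ts, v) for t, ts, v in pi_facts]
rows += [to_common("bently_nevada", f["point"], f["t"], f["v"]) for f in bn_facts]

# Both streams now join cleanly on equipment, regardless of source format.
print(sorted(r["measurement"] for r in rows if r["equipment"] == "PUMP-101"))
# → ['vibration_x', 'voltage']
```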
Is your head spinning yet? This is just a simple example of what most industrial companies face every day, and it’s not even the whole picture. As you can see, metadata is a slippery concept: ask a data warehousing expert and they’ll tell you it’s the DDL (or data definition language) defining table structure, but it’s much more than that - it’s any data about other data. That blanket definition doesn’t capture the distinct meanings of context and facts, though, which is why we need all of these terms rather than just “metadata”. In the next blog, we will pick up defining ontology, add its complexities to the metadata puzzle, and find out how it all comes together. To be continued in Just What is Ontology Anyway?
Register to view a live demo or sign up for a free trial to use the software for 30 days. You can also purchase a Personal License from the AWS Marketplace or Azure Marketplace.
Questions? Please contact us.