The Data Lineage Challenge and What to Do About It

When your IT colleagues talk about data lineage, they are trying to understand the upstream and downstream connections for a given dataset and who is impacted by that data. They want to understand the origin of the data and the transformations it has undergone to arrive at its target destination, like an enterprise data lake. Unfortunately, this can be a difficult task for those of us who work with industrial data.

Factory floor data is diverse. Most sites produce telemetry data from machines and sensors, as well as transactional, time-series, historical and file data. With these diverse data streams, manufacturers face challenges not only in extracting meaningful context across the production line but also in translating heterogeneous data from different sources and factories into actionable insights. If these challenges are not resolved, poor data quality leaves manufacturers vulnerable to inaccurate performance assessments, problems along the production line and an inability to proactively prevent machine failures or inefficiencies.

So how do we solve this? The first step is developing a comprehensive data strategy that begins with manufacturers cleaning up their data so that it is usable. This overhaul of the data management process requires a stronger understanding of data lineage—i.e., where the data is coming from, where it's going and how it's being used over time.
 

Connecting data lineage to data quality

Lineage and quality are intertwined concepts. Data lineage is crucial to helping manufacturers address important questions about poor data quality, like:

  • If the data received is bad, where did it come from?
  • Why was it bad and what went wrong?
  • How can I get real-time notifications when the quality of my data goes bad instead of being told weeks later by a business unit who needs the data for a project or regulatory requirement?

With the proper data lineage and observability tools, manufacturers can answer these questions and properly maintain data quality throughout the production process.

Of course, new AI solutions on the factory floor have made high-quality data more important than ever. Today, AI chatbots and agents remain imperfect at deterministic tasks that require a "yes" or "no" answer. While AI can assist in detecting data quality issues, manufacturers cannot feed "garbage" data into AI tools without risking hallucinations and unpredictable results. AI assistants and agents require high-quality, contextualized data that has been intentionally curated in order to complete their tasks accurately.
 

The importance of data context

If your IT colleague simply received a datapoint called "temperature 33.4" from a plant in Atlanta, Georgia while sitting in Seattle, Washington, they likely would have no context on what the datapoint refers to, what machine captured it, when it was collected or whether this temperature was within an acceptable range. And the reality is that there is not just one datapoint to contextualize: there are terabytes of them.

Manufacturers must clean up their data at the edge, as close to the source as possible and add the proper context needed for data lineage so they can avoid these gaps in information and properly utilize their data across their entire production chain.
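As a sketch of what contextualization at the edge might look like, a gateway can wrap each raw reading with the metadata a downstream analyst would need. The tag names, asset fields and acceptable range below are hypothetical, not part of any specific product:

```python
# Hypothetical sketch: enriching a raw sensor reading at the edge.
# Tag names, asset metadata and the acceptable range are illustrative only.

RAW_READING = {"tag": "temperature", "value": 33.4}

# Context the edge gateway knows but a remote data lake user would not.
ASSET_CONTEXT = {
    "site": "Atlanta",
    "machine_id": "press-07",
    "unit": "degC",
    "acceptable_range": (20.0, 45.0),
}

def contextualize(reading: dict, context: dict) -> dict:
    """Merge a raw reading with asset context and flag out-of-range values."""
    low, high = context["acceptable_range"]
    return {
        **reading,
        "site": context["site"],
        "machine_id": context["machine_id"],
        "unit": context["unit"],
        "in_range": low <= reading["value"] <= high,
    }

payload = contextualize(RAW_READING, ASSET_CONTEXT)
print(payload)  # the bare number 33.4 now carries its own context
```

The key design point is that the enrichment happens before the data leaves the plant, while the people and systems that hold the context are still nearby.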

For most Industry 4.0 use cases, the context for a datapoint often lives in another system, meaning data must be collected from diverse sources in order for it to be properly contextualized. For example, predictive asset maintenance use cases may require you to collect raw machine data from one system, work order and planning information from another system and operator information from yet another system. Because this data comes in different formats and is largely made available through different interfaces, merging it all together into one cohesive view of the factory floor isn't easy.
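A minimal sketch of that merge step, assuming each source system keys its records by a shared asset ID (the system names and field names here are hypothetical):

```python
# Hypothetical sketch: joining records from three source systems on asset_id.
machine_data = [{"asset_id": "press-07", "vibration_mm_s": 4.2}]
work_orders  = [{"asset_id": "press-07", "open_order": "WO-1138"}]
operators    = [{"asset_id": "press-07", "operator": "J. Smith"}]

def merge_on_asset(*sources):
    """Fold heterogeneous records into one cohesive view per asset."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["asset_id"], {}).update(record)
    return merged

view = merge_on_asset(machine_data, work_orders, operators)
print(view["press-07"])
```

In practice the hard part is not the join itself but reconciling the different formats and interfaces each system exposes, which is exactly the gap Industrial DataOps tooling aims to fill.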

Traditionally, manufacturers have taken the approach of "vacuuming up" all raw data into a data lake and transforming it as needed to fit their purposes. However, this approach often fails because raw manufacturing data is incredibly heterogeneous and lacks the context needed for proper data lineage. Moreover, there is a persona problem: The data lake user does not have the domain knowledge needed to add context to the raw data.
 
These examples underpin why merging and contextualizing industrial data must be done at the edge by the domain expert.
 

Data lineage in practice

Revisiting the temperature reading from the plant in Atlanta, the manufacturer's data lineage model should provide context from the moment that specific datapoint was gathered: what machine it came from, which factory that machine is located in, what the machine was producing and who was running it. With the right context, the analyst in Seattle can interpret the temperature reading with confidence and train a machine learning model to predict when that asset may need maintenance. This is especially important when data looks inaccurate or has gaps, as manufacturers will need to trace it back to its origin and understand the steps taken to gather it.
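One way to picture the lineage behind that single reading is as an ordered chain of steps the value passed through, which can be walked back to its origin when something looks wrong. This is only a sketch; real lineage and observability tools record far richer metadata:

```python
# Hypothetical sketch: a datapoint carrying its own lineage chain.
datapoint = {
    "value": 33.4,
    "lineage": [
        {"step": "captured",       "system": "PLC on press-07, Atlanta plant"},
        {"step": "contextualized", "system": "edge gateway"},
        {"step": "landed",         "system": "enterprise data lake"},
    ],
}

def origin(point: dict) -> str:
    """Trace a suspicious value back to where it was first gathered."""
    return point["lineage"][0]["system"]

print(origin(datapoint))  # the analyst sees exactly where the value began
```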
 
While data lineage tools are mostly used once those diverse streams of data reach a data lake, we are making progress with data lineage on the factory floor. Manufacturers are beginning to embrace Industrial DataOps solutions and tools like OpenTelemetry, an observability standard many IT systems use to monitor and manage data pipelines, to add as much context to their data as they can before it leaves the plant. This approach requires leveraging the people at the plant who best know the process and making sure data is properly tracked throughout production.
 
The consequences of data that lacks context can be disruptive to processes on the factory floor. But by overhauling their data infrastructure from pools of raw data to an active network of machines and sensors that clearly tells the story of where data was sourced, manufacturers can fortify their factories and evolve their operations for the AI age.

About The Author


Aron Semle is chief technology officer at HighByte. Unlock the value of your industrial data with HighByte Intelligence Hub, an edge-native, no-code solution that securely collects, models and delivers payloads to target applications across your enterprise. 

