- By Jim Redman
- March 10, 2023
- Feature
Summary
Unstructured data comes in many forms and can give engineers useful information not offered by structured data. This feature originally appeared in the IIoT & Industry 4.0 edition of Automation 2023.

The main goal of Industry 4.0 is to increase efficiency, productivity and flexibility in manufacturing by using data and advanced technologies to optimize production processes. Increasingly, this data includes “unstructured” data: data without a predefined format, such as spreadsheets, waveforms, log files, images and videos. This data can provide engineers with context and insights that are not available from traditional, structured manufacturing data alone.
Data lakes provide a powerful way to process large amounts of structured and unstructured data. Data lakes have changed from a centralized, monolithic architecture to a distributed approach that integrates other systems, such as ERP and MES systems, and even data from shared drives, such as spreadsheets and log files. This distributed data lake uses “data virtualization” or “data fusion” to provide a simple, location-independent solution for querying data from all these sources.
Ease of access to the data lake and the inclusiveness of its data have become particularly important for AI and machine learning applications. Available tools make building machine learning models simple, but the barrier continues to be incomplete or inaccurate data sets. While we are still mesmerized by the science-fiction-like capabilities of AI-generated images and large language models such as ChatGPT, failures of these models are usually related to a lack of data, not the underlying algorithm. A data lake can provide both high-quality and more complete data, leading to better performance and more accurate models.
Structured vs. unstructured data
Structured data is data that is organized in a specific format, such as a table or a schema, and can be easily searched, indexed and analyzed by a computer. Timestamped production data is a good example of structured data. Structured data has a predictable format, which makes it easy to process using traditional data processing tools like SQL.
Unstructured data, on the other hand, is data that does not have a specific format or structure and is not easily searched, indexed or analyzed by a computer. Examples of unstructured data include text documents, waveforms, log files, spreadsheets, images and videos. Unstructured data is characterized by a high degree of variability and a lack of predictability, which makes it more difficult to process using traditional data processing tools.
Data can also be semi-structured, meaning it has some structure but not enough to be considered structured data. Examples include JSON and XML, which impose some structure but can also contain unstructured content.
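For example, the hypothetical machine-event record below (field names invented for illustration) combines structured fields, such as a timestamp and machine ID, with free-form operator notes:

```python
import json

# Hypothetical machine-event record: the timestamp, machine ID and state are
# structured fields; "operator_notes" is free-form, unstructured text.
event = json.loads("""
{
  "machine_id": "press-07",
  "timestamp": "2023-03-10T08:15:00Z",
  "state": "FAULT",
  "operator_notes": "Grinding noise near the spindle just before the alarm."
}
""")

print(event["machine_id"], event["state"])  # easy to index and filter
print(event["operator_notes"])              # needs interpretation or text analytics
```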
Many of the available “big data” tools can be used to build a data lake; however, GraphQL has emerged as a powerful tool for data fusion, allowing precise and actionable queries over heterogeneous data sources. GraphQL is a flexible and powerful query language that provides a unified and intuitive way to access and retrieve data.
Data warehouses vs. data lakes
A data lake and a data warehouse are similar in that both are used to store and manage large amounts of data, but there are some key differences in architecture and usage. In manufacturing, we’ve been building “data warehouses.” In this architecture, data is collected from tools, equipment and many other assets, and stored in a SQL database (Figure 1).
A data warehouse is a centralized repository for structured data that is typically used for reporting and data analysis. Data in a data warehouse is organized, cleaned and transformed before it is loaded into the warehouse, making it easy to query and analyze.
This design has served us adequately in manufacturing. Much of the data we’re collecting lends itself readily to storing in a database table. We can collect groups of data, such as registers from a PLC, and store them with a timestamp. Each record is a row of data. The data is classic ABC—Analog, Binary and Character—manufacturing data and a good fit for a SQL database.
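A minimal sketch of this pattern, using an in-memory SQLite table with invented table and column names, shows how a group of PLC registers becomes a timestamped row:

```python
import sqlite3

# Minimal sketch of classic "ABC" manufacturing data stored as timestamped rows.
# The table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plc_readings (
        ts        TEXT,     -- timestamp of the sample
        line_id   TEXT,     -- character data
        temp_c    REAL,     -- analog value
        motor_on  INTEGER   -- binary value
    )
""")
conn.execute(
    "INSERT INTO plc_readings VALUES (?, ?, ?, ?)",
    ("2023-03-10T08:15:00Z", "LINE-1", 74.2, 1),
)

# Structured data like this is easy to query with plain SQL.
for row in conn.execute("SELECT ts, temp_c FROM plc_readings WHERE motor_on = 1"):
    print(row)
```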
As we come to report on or analyze this data, it quickly becomes clear that our dataset is incomplete. In almost all organizations there are IT-related assets, MES, ERP and other systems from which we also need data. Furthermore, many organizations have different databases and multiple data warehouses, each with additional information. These silos of information are not readily integrated; they were often created for different functions and are managed by different organizational units.
Increasingly, diagnosing problems relies on data other than the simple ABC data. For example, the availability of low-cost cameras means that we can monitor equipment and, ideally, grab an image or video of the tool as an error occurs. We would like to link these images with the manufacturing data, but a traditional data warehouse handles images poorly, if at all.
Solving a problem or preparing a report or a presentation therefore relies on us having access to all of these sources of information, including data that is stored in SQL databases and a variety of other systems, on our local hard drive, or company-wide shared drives. Data such as images or waveforms may still be on the equipment that we access with FTP or similar applications. The format of this data may not be well defined. Information is often in spreadsheets, log files, web pages, drawings, custom formats, PDFs, or other documents. To use it, we must determine the nature and significance of each file.
At each stage, we apply our experience and knowledge to learn where the information is in the organization and the significance of each data item. To build automated applications, such as AI models, would require each application to replicate this. It would need knowledge of all locations, data formats, and the significance of the data. Without organized management, this is clearly impractical. Furthermore, adding data would require modifications to many applications, and any changes to the structures would break the entire workflow. A data lake can overcome the problems of a data warehouse and provide a data management solution for all data, both structured manufacturing data and unstructured data.
The initial approach was to build a monolithic, centralized data lake to be the repository of all data in an organization. This approach uses “big data” tools to build a highly scalable yet cost-effective centralized store of structured and unstructured data in its raw form. Ideally, this software platform would provide a single ecosystem for big data analytics and data science initiatives.
It quickly becomes clear that this goal of a single location for all data, even if it were attainable, is undesirable. There is little purpose in duplicating data that is already part of a working system, such as a data warehouse, MES or ERP, into a data lake. Doing so is obviously wasteful, both in storage and data-collection resources and in data management effort. Synchronizing the duplicated data is also technically challenging.
Few organizations of any size have a single data warehouse for logistical, organizational, and many other reasons. Different departments and divisions build stores for different use cases, or IT systems are adopted as part of an acquisition, so even manufacturing data is stored in different databases. A universal data lake would face the same issues.
What we need is not to put all this data into a centralized system, but to be able to query, view and extract this data as if it were in a single system. Rather than try to copy all of our existing sources of data into a central system, we need to “wrap-and-embrace”—integrate these platforms so that they can be searched as a single platform.
With this concept, the data lake moves from being a centralized system to a heterogeneous and distributed set of data platforms. These platforms are integrated by “data virtualization,” allowing users and applications to query the data without caring about where or how the data is stored.
All of our existing data platforms, the data warehouses, MES, ERP, IT systems, and shared drives become a part of the data lake not by moving the data from these systems to a central system but by providing a virtualization layer. The platforms are integrated to present users, data scientists and application developers with a fusion of all of the data—a distributed, fused data lake.
Data virtualization
In a data warehouse, queries are straightforward. Data warehouses are typically based on a SQL database, so queries can be crafted in SQL and the schema of the data is well defined.
It’s common to abstract SQL queries with a “REST” API. REST (Representational State Transfer) is the most widely used architectural style for building web services, and it’s been around for quite a while. RESTful APIs are based on the HTTP protocol and use standard HTTP methods like GET, POST, PUT and DELETE to interact with resources. This is especially common when the resources are located remotely, for example in the cloud.
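As a small sketch, with a hypothetical endpoint and response fields, a client might pull readings from the warehouse over REST like this:

```python
import requests

# Hypothetical REST endpoint that wraps a SQL query over the data warehouse.
# The URL, parameters and JSON fields are assumptions for this sketch.
resp = requests.get(
    "https://warehouse.example.com/api/machines/press-07/readings",
    params={"since": "2023-03-10T00:00:00Z"},
    timeout=10,
)
resp.raise_for_status()

for reading in resp.json():
    print(reading["ts"], reading["temp_c"])
```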
Other technologies, such as SOAP, MQTT and many of the IoT protocols, can be used in place of a REST API, but as we move to a data lake, the challenge in all these cases is defining the payload of the protocol: how do you ask for data, and what is the significance of the data returned?
Typically, once these technologies are implemented, the data format is fixed. With REST or other solutions, there is very little flexibility in the data request and response. In particular, the main issues with REST are:
- Over-fetching: The client often receives more data than is needed, which results in increased network traffic and slower response times.
- Under-fetching: The client often has to make multiple requests to different endpoints to fetch all the data needed for a given view, which results in increased complexity and slower response times.
These problems worsen as the complexity of the data increases.
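To make the contrast concrete, the sketch below uses hypothetical REST endpoints and field names to show both problems: several round trips to build one view, each response carrying more fields than the view needs.

```python
import requests

BASE = "https://warehouse.example.com/api"  # hypothetical REST API

# Under-fetching: building one view requires several round trips.
machine = requests.get(f"{BASE}/machines/press-07", timeout=10).json()
readings = requests.get(f"{BASE}/machines/press-07/readings", timeout=10).json()
alarms = requests.get(f"{BASE}/machines/press-07/alarms", timeout=10).json()

# Over-fetching: each response carries every field the endpoint defines, even
# though this view only needs a name, the latest temperature and the most
# recent alarm text.
view = {
    "name": machine["name"],
    "latest_temp": readings[-1]["temp_c"],
    "last_alarm": alarms[-1]["message"],
}
print(view)
```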
Data virtualization with GraphQL
In 2012, to address the limitations and inefficiencies of its existing REST APIs, Facebook (now Meta) developed GraphQL. GraphQL was open-sourced in 2015 and has since been widely adopted by many organizations and companies, becoming a widely used standard for developing APIs. GraphQL is a query language and a runtime for building and executing client-server queries.
In GraphQL, the client makes a request to the server specifying the fields it wants to retrieve. The server responds with the requested data, structured in the same way as the request. This allows the client to retrieve exactly the data it needs and no more, unlike a REST API.
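A minimal sketch, assuming a hypothetical GraphQL endpoint and field names, shows how a single request retrieves only the required fields and how the response mirrors the shape of the query:

```python
import requests

# One GraphQL request retrieves exactly the fields the view needs.
# The endpoint and field names are hypothetical.
query = """
{
  machine(id: "press-07") {
    name
    latestReading { tempC }
    lastAlarm { message }
  }
}
"""
resp = requests.post(
    "https://datalake.example.com/graphql",
    json={"query": query},
    timeout=10,
)
print(resp.json()["data"]["machine"])
# The response mirrors the shape of the query, for example:
# {"name": "...", "latestReading": {"tempC": 74.2}, "lastAlarm": {"message": "..."}}
```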
A GraphQL server presents a schema to the client. The schema defines the structure of the data that can be queried and serves as a contract between client and server. It can be used to validate queries and ensure that the client only requests data available on the server. Adding GraphQL as our virtualization layer resolves the major challenges of the distributed, fused data lake.
The GraphQL adapters, which are implemented as GraphQL servers, solve the problem of attaching significance to data of all types. Each adapter provides the schema for the custom data it wraps, without regard to any other system. This makes for relatively small, self-contained and self-describing components (Figure 2). For standard data sources, such as SQL databases, these adapters already exist. GraphQL implementations are available for a wide range of languages, so adapters can be created programmatically or in a no-code/low-code environment. The simplicity of defining the adapters means that they can be created by the data owners, who know the nature and significance of the data best and can accurately describe both unstructured and structured data.
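As an illustration only, the sketch below uses the open-source Ariadne library for Python (one of many GraphQL server implementations) to build a tiny adapter around invented log-file data; the schema and field names are assumptions, not part of any particular product:

```python
from ariadne import QueryType, gql, make_executable_schema, graphql_sync

# Schema for a hypothetical adapter that wraps equipment log files.
type_defs = gql("""
    type LogEntry {
        ts: String!
        level: String!
        message: String!
    }

    type Query {
        logEntries(machineId: String!, level: String): [LogEntry!]!
    }
""")

query = QueryType()

@query.field("logEntries")
def resolve_log_entries(_, info, machineId, level=None):
    # A real adapter would locate and parse the log files for machineId on a
    # shared drive; canned data keeps this sketch self-contained.
    entries = [
        {"ts": "2023-03-10T08:15:02Z", "level": "ERROR", "message": "Spindle fault"},
    ]
    return [e for e in entries if level is None or e["level"] == level]

schema = make_executable_schema(type_defs, query)

# Execute a query directly against the adapter.
ok, result = graphql_sync(
    schema,
    {"query": '{ logEntries(machineId: "press-07", level: "ERROR") { ts message } }'},
)
print(result["data"]["logEntries"])
```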
These adapters remove the need to duplicate data by providing access to the data in place and the ability to query it there. The adapter also renders the location of the data irrelevant: it could be on premises, in a remote location, or in the cloud. For further transparency, GraphQL adapters can be aggregated, allowing complex queries to be executed against one server with the results provided by multiple adapters.
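One simple way to picture this aggregation, again with hypothetical endpoints and field names, is a parent GraphQL server whose resolvers delegate parts of a query to downstream adapters:

```python
import requests
from ariadne import QueryType, gql, make_executable_schema

# Hypothetical downstream adapters, each a GraphQL server in its own right.
WAREHOUSE_URL = "https://warehouse.example.com/graphql"
LOGS_URL = "https://logs.example.com/graphql"

type_defs = gql("""
    type Query {
        latestTempC(machineId: String!): Float
        lastErrorMessage(machineId: String!): String
    }
""")

query = QueryType()

def fetch(url, gql_query, variables):
    # Forward a sub-query to a downstream adapter and return its data.
    resp = requests.post(url, json={"query": gql_query, "variables": variables}, timeout=10)
    return resp.json()["data"]

@query.field("latestTempC")
def resolve_latest_temp(_, info, machineId):
    data = fetch(WAREHOUSE_URL,
                 "query($id: String!) { latestReading(machineId: $id) { tempC } }",
                 {"id": machineId})
    return data["latestReading"]["tempC"]

@query.field("lastErrorMessage")
def resolve_last_error(_, info, machineId):
    data = fetch(LOGS_URL,
                 'query($id: String!) { logEntries(machineId: $id, level: "ERROR") { message } }',
                 {"id": machineId})
    entries = data["logEntries"]
    return entries[-1]["message"] if entries else None

# Served like any other adapter, this schema answers queries whose results
# actually come from two different systems.
schema = make_executable_schema(type_defs, query)
```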
The data lake is also hugely scalable. In particular, adding more data requires only the creation of an adapter to allow any client to query the new information.
With the increasing amount of data generated by manufacturing systems, the importance of data lakes and advanced technologies such as GraphQL, AI and machine learning will only continue to grow. Companies that are able to effectively leverage these technologies will have a significant competitive advantage in the industry.
For more information, visit ErgoTech Systems.
About The Author
Jim Redman, as president of ErgoTech Systems, Inc., was delivering what has become “IIoT” systems way back in 1998. ErgoTech’s MIStudio suite reflects his holistic vision to provide a single tool for integration and visualization from sensor to AI, and from tiny IIoT to worldwide cloud. Jim can be reached at [email protected].