AI Inference at the Rugged Edge: Meeting Performance with M.2 Accelerators

Balancing power, performance, thermals and footprint is the next hurdle in data-driven applications. As the number of Internet of Things (IoT) and industrial IoT (IIoT) devices continues to increase, so does the volume and velocity of the data they generate. This trend creates a wealth of new opportunities for purpose-built computing solutions, and it also demands a different approach to hardware design, one that enables optimized performance.

Data drives business innovation and, most importantly, the ability to deliver cognitive machine intelligence. On the factory floor, powering smart kiosks or advanced telematics, fueling surveillance and passenger services in infrastructure facilities like airports and train stations: data is everywhere, and it adds value when it can be revealed, captured, analyzed and applied in real time. Yet for many of these applications operating in rigorous industrial environments, running small automated or artificial intelligence (AI) tasks from a data center is too inefficient to add true value. In this traditional centralized compute structure, power consumption and costs are too high because of the excessive, but necessary, use of compute, storage and bandwidth resources. Performance trade-offs deepen the sacrifice, with factors such as high latency and insufficient data privacy.

For system designers, this means that yesterday’s performance acceleration strategies may no longer fit the bill. While CPU/GPU-based designs have helped manage the slowing of Moore’s Law, these processor architectures are now having difficulty keeping in step with the real-time data requirements inherent to automation and inference applications. This is particularly true in more rigorous non-data center scenarios. It’s not just the data-intensive nature of automation that is causing change; it’s where it is being implemented (Figure 1). As applications move out of the data center and into the world, more industrial and nontraditional computing settings are seeking greater competitive value from data in real time. This can be defined as the rugged edge, and it is here that performance acceleration requires a new path.

Figure 1: Gartner predicts that by 2025, 75% of enterprise data will be processed at the edge.

Combined with the mounting challenge to meet price-performance-power demands, it’s more critical than ever that performance acceleration consider compute, storage and connectivity. All these factors are necessary to effectively consolidate workloads close to the point of data generation, even in rugged settings where environmental challenges are detrimental to system performance.

Edge computing hardware is being deployed to cope with this increasing amount of data and to alleviate the ensuing burdens placed on the cloud and in data centers. Data-intensive workloads like AI inference and deep learning models are moving beyond the data center and into factories and other industrial computing environments. In turn, designers and developers are recognizing a shift of performance acceleration closer to data sources such as IoT sensors. The trend is pushing edge hardware further, meeting the need to interact with AI workloads on demand. Based on data growth coupled with the complexities of edge computing environments, today’s AI computing framework is moving from general CPU and GPU options to more specialized accelerators, such as smaller and more power-efficient acceleration modules built on the M.2 standard.

This is where M.2 form-factor accelerators come into play for eliminating performance barriers in data-intensive applications. A powerful design option, M.2 accelerators offer system architects domain-specific value to match the exact requirements of AI workloads. In contrast to a comparable system using CPU/GPU technologies, an M.2-based system can manage inference models significantly faster and far more efficiently. These increases are driving innovative system design perfect for the rugged edge, where more systems are deployed in challenging, nontraditional scenarios, and where purpose-built systems offer immense opportunity. Here, there is a clear differentiation between a general-purpose embedded computer and one that’s designed to handle inferencing algorithms by tapping into more modern acceleration options like M.2 domain-specific acceleration modules.

Domain-specific architectures (DSAs) handle only a few tasks, but they do so extremely well. This is an important shift for system developers and integrators, as the ultimate goal is to improve the cost versus performance ratio when comparing all accelerators, including CPUs, GPUs and now M.2 options.

Implemented as M.2 accelerators, DSAs run deep neural networks 15 to 30 times faster (with 30 to 80 times better energy efficiency) than counterpart networks relying on CPU and GPU technologies. While general-purpose GPUs are commonly used to provide enormous processing power for advanced AI algorithms, they are not optimized for edge deployments in remote and unstable environments. Drawbacks of size, power consumption and heat management generate additional operating costs on top of the upfront cost of the GPU itself. Specialized accelerators such as TPUs from Google and M.2 acceleration modules are new solutions that are compact, power efficient and purpose-built for driving machine learning algorithms at the edge with incredible performance.

Diving into M.2 and domain-specific architectures

Accelerators deliver the considerable data processing required, filling the gaps caused by the deceleration of Moore’s Law, which for decades was a driving force in the electronics industry. This long-established principle asserts that the number of transistors on a chip will double every 18 to 24 months. When it comes to AI, however, industry experts are quick to point to signs of wear in Moore’s Law. Silicon evolution in and of itself cannot support AI algorithms and the processing performance they require. To balance performance, cost and energy demands, a new approach must feature domain-specific architectures (DSAs) that are far more specialized. Customized to execute a meticulously defined workload, DSAs provide a fundamental tenet for ensuring performance that facilitates deep learning training and deep learning inference.

The M.2 interface: a compact, versatile next generation option

Figure 2: M.2 Intel Optane Memory: Intel’s speed-boosting cache storage in an M.2 format, developed to accelerate cache for another drive to enable high-speed computing. Source: Intel
Developed as the next generation form factor (NGFF), the M.2 interface offers flexibility and powerful performance. M.2 supports multiple signal interfaces such as PCI Express (PCIe 3.0 and 4.0), Serial ATA (SATA 3.0) and USB 3.0. This range of bus interfaces makes M.2 expansion slots highly versatile for different storage protocols, performance accelerators, wireless connectivity and input/output (I/O) expansion modules. For example, M.2 expansion slots can be used to add wired and wireless capabilities or a range of M.2 SSDs with different sizes and specifications.

Figure 3: M.2 VPU: Intel’s Movidius VPU (vision processing unit), developed to enhance machine learning and inferencing for edge computer vision that requires robust and compact technologies. Source: AAEON
Besides connectivity and storage expansion modules, performance accelerators (Figures 2-5) have quickly adopted the M.2 form factor to benefit from its compact and powerful interface. These performance accelerators include memory accelerators, AI accelerators, deep learning accelerators, inferencing accelerators and more. These new specialized processors dedicated to AI workloads provide an improved power-to-performance ratio. This is demonstrated by domain-specific workloads handled by M.2 accelerators versus heterogeneous compute SoCs such as CPU and GPU resources.

The accompanying figures show a few of the top M.2 performance accelerators that are available today.

Throughput matters: understanding benchmarks for real-world AI applications

Figure 4: M.2 TPU: Tensor Processing Unit, developed by Google to accelerate large and complex neural network workloads. A powerful and energy-efficient AI accelerator in a compact M.2 form factor. Source: Coral.AI
Even the metrics by which industry experts measure compute performance are changing to accommodate the data-rich nature of AI applications. TOPS, or tera (trillions of) operations per second, is a measure of the maximum possible throughput rather than a measure of actual throughput. TOPS identifies the number of hardware-implemented computation elements multiplied by their clock speed.

Figure 5: M.2 Hailo-8: AI Acceleration module, a best-in-class inference processor packaged in a module for AI applications; offers 26 tera operations per second and compatibility with the NGFF M.2 form factor in M, B+M, and A+E keys.
While important, TOPS is essentially a measure of what is possible if all the stars align in a given application; that is, steady data input, clean and consistent power sources, no memory limitations and perfect synchronization between hardware and AI software algorithms. As a theoretical measurement, TOPS also does not offer any consideration for other tasks the hardware may need to perform. Engineers focused on silicon implementation may find specific value in TOPS data, but software and hardware systems engineers may find that it does not clearly indicate the true, available performance for their real-world application.

Throughput, not TOPS, is the more precise, real-world measurement. Throughput references the amount of data that can be processed in a defined time period, for example frames per second (FPS) in vision processing terms, or the number of inferences in a deep learning edge application. Inferences or FPS per watt, as related to a specific neural network task or application, is not only a more precise way of evaluating and comparing hardware, but also a more clearly understood, real-world metric. On this landscape, both ResNet-50 and YOLOv3 have emerged as leading options for AI performance evaluation, as well as for use as a backbone in the development of new neural models.
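The distinction between the two metrics can be sketched in a few lines of arithmetic. All figures below are hypothetical, chosen only to illustrate the calculation, not taken from any specific accelerator's datasheet:

```python
# Hypothetical accelerator: 9,216 multiply-accumulate (MAC) units
# clocked at 1.4 GHz, measured running an image model at 840 frames
# per second while drawing 6 W.
mac_units = 9216
clock_hz = 1.4e9
measured_fps = 840.0
power_watts = 6.0

# Peak (theoretical) TOPS: each MAC counts as two operations
# (one multiply plus one add) per clock cycle.
peak_tops = mac_units * 2 * clock_hz / 1e12

# Real-world efficiency metric: frames (inferences) per second per watt.
fps_per_watt = measured_fps / power_watts

print(f"Peak throughput: {peak_tops:.1f} TOPS")
print(f"Measured efficiency: {fps_per_watt:.0f} FPS/W")
```

The peak figure depends only on silicon and clock speed; the FPS/W figure depends on the actual model, data pipeline and power draw, which is why it maps more directly to deployed performance.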

ResNet-50, a pre-trained deep learning model
To solve computer vision challenges, machine learning developers working with convolutional neural networks (CNNs) stack more layers onto their models. Beyond a certain depth, however, accuracy saturates and then degrades on both the training and test data. ResNet-50 tackles this degradation problem. Using residual blocks, or “skip connections,” that simplify the function each layer has to learn, ResNet-50 improves deep neural network efficiency while reducing errors.
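The skip-connection idea can be shown in a minimal NumPy sketch (a toy fully connected version for illustration; real ResNet blocks use convolutions, batch normalization and learned weights):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A minimal residual ("skip connection") block.

    The block outputs f(x) + x: the stacked layers only have to learn
    the residual correction to the identity mapping, which is what
    keeps very deep networks trainable.
    """
    out = relu(x @ w1)    # first stacked layer
    out = out @ w2        # second stacked layer
    return relu(out + x)  # skip connection adds the input back in

# Toy usage: a 4-feature input through one residual block.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1 = rng.standard_normal((4, 4))
w2 = rng.standard_normal((4, 4))
y = residual_block(x, w1, w2)
print(y.shape)  # (4,)
```

Note that if the stacked layers contribute nothing (zero weights), the block simply passes the input through, which is exactly the identity behavior that makes adding depth safe.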

YOLOv3, a real-time object detection algorithm
As a CNN, YOLOv3 (you only look once, version 3) identifies specific objects in real time, for example in videos, live feeds or images. YOLO’s classifier-based system treats input images as structured arrays of data; its goal is to recognize patterns and sort objects into defined classes with similar characteristics. Only objects belonging to those classes are sorted; everything else is ignored unless the system is programmed to attend to it. The algorithm allows the model to view the entire image during testing, ensuring its predictions are informed by a more global image context. A live traffic feed provides a good example: here, YOLO can detect various types of vehicles, examining high-scoring regions and identifying similarities with certain predefined classes.
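The post-processing step described above, keeping only high-scoring regions and discarding overlapping duplicates, can be sketched in plain Python. This is an illustrative simplification of the confidence filtering and non-maximum suppression used by YOLO-style detectors, not YOLOv3's actual implementation; the detections and thresholds are made up:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_detections(detections, conf_thresh=0.5, iou_thresh=0.45):
    """Keep high-scoring boxes, suppressing overlapping same-class duplicates."""
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if det["score"] < conf_thresh:
            continue  # below confidence threshold: ignored
        if all(iou(det["box"], k["box"]) < iou_thresh
               for k in kept if k["label"] == det["label"]):
            kept.append(det)
    return kept

# Toy traffic-feed example: two overlapping "car" boxes and a
# low-confidence "bus" detection.
dets = [
    {"label": "car", "score": 0.92, "box": (10, 10, 50, 40)},
    {"label": "car", "score": 0.81, "box": (12, 11, 52, 42)},
    {"label": "bus", "score": 0.30, "box": (60, 10, 90, 40)},
]
print([d["label"] for d in filter_detections(dets)])  # ['car']
```

The duplicate car box is suppressed because it overlaps the higher-scoring one, and the bus is dropped for low confidence, leaving a single clean detection per real object.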

Looking ahead: unlocking AI with real-time data performance in more environments

Data is key to today’s business innovation, and moreover, the ability to deliver cognitive machine intelligence. Data is all around us, and it is most valuable when it can be harnessed in real time. Myriad industries are eager to make the most of data to create new services and enhance business decisions, but in many rigorous industrial environments, processing small automated or AI tasks at the data center level is just too inefficient to provide true value. Power consumption and costs are too high in this legacy centralized compute structure because of excessive, albeit necessary, use of compute, bandwidth and storage resources. Further, high latency means that performance takes a hit, and insufficient data privacy creates another headache.

Data growth, combined with edge computing environment complexities, is driving the AI computing framework away from general CPU/GPU options and toward specialized accelerators based on domain-specific architectures that use the M.2 standard—options that are smaller and more power efficient. It’s a strategy to address a data challenge that is complex, real and not going away. Application designers and developers must recognize an urgent need for performance acceleration that resides closer to data sources and is purpose-built for the task at hand—particularly as edge computing hardware is deployed to cope with data processing and alleviate related burdens in data centers and in the cloud.

There is a clear differentiation between a general-purpose embedded computer and one that’s designed to handle inferencing algorithms. M.2 is proving to be a powerful design option for system architects, offering domain-specific value that meets the precise needs of AI workloads to eliminate performance roadblocks. For system developers, the opportunity for purpose-built systems is immense, with smarter data handling poised to advance AI and inference applications even more broadly across global infrastructure industries.

To see real-world benchmarks of various deep learning models that use M.2 performance accelerator modules in purpose-built industrial computing solutions, visit Premio Inc.

This feature originally appeared in the ebook Automation 2022 Volume 6: IIoT & Industry 4.0.

About The Author

Dustin Seetoo is the director of product marketing at Premio Inc. For more than 30 years, Premio has been a global solutions provider specializing in the design and manufacturing of computing technology from the edge to the cloud for enterprises with complex, highly specialized requirements.
