Layer Processing Unit (LPU)

AI accelerator for safe real-time vision, recognition and understanding

Revolutionary chip architecture boosts AI efficiency

Representation of Neural Layers

Deep-learning-based systems for mobile applications are a rapidly growing field, especially for driving- and safety-related tasks in transportation and industry. The future of artificial intelligence and neural-network-based systems lies neither with the CPU nor with the GPU or TPU, but with a completely new AI chip architecture. The solution is called LPU – “Layer Processing Unit“ – representing a paradigm shift in the hardware and software structure of neural networks, which European technology leader EYYES is already bringing to market in its products.

By multiplying the number of parallel computing operations, processing speed and data throughput are maximized. This enables the implementation of particularly powerful, energy-efficient systems.


Most advanced technology

How the new LPU works

First, the neural networks are optimized so that they can be used in an embedded/edge environment. The aim is to reduce the size of the network and the required computing operations without loss of quality.
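The article does not say which optimization method EYYES uses for this step; pruning and quantization are common techniques for shrinking a network for embedded use. A minimal, hypothetical sketch of 8-bit weight quantization, assuming NumPy:

```python
import numpy as np

# Illustrative only: 8-bit quantization is one typical way to reduce network
# size for embedded/edge deployment (not necessarily EYYES's method).
def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with a per-tensor scale factor."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print(w.nbytes, q.nbytes)  # int8 storage is 4x smaller than float32
```

The rounding error per weight is bounded by half the scale factor, which is why accuracy loss from this step is typically small.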

The second step is to optimize the parallel computing operations in the IP processing core of the LPU. How does this work compared to a GPU and TPU?

Comparison with previous technologies

Graphics Processing Unit

While a GPU is fast, it can only process one vector operation at a time per clock cycle. To traverse the layers of a neural network, it needs many clock cycles, which results in high computational requirements and many memory accesses. GPUs are therefore comparatively inefficient for mobile AI applications.

Tensor Processing Unit
A TPU uses a tensor to calculate multiple vectors at once. However, even the TPU still requires many computational clocks to cache and finally process the computations of all the neurons in each layer.

The LPU can compute the tensors of all neurons in one layer of a neural network simultaneously in a single computation cycle, including addition of the results and consideration of the activation function of the neurons. It processes the incoming data in parallel and performs activation and pooling in the same operation step. This patented process enables the LPU to process the required billions of computing operations even at low clock frequencies. It is thus a highly efficient chip technology for embedded AI applications.
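The contrast described above can be sketched in a few lines of NumPy. This is a conceptual illustration of the idea, not EYYES's implementation: one path computes a dense layer neuron by neuron, the other computes all neurons at once with activation and a pooling step fused into the same operation.

```python
import numpy as np

def layer_sequential(x, W, b):
    """GPU-style idea: one dot product per loop pass, activation afterwards."""
    out = np.empty(W.shape[0])
    for i in range(W.shape[0]):          # one neuron per iteration
        out[i] = x @ W[i] + b[i]
    return np.maximum(out, 0.0)          # ReLU applied in a separate step

def layer_fused(x, W, b):
    """LPU-style idea: all neurons, activation and pooling in one step."""
    a = np.maximum(W @ x + b, 0.0)       # every neuron + ReLU at once
    return a.reshape(-1, 2).max(axis=1)  # 1D max-pool (window 2) fused in

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
W = rng.standard_normal((8, 8))
b = rng.standard_normal(8)

print(np.allclose(layer_sequential(x, W, b).reshape(-1, 2).max(axis=1),
                  layer_fused(x, W, b)))  # True: same result, fewer passes
```

Both paths produce identical pooled activations; the difference is how many passes over the data are needed, which is what the hardware parallelism exploits.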

Layer Processing Unit (LPU)
The LPU can process a network layer by layer, one layer per clock cycle, and is therefore significantly more efficient than previous hardware and software.

Parallel execution is the essential difference from the way graphics and tensor processors work. These simultaneously running computing operations bring a revolutionary advantage: with a comparable implementation in terms of clock frequency and chip technology, performance is at least three times as high as with a GPU and twice as high as with a TPU, as the adjacent comparison graphic impressively shows.

LPU TOPS compared to GPU and TPU using comparable silicon technology and clock frequency (1.6 GHz, 7 nm).
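The relationship behind such a comparison is simple arithmetic: at a fixed clock frequency, throughput in TOPS scales with the number of operations performed per cycle. The figures below are assumptions for illustration, not EYYES specifications.

```python
# Back-of-envelope sketch (numbers are illustrative assumptions, not
# published EYYES figures): TOPS = operations per cycle * clock frequency.
def tops(ops_per_cycle: int, freq_hz: float) -> float:
    """Tera-operations per second for a given parallelism and clock."""
    return ops_per_cycle * freq_hz / 1e12

freq = 1.6e9  # 1.6 GHz, as in the comparison above

# An architecture doing 3x the operations per cycle delivers 3x the TOPS
# at the same clock and silicon technology:
print(tops(10_000, freq), tops(30_000, freq))  # 16.0 48.0
```

This is why the comparison holds the clock frequency and process node constant: the gain comes entirely from per-cycle parallelism, not from a faster clock.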

LPU highlights at a glance

  • Parallelization

    The outputs of all neurons of a layer are calculated simultaneously, including addition and the activation function!

  • Enormous power

    Neural networks run about twice as fast as on TPUs and three times as fast as on GPUs

  • Highest efficiency

    Maximum power with minimum power consumption and lower hardware costs!

  • Safety

    Certifiable architecture for safety-relevant applications (ISO 26262) and real-time categorization with “explainable AI”.

  • Memory relief

    Lower memory requirements due to parallel processing.

  • Transmission rates

    Reduced buffer load enables over-the-air systems with low transmission rates

  • Low latencies

    Real-time processing with assured low latency < 100ms

  • Adaptable

    The LPU technology is open for different neural networks

System on Module with integrated LPU

The Real-Time Interface 3.0

Based on the Layer Processing Unit architecture, EYYES developed the Real-Time Interface 3.0 (RTI 3.0). It is a system on module for vision-based object recognition that can be used for a wide range of applications. Here, EYYES achieves an enormously high computing power of 18 TOPS on a small 67 mm × 58 mm board while fully utilizing the area of the Xilinx Zynq 4 MPSoC FPGA. Compared to SOMs with conventional TPU processing, the RTI 3.0 is characterized by low hardware costs, particularly high energy savings of around 25%, and flexible use in a wide range of applications.

The module is capable of processing two independent full HD video streams and outputting them to different interfaces. Its high flexibility also enables customized developments and integration via Linux drivers, e.g. for autonomous driving assistants.

RTI 3.0 board

The custom off-the-shelf (COTS) functionality of the RTI 3.0 ensures that all basic applications for object detection, such as person and vehicle recognition, are already implemented at delivery. It is thus a leading system for safety-relevant traffic applications that combines the highest performance with the most efficient utilization, leaving previous SOMs far behind in terms of energy efficiency, scalability and deployment flexibility.

Connection overview of the RTI 3.0 with LPU