Layer Processing Unit (LPU)

AI accelerator for safe real-time vision, recognition and understanding

Revolutionary chip architecture boosts AI efficiency

Representation of Neural Layers

Deep-learning-based systems for mobile applications are a rapidly growing field, especially for driving- and safety-related tasks in transportation and industry. The future of artificial intelligence and neural-network-based systems lies neither with the CPU nor with the GPU or TPU, but with a completely new AI chip architecture. The solution is called LPU – “Layer Processing Unit“ – representing a paradigm shift in the hardware and software structure of neural networks, which European technology leader EYYES is already bringing to market in its products.

By multiplying the number of parallel computing operations, processing speed and data throughput are maximized. This enables the implementation of particularly powerful, energy-efficient systems.


Most advanced technology

How the new LPU works

First, the neural networks are optimized so that they can be used in an embedded/edge environment. The aim is to reduce the size of the network and the required computing operations without loss of quality.
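The article does not say which optimization method EYYES uses for this step; pruning and quantization are common techniques for shrinking a network for embedded use. A minimal, hypothetical sketch of 8-bit weight quantization, assuming NumPy:

```python
import numpy as np

# Illustrative only: 8-bit quantization is one typical way to reduce network
# size for embedded/edge deployment (not necessarily EYYES's method).
def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with a per-tensor scale factor."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print(w.nbytes, q.nbytes)  # int8 storage is 4x smaller than float32
```

The rounding error per weight is bounded by half the scale factor, which is why accuracy loss from this step is typically small.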

The second step is to optimize the parallel computing operations in the IP processing core of the LPU. How does this work compared to a GPU and TPU?

Comparison with previous technologies

Graphics Processing Unit

While a GPU is fast, it can only process one vector operation at a time per clock cycle. To traverse the layers of a neural network, it needs many clock cycles, which results in high computational requirements and many memory accesses. GPUs are therefore comparatively inefficient for mobile AI applications.

Tensor Processing Unit
A TPU uses a tensor to calculate multiple vectors at once. However, even the TPU still requires many computational clocks to cache and finally process the computations of all the neurons in each layer.

The LPU can compute the tensors of all neurons in one layer of a neural network simultaneously in a single computation cycle, including addition of the results and consideration of the activation function of the neurons. It processes the incoming data in parallel and performs activation and pooling in the same operation step. This patented process enables the LPU to process the required billions of computing operations even at low clock frequencies. It is thus a highly efficient chip technology for embedded AI applications.
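The contrast described above can be sketched in a few lines of NumPy. This is a conceptual illustration of the idea, not EYYES's implementation: one path computes a dense layer neuron by neuron, the other computes all neurons at once with activation and a pooling step fused into the same operation.

```python
import numpy as np

def layer_sequential(x, W, b):
    """GPU-style idea: one dot product per loop pass, activation afterwards."""
    out = np.empty(W.shape[0])
    for i in range(W.shape[0]):          # one neuron per iteration
        out[i] = x @ W[i] + b[i]
    return np.maximum(out, 0.0)          # ReLU applied in a separate step

def layer_fused(x, W, b):
    """LPU-style idea: all neurons, activation and pooling in one step."""
    a = np.maximum(W @ x + b, 0.0)       # every neuron + ReLU at once
    return a.reshape(-1, 2).max(axis=1)  # 1D max-pool (window 2) fused in

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
W = rng.standard_normal((8, 8))
b = rng.standard_normal(8)

print(np.allclose(layer_sequential(x, W, b).reshape(-1, 2).max(axis=1),
                  layer_fused(x, W, b)))  # True: same result, fewer passes
```

Both paths produce identical pooled activations; the difference is how many passes over the data are needed, which is what the hardware parallelism exploits.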

Layer Processing Unit (LPU)
The LPU can process a network layer by layer, one layer per clock cycle, and is therefore significantly more efficient than previous hardware and software.

Parallel execution is the essential difference from the way graphics and tensor processors work. These simultaneously running computing operations bring a revolutionary advantage: with a comparable implementation in terms of clock frequency and chip technology, performance is at least three times as high as with a GPU and twice as high as with a TPU, as the adjacent comparison graphic impressively shows.

LPU TOPS compared to GPU and TPU using comparable silicon technology and clock frequency (1.6 GHz, 7 nm).
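The relationship behind such a comparison is simple arithmetic: at a fixed clock frequency, throughput in TOPS scales with the number of operations performed per cycle. The figures below are assumptions for illustration, not EYYES specifications.

```python
# Back-of-envelope sketch (numbers are illustrative assumptions, not
# published EYYES figures): TOPS = operations per cycle * clock frequency.
def tops(ops_per_cycle: int, freq_hz: float) -> float:
    """Tera-operations per second for a given parallelism and clock."""
    return ops_per_cycle * freq_hz / 1e12

freq = 1.6e9  # 1.6 GHz, as in the comparison above

# An architecture doing 3x the operations per cycle delivers 3x the TOPS
# at the same clock and silicon technology:
print(tops(10_000, freq), tops(30_000, freq))  # 16.0 48.0
```

This is why the comparison holds the clock frequency and process node constant: the gain comes entirely from per-cycle parallelism, not from a faster clock.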

LPU highlights at a glance

  • Parallelization

    The outputs of all neurons of a layer are calculated simultaneously, including addition and the activation function!

  • Enormous power

    Neural networks run about twice as fast as on TPUs and three times as fast as on GPUs

  • Highest efficiency

    Maximum power with minimum power consumption and lower hardware costs!

  • Safety

    Certifiable architecture for safety-relevant applications (ISO 26262) and real-time categorization with “explainable AI”.

  • Memory relief

    Lower memory requirements due to parallel processing.

  • Transmission rates

    Reduced buffer load enables over-the-air systems with low transmission rates

  • Low latencies

    Real-time processing with assured low latency < 100ms

  • Adaptable

    The LPU technology is open for different neural networks

System on Module with integrated LPU

The Real-Time Interface 3.0

Based on the Layer Processing Unit architecture, EYYES developed the Real-Time Interface 3.0 (RTI 3.0). It is a system on module for vision-based object recognition that can be used for a wide range of applications. Here, EYYES achieves an enormously high computing power of 18 TOPS on a small 67 mm × 58 mm board while fully utilizing the area of the Xilinx Zynq 4 MPSoC FPGA. Compared to SOMs with conventional TPU processing, the RTI 3.0 is characterized by low hardware costs, particularly high energy savings of around 25%, and flexible use in a wide range of applications.

The module is capable of processing two independent full HD video streams and outputting them to different interfaces. Its high flexibility also enables customized developments and integration via Linux drivers, e.g. for autonomous driving assistants.

RTI 3.0 board

The custom off-the-shelf (COTS) functionality of the RTI 3.0 ensures that all basic applications for object detection, such as person and vehicle recognition, are already implemented at delivery. It is thus a leading system for safety-relevant traffic applications that combines the highest performance with the most efficient utilization, leaving previous SOMs far behind in terms of energy efficiency, scalability and deployment flexibility.

Connection overview of the RTI 3.0 with LPU