The Omnitek Deep Learning Processing Unit (DPU), available now as a CNN, will soon also be offered as a Recurrent Neural Network (RNN) and a Multi-Layer Perceptron (MLP). In its highest-performance implementation, the DPU features 12 CNN engines capable of churning out 20 TeraOps/s at better than 60% efficiency.
Discussing the merits of FPGAs for deep learning applications, Omnitek's CEO Roger Fawcett emphasized that while ASICs and other fixed architectures rapidly become obsolete in the fast-evolving world of artificial intelligence, FPGAs are the only platform that can be rewired from one machine learning architecture to the next to achieve optimum efficiency at different tasks. A new bitstream is all it takes for an FPGA to run a different neural network topology.
The CEO took Google's Tensor Processing Units (TPUs) as an example: they had to be reworked in a 2nd and a 3rd ASIC spin. Even then, when tasked with different workloads, a TPU's efficiency can drop dramatically, to only a few percent of its specified peak TeraOps/s figure. What's more, the 3rd TPU version requires water cooling.
Demonstrated as a GoogLeNet Inception-v1 CNN using 8-bit integer resolution, the Omnitek DPU achieves 16.8 TOPS and can run inference at over 5,300 images per second on a Xilinx Virtex UltraScale+ XCVU9P-3 FPGA. This makes it highly suited to object detection and video processing applications at the Edge and in the Cloud, such as intelligent super-resolution 8K upscaling, where performance matters most.
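As a sanity check (not an Omnitek figure), the quoted 16.8 TOPS and 5,300 images/s can be related to the per-image compute cost of GoogLeNet Inception-v1; the ~3 GOPs-per-image workload assumed below is a commonly cited estimate for that network, counting a multiply-accumulate as two operations:

```python
# Back-of-the-envelope check relating the article's two quoted figures.
# Assumption: GoogLeNet Inception-v1 costs roughly 3 GOPs per 224x224
# image (MAC counted as 2 ops) -- a widely used estimate, not from the article.

achieved_tops = 16.8e12      # quoted sustained throughput, ops/s
images_per_second = 5300     # quoted inference rate

ops_per_image = achieved_tops / images_per_second
print(f"Implied workload per image: {ops_per_image / 1e9:.2f} GOPs")
```

The result comes out near 3.2 GOPs per image, consistent with Inception-v1's known cost, which suggests the two quoted numbers describe the same benchmark run.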
The maximum number of engines on Omnitek's DPU was chosen as a perfect match for Xilinx's Virtex UltraScale+ XCVU9P with its 12,000 MACs and over 6,000 DSP slices. Because the engines are all identical and run in parallel, performance and power scale with the number of engines instantiated. Smaller FPGAs could accommodate four engines or fewer.
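Since the engines are identical and run in parallel, peak throughput should scale roughly linearly with engine count. A minimal sketch under that linear-scaling assumption, using only the 12-engine/20 TeraOps/s figures quoted above:

```python
# Linear-scaling estimate (an assumption for illustration, not an
# Omnitek specification): 12 engines are quoted at 20 TeraOps/s peak.
PEAK_TOPS_12_ENGINES = 20.0
ENGINES_MAX = 12

tops_per_engine = PEAK_TOPS_12_ENGINES / ENGINES_MAX  # ~1.67 TOPS per engine

def peak_tops(num_engines: int) -> float:
    """Estimated peak TeraOps/s for a DPU built with num_engines engines."""
    return num_engines * tops_per_engine

# A smaller FPGA fitting only 4 engines would peak at roughly 6.7 TOPS.
print(f"{peak_tops(4):.1f} TOPS")
```

This is only a first-order estimate; real sustained throughput would also depend on memory bandwidth and the achieved efficiency figure mentioned earlier.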