Convolutional Neural Network on FPGA beats all efficiency benchmarks

October 02, 2018 // By Julien Happich
With a focus on FPGA design and SoCs, IP provider Omnitek has released what it claims is the highest-performance Convolutional Neural Network (CNN) on an FPGA, achieving over 50% higher performance than any competing CNN implementation and outperforming GPUs for a given power or cost budget.

The Omnitek Deep Learning Processing Unit (DPU), available now as a CNN, will soon also be available as a Recurrent Neural Network (RNN) and a Multi-Layer Perceptron (MLP). In its highest-performance implementation, the DPU features 12 CNN engines capable of churning out 20 TeraOps/s at better than 60% efficiency.

Discussing the merits of FPGAs for deep learning applications, Omnitek's CEO Roger Fawcett emphasized that while ASICs and other fixed architectures are rapidly becoming obsolete in the fast-evolving world of artificial intelligence, FPGAs are the only platform that can be rewired from one machine learning architecture to the next to achieve optimum efficiency at different tasks. A new bitstream is all it takes for an FPGA to run a different neural network topology.

The DPU design flow: hardware optimization would be done by Omnitek as a consultancy job.

The CEO took Google's Tensor Processing Units (TPUs) as an example, which were reworked in a second and a third ASIC spin. Even then, when tasked with different workloads, the TPUs' efficiency can drop dramatically to only a few percent of their specified peak TeraOps/s figure. What's more, the third TPU version requires water cooling.

Demonstrated as a GoogLeNet Inception-v1 CNN using 8-bit integer resolution, the Omnitek DPU achieves 16.8 TOPS and can run inference at over 5,300 images per second on a Xilinx Virtex UltraScale+ XCVU9P-3 FPGA. This makes it well suited to object detection and video processing applications at the edge and in the cloud, such as intelligent super-resolution 8K upscaling, where performance matters most.

Image classification performed by Omnitek's DPU, at a throughput rate over 5,300 images per second.
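As a rough sanity check on the figures quoted above, the short Python sketch below derives the work per image and the fraction of peak throughput that the GoogLeNet demonstration would represent. It assumes that the 16.8 TOPS figure is sustained throughput and that the 20 TeraOps/s figure is the peak of the same 12-engine configuration; the article does not state either point explicitly, and the function names are illustrative only.

```python
# Back-of-envelope check on the quoted GoogLeNet Inception-v1 figures.
# Assumption (not stated in the article): 16.8 TOPS is sustained throughput
# and 20 TOPS is the peak of the same 12-engine DPU configuration.

SUSTAINED_TOPS = 16.8      # quoted for GoogLeNet Inception-v1, 8-bit integer
PEAK_TOPS = 20.0           # quoted for the 12-engine DPU
IMAGES_PER_SECOND = 5300   # quoted as "over 5,300"


def ops_per_image(tops: float, images_per_second: float) -> float:
    """Operations spent per image at the given throughput."""
    return tops * 1e12 / images_per_second


def utilisation(sustained_tops: float, peak_tops: float) -> float:
    """Fraction of peak throughput that is actually sustained."""
    return sustained_tops / peak_tops


if __name__ == "__main__":
    giga_ops = ops_per_image(SUSTAINED_TOPS, IMAGES_PER_SECOND) / 1e9
    print(f"Work per image: about {giga_ops:.2f} GOps")
    print(f"Share of peak:  {utilisation(SUSTAINED_TOPS, PEAK_TOPS):.0%}")
```

Under these assumptions the demonstration works out to roughly 3.2 billion operations per image and about 84% of peak, which is consistent with the "better than 60% efficiency" claim.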

The maximum number of engines on Omnitek's DPU was chosen as a perfect match to Xilinx's Virtex UltraScale+ XCVU9P and its 12,000 MACs and over 6,000 DSP slices. Because the engines are all identical and run in parallel, power efficiency is determined by the number of engines instantiated. Smaller FPGAs could accommodate four engines or fewer.
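To illustrate how the quoted numbers relate to one another, here is a minimal Python sketch of the scaling arithmetic. It rests on assumptions the article does not confirm: that each MAC counts as two operations (one multiply and one add, a common convention), that the MACs are split evenly across the 12 engines, and that peak throughput scales linearly with engine count at an unchanged clock. The clock frequency it derives is an implied value, not a published specification.

```python
# Scaling arithmetic for the 12-engine DPU, under stated assumptions:
#   - one MAC = 2 operations (multiply + accumulate)
#   - the MACs are divided evenly between identical engines
#   - peak throughput scales linearly with the number of engines

TOTAL_ENGINES = 12     # quoted maximum engine count
TOTAL_MACS = 12_000    # quoted MAC count on the XCVU9P
PEAK_TOPS = 20.0       # quoted peak for the full configuration
OPS_PER_MAC = 2        # assumption: multiply and add counted separately


def macs_per_engine() -> int:
    """MAC units available to each engine if split evenly."""
    return TOTAL_MACS // TOTAL_ENGINES


def implied_clock_mhz() -> float:
    """Clock rate needed for the MAC array to reach the quoted peak."""
    return PEAK_TOPS * 1e12 / (TOTAL_MACS * OPS_PER_MAC) / 1e6


def peak_tops_for(engines: int) -> float:
    """Peak TOPS of a smaller build, assuming linear scaling."""
    if not 1 <= engines <= TOTAL_ENGINES:
        raise ValueError("engine count must be between 1 and 12")
    return PEAK_TOPS * engines / TOTAL_ENGINES


if __name__ == "__main__":
    print(f"MACs per engine:      {macs_per_engine()}")
    print(f"Implied clock:        {implied_clock_mhz():.0f} MHz")
    print(f"Peak with 4 engines:  {peak_tops_for(4):.2f} TOPS")
```

On these assumptions, each engine would use about 1,000 MACs, and a four-engine build on a smaller FPGA would peak at roughly a third of the full design's throughput.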
