The InferX X1 inference chip accelerates neural network models such as object detection and recognition for robotics, industrial automation, medical imaging, gene sequencing, bank security, retail analytics, autonomous vehicles, aerospace, and more. According to the company, it runs YOLOv3 object detection and recognition 30% faster than Nvidia’s Jetson Xavier and runs other real-world customer models up to ten times faster.
The company says many customers plan to use YOLOv3 in robotics, bank security, and retail analytics products because it is the highest-accuracy object detection and recognition algorithm, while other customers have developed custom models for applications that need more throughput at lower cost. The company says it has benchmarked models for these applications and demonstrated to these customers that InferX X1 provides the needed throughput at lower cost.
“Customers with existing edge inference systems are asking for more inference performance at better prices so they can implement neural networks in higher volume applications,” says Geoff Tate, CEO and co-founder of Flex Logix. “InferX X1 meets their needs with both higher performance and lower prices. InferX X1 delivers a 10-to-100 times improvement in inference price/performance versus the current industry leader.”
The InferX X1 die is 54 mm² – about 1/5th the size of a penny. Its high-volume price, says the company, is as much as 10 times lower than that of Nvidia’s Xavier NX, enabling high-quality, high-performance AI inference to be implemented for the first time in mass-market products selling in the millions of units.
Technology details and specifications include the following:
- High MAC utilization – up to 70% for large models/images – translates into less silicon area and cost
- 1-Dimensional Tensor Processors (1D TPUs), each a 1D systolic array:
  - 64-byte input tensor
  - 64 INT8 MACs
  - 32 BF16 MACs
  - 64-byte x 256-byte weight matrix
  - Produces an output tensor every 64 cycles using 4,096 MAC operations
- Reconfigurable Tensor Processor made up of 64 1D TPUs per X1
- TPUs can be configured in series or in parallel to implement a wide range of tensor operations; this flexibility enables high performance implementation of new operations such as 3D convolution
- Programmable interconnect provides a full speed, non-contention data path from SRAM through the TPUs to SRAM
- eFPGA programmable logic implements high speed state machines that control the TPUs and implement the control algorithms for the operators
- Each layer of a model is configured exactly as needed; reconfiguration for a new layer takes just microseconds
- DRAM traffic bringing in the weights and configuration for the next layer occurs in the background during compute of the current layer; this minimizes compute stalls
- Combining two layers in one configuration (layer fusion) minimizes DRAM traffic delays
- Minimal memory keeps cost down: LPDDR4x DRAM, 14MB total SRAM
- x4 PCIe Gen 3 or Gen 4 provides rapid communication with the host
- 54 mm² die size in a 16nm process
- 21 x 21 mm flip-chip Ball Grid Array package
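The 1D TPU throughput figures above can be sanity-checked with a small model. The sketch below is an illustrative interpretation, not Flex Logix’s implementation: it treats the TPU as 64 INT8 MAC units sweeping a 64-element input tensor against one 64-column tile of the 64 x 256 weight matrix, one weight row per cycle, so that one output tensor emerges every 64 cycles after exactly 4,096 MAC operations.

```python
import numpy as np

def systolic_1d_tile(x, w_tile):
    """Model one 1D systolic pass: 64 MAC units, one weight row per cycle.

    x      : (64,) INT8 input tensor
    w_tile : (64, 64) INT8 weight tile (one 64-column slice of the
             64 x 256 weight matrix)
    Returns the 64-element output tensor and the MAC-operation count.
    """
    assert x.shape == (64,) and w_tile.shape == (64, 64)
    out = np.zeros(64, dtype=np.int32)    # accumulate in wider precision
    macs = 0
    for cycle in range(64):               # 64 cycles per output tensor
        # All 64 MAC units fire in parallel in this cycle:
        out += x[cycle] * w_tile[cycle, :].astype(np.int32)
        macs += 64
    return out, macs

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=64, dtype=np.int8)
w = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
out, macs = systolic_1d_tile(x, w)
print(macs)  # 4096 MAC operations per output tensor
```

The result matches an ordinary matrix-vector product (`x @ w`), which is the point of the systolic arrangement: the same arithmetic, but with each MAC unit fed by its neighbor so no cycle is spent waiting on memory.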
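The background weight-loading scheme described above is, in effect, double buffering: while the TPUs compute layer N, the next layer’s weights and configuration stream in from DRAM, and the chip stalls only if the load outlasts the compute. A minimal cycle-accounting sketch of that overlap (all cycle counts are made-up illustrative numbers, not X1 specifications):

```python
# Double-buffering sketch: overlap DRAM weight loads with compute.
# Cycle counts below are hypothetical, chosen only to show the effect.

def run_model(layers, overlap=True):
    """Return total cycles for a list of (compute_cycles, load_cycles) layers."""
    total = layers[0][1]              # first layer's weights must load up front
    for i, (compute, _) in enumerate(layers):
        next_load = layers[i + 1][1] if i + 1 < len(layers) else 0
        if overlap:
            # Prefetch the next layer's weights during this layer's compute;
            # a stall occurs only when the load takes longer than the compute.
            total += max(compute, next_load)
        else:
            total += compute + next_load  # serial: every load fully exposed
    return total

layers = [(1000, 300), (800, 900), (1200, 400)]  # (compute, weight-load) cycles
serial = run_model(layers, overlap=False)
overlapped = run_model(layers, overlap=True)
print(serial, overlapped)  # overlapped total is strictly smaller
```

In this toy schedule the middle layer’s 900-cycle load exceeds its predecessor’s compute only slightly, so nearly all DRAM traffic hides behind compute; layer fusion pushes in the same direction by cutting the number of weight loads outright.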
Sampling to early-engagement customers will begin soon, with broader sampling in Q1 2021, when customer samples and advance Compiler and Software Tools will also be available. Mass-production chips and software will be available in Q2 2021.