1000 processor cores on a single chip; Californian researchers build ‘KiloCore’

A team from the Department of Electrical and Computer Engineering at the University of California, Davis, has designed a 1000-core processor that contains 621 million transistors and has a peak throughput of 1.78 trillion instructions per second. The KiloCore was presented at the 2016 Symposium on VLSI Technology and Circuits in Honolulu on June 16, 2016.
By Graham Prophet


“To the best of our knowledge, it is the world’s first 1,000-processor chip and it is the highest clock-rate processor ever designed in a university,” said Bevan Baas, professor of electrical and computer engineering, who led the team that designed the chip architecture. While other multiple-processor chips have been created, none exceed about 300 processors, according to an analysis by Baas’ team. Most were created for research purposes and few are sold commercially. The KiloCore chip has been fabricated and run; it was built by IBM using its 32 nm PD-SOI CMOS technology.


The basic architecture is MIMD (multiple instruction/multiple data), and each of the 7-stage-pipelined cores is a general-purpose unit with a 72-instruction set that issues a single instruction per cycle. The team says that none of the instructions is ‘algorithm-specific’, which distinguishes it from a GPU-class device. The 1.78 trillion instructions/sec figure comes at a clock speed of 1.78 GHz and 1.1V; running at 0.84V and 1 GHz, the chip consumes 13.1W, while peak energy efficiency of 5.8 pJ/Op is quoted at 0.56V and 115 MHz.
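The headline figures can be cross-checked with simple arithmetic. The sketch below is only illustrative: it assumes that all 1000 cores are busy on every cycle and that the 13.1W figure covers the whole array, neither of which the article states explicitly. The function names are made up for this example.

```python
# Figures quoted in the article; full utilisation of every core is an assumption.
CORES = 1000


def aggregate_rate(clock_hz: float, cores: int = CORES) -> float:
    """Peak instructions per second with one instruction per core per cycle."""
    return cores * clock_hz


def energy_per_instruction_pj(power_w: float, clock_hz: float, cores: int = CORES) -> float:
    """Average energy per instruction in picojoules, assuming every core is busy."""
    return power_w / aggregate_rate(clock_hz, cores) * 1e12


peak_rate = aggregate_rate(1.78e9)                       # 1.78 GHz at 1.1V
mid_point_pj = energy_per_instruction_pj(13.1, 1.0e9)    # 13.1W at 1 GHz, 0.84V

print(f"peak rate: {peak_rate:.3g} instructions/s")
print(f"energy at 1 GHz: {mid_point_pj:.1f} pJ per instruction")
```

Under these assumptions the 1 GHz operating point works out to roughly 13 pJ per instruction, a little more than twice the 5.8 pJ/Op quoted for the most efficient point.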


Each core is independently powered and can shut down to leakage-only power if it has no task to perform. Rather than a cache architecture, every processor can store instructions and data in a hierarchy of locations: local memory, one or more nearby processors, on-chip independent memory modules, or off-chip memory. The processors communicate via a high-throughput circuit-switched network plus a packet-switched network (both on-chip). The team says there is little energy overhead in sourcing operands from companion processors some way across the chip, as ‘wormhole’ routing is employed. That is, messages from an adjacent or nearby core will be routed via the ‘circuit’ network; those from further away in the processor matrix will travel via the packet network. Each core has north-south-east-west comms buffers plus a fifth channel for host-processor traffic; maximum throughput is 45.5 Gbps per router and 9.1 Gbps per port at 1.1V. At 0.9V, maximum throughput is 27.1 Gbps at 3.36 mW, and at 0.67V it is 8.1 Gbps at 429 μW.
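The router figures are consistent with five ports each carrying 9.1 Gbps, and they also imply an energy cost per bit moved. The sketch below assumes that the fifth (host) channel counts toward the per-router total and that the 0.9V and 0.67V figures are per router; the article does not say so, and the helper names are hypothetical.

```python
PORTS_PER_ROUTER = 5  # north, south, east, west, plus the host-traffic channel


def router_throughput_gbps(per_port_gbps: float, ports: int = PORTS_PER_ROUTER) -> float:
    """Aggregate router throughput if every port runs at its maximum rate."""
    return per_port_gbps * ports


def energy_per_bit_pj(power_w: float, throughput_gbps: float) -> float:
    """Energy per transferred bit in picojoules at a given power and throughput."""
    return power_w / (throughput_gbps * 1e9) * 1e12


total_gbps = router_throughput_gbps(9.1)          # quoted per-port rate at 1.1V
e_bit_0v90 = energy_per_bit_pj(3.36e-3, 27.1)     # 3.36 mW at 27.1 Gbps
e_bit_0v67 = energy_per_bit_pj(429e-6, 8.1)       # 429 uW at 8.1 Gbps

print(f"router total at 1.1V: {total_gbps:.1f} Gbps")
print(f"energy per bit at 0.9V: {e_bit_0v90:.3f} pJ")
print(f"energy per bit at 0.67V: {e_bit_0v67:.3f} pJ")
```

On these assumptions, lowering the supply from 0.9V to 0.67V roughly halves the energy per bit, at the cost of about 70% of the throughput.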


KiloCore’s 1000 processors, 1000 packet routers, and 12 independent memories are clocked by local oscillators that do not use PLLs and may change frequency, halt within 1-5 clock periods, and restart in less than one clock period to reduce power dissipation. The chip measures (nearly) 8 mm square, and has 31 rows of 32 processor cores (=992), with the remaining eight cores in a final row alongside the memories.
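The core count can be checked against the layout: 992 cores in full rows of 32 means 31 full rows (31 × 32 = 992), and the eight cores in the final row bring the total to 1000. The sketch below maps a core index to a grid position. The row-major numbering and the placement of the eight extra cores at the start of the last row are assumptions made purely for illustration, not the chip's documented addressing scheme.

```python
from typing import Tuple

COLS = 32          # cores per full row
FULL_ROWS = 31     # 31 x 32 = 992 cores
EXTRA_CORES = 8    # cores sharing the final row with the memories
TOTAL_CORES = COLS * FULL_ROWS + EXTRA_CORES


def core_position(index: int) -> Tuple[int, int]:
    """Return (row, column) for a core index, using hypothetical row-major numbering."""
    if not 0 <= index < TOTAL_CORES:
        raise ValueError(f"core index {index} is outside 0..{TOTAL_CORES - 1}")
    return divmod(index, COLS)


print(TOTAL_CORES)
print(core_position(991))   # last core of the last full row
print(core_position(992))   # first core of the partial row
```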


A major challenge of working with arrays that have high core counts is scheduling tasks and keeping all the cores busy. The team has created a programming model and compiler; they say that programming is a multi-step process that allocates programs to processors. However, to make use of available packaging, only the central 160 cores have been powered in tests; figures for full-chip performance are presumed to be extrapolations.


Each processor core can run its own small program independently of the others, which is a fundamentally more flexible approach than the so-called single-instruction-multiple-data approaches used by processors such as GPUs. The idea is to break an application up into many small pieces, each of which can run in parallel on different processors, enabling high throughput with lower energy use, Baas said, adding that the chip is the most energy-efficient “many-core” processor ever reported. For example, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7W, low enough to be powered by a single AA battery. The KiloCore chip executes instructions more than 100 times more efficiently than a modern laptop processor.
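The 0.7W figure lines up with the most efficient operating point quoted earlier in the article (5.8 pJ/Op at 115 MHz). The check below is a rough sketch that assumes one operation per instruction and all 1000 cores active on every cycle; the function name is hypothetical.

```python
def power_watts(instructions_per_s: float, pj_per_instruction: float) -> float:
    """Power drawn when executing at the given rate and energy per instruction."""
    return instructions_per_s * pj_per_instruction * 1e-12


low_power_rate = 1000 * 115e6                       # 1000 cores at 115 MHz
low_power_watts = power_watts(low_power_rate, 5.8)  # 5.8 pJ per operation

print(f"rate: {low_power_rate:.3g} instructions/s")
print(f"power: {low_power_watts:.2f} W")
```

The product comes to about two-thirds of a watt, which rounds to the 0.7W quoted for 115 billion instructions per second.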


Applications already developed for the chip include wireless coding/decoding, video processing, encryption, and others involving large amounts of parallel data such as scientific data applications and datacentre record processing.


University of California at Davis; www.ucdavis.edu/news/worlds-first-1000-processor-chip


