This note summarises details of some of the new silicon chips for machine intelligence. Its aim is to distil the most important implementation and architectural details (at least those that are currently available) and to highlight the main differences between them. I'm focusing on chips designed for training, since they represent the frontier in performance and capability. There are many chips designed for inference, but these are typically intended for embedded or edge deployments.

In summary:

| Chip | Process | Die size (mm²) | TDP (W) | On-chip RAM (MB) | Peak FP32 (TFLOPs) | Peak FP16 (TFLOPs) | Mem b/w (GB/s) | IO b/w (GB/s) |
|---|---|---|---|---|---|---|---|---|
| Cerebras WSE† | TSMC 16nm | 510 | 180 | 225 | 40.6 | n/a | 0 | Unknown |
| Google TPU v1 | 28nm | Unknown | 75 | 28 | n/a | 23 (INT16) | 30 (DDR3) | 14 |
| Google TPU v2 | 20nm* | Unknown | 200* | Unknown | Unknown | 45 | 600 (HBM) | 8* |
| Google TPU v3 | 16/12nm* | Unknown | 200* | Unknown | Unknown | 90 | 1200 (HBM2)* | 8* |
| Graphcore IPU | 16nm | 800* | 150 | 300 | Unknown | 125 | 0 | 384 |
| Habana Gaudi | TSMC 16nm | 500* | 300 | Unknown | Unknown | Unknown | 1000 (HBM2) | 250 |
| Huawei Ascend 910 | 7nm+ EUV | 456 | 350 | 64 | Unknown | 256 | 1200 (HBM2) | 115 |
| Intel NNP-T | TSMC 16FF+ | 688 | 250 | 60 | Unknown | 110 | 1220 (HBM2) | 447 |
| Nvidia Volta | TSMC 12nm FFN | 815 | 300 | 21.1 | 15.7 | 125 | 900 (HBM2) | 300 |
| Nvidia Turing | TSMC 12nm FFN | 754 | 250 | 24.6 | 16.3 | 130.5 | 672 (GDDR6) | 100 |

*Speculated
† Figures given for a single chip

Cerebras Wafer-Scale Engine

The Cerebras Wafer-Scale Engine (WSE) is undoubtedly the boldest and most innovative design to appear recently. Wafer-scale integration is not a new idea, but issues with yield, power delivery and thermal expansion have made it difficult to commercialise (see the 1989 Anamartic 160 MB solid-state disk). Cerebras use this approach to integrate 84 chips with a high-speed interconnect, uniformly scaling the 2D-mesh fabric to huge proportions. This provides a machine with a large amount of memory (18 GB) distributed among a large amount of compute (3.3 PFLOPs peak). It is unclear how this architecture scales beyond a single WSE; the current trend in neural networks is towards larger models with billions of weights, which will necessitate such scaling.
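As a rough cross-check (my own arithmetic, assuming the whole-wafer totals are simply the per-chip figures from the summary table multiplied across the 84 dies on the wafer):

```python
# Rough cross-check of the whole-wafer figures against the per-chip numbers
# in the summary table (assumption: totals = per-chip figures x 84 dies).
dies = 84
sram_per_die_mb = 225         # MB per die, from the summary table
peak_per_die_tflops = 40.6    # TFLOPs per die, from the summary table

total_sram_gb = dies * sram_per_die_mb / 1024
total_pflops = dies * peak_per_die_tflops / 1000

print(f"~{total_sram_gb:.1f} GB on-wafer SRAM")   # ~18.5 GB
print(f"~{total_pflops:.1f} PFLOPs peak")         # ~3.4 PFLOPs
```

These come out close to the 18 GB and 3.3 PFLOPs quoted above.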

General details:

Interconnect and IO:

Each core:

Each die:

References:

Google TPU v3

Few details are available on the specifications of the TPU v3, but it appears to be an incremental improvement over the TPU v2: doubling the performance and moving to HBM2 to double the memory capacity and bandwidth.

General details (per chip):

IO:

References:

Google TPU v2

The TPU v2 is designed for both training and inference. It improves on the TPU v1 by adding floating-point arithmetic and by increasing memory capacity and bandwidth with integrated HBM.

General details (per chip):

Each core:

IO:

References:

Google TPU v1

Google’s first-generation TPU was designed for inference only and supports only integer arithmetic. It acts as an accelerator for a host CPU, which sends it instructions over PCIe 3.0 to perform matrix multiplications and apply activation functions. This is a significant simplification that would have saved much design and verification time.
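As a sketch of this offload pattern, the snippet below emulates the numerics of the work sent to the device: an 8-bit integer matrix multiply with wide accumulation, followed by an activation. The function and variable names are illustrative only, not Google's API.

```python
import numpy as np

# Illustrative sketch of a TPU-v1-style layer: int8 multiply with int32
# accumulation, followed by a ReLU activation. Names are hypothetical.
def tpu_v1_style_layer(activations_int8, weights_int8):
    # Multiply in int8, accumulate in int32.
    acc = activations_int8.astype(np.int32) @ weights_int8.astype(np.int32)
    # Apply a simple activation function (ReLU) to the accumulator.
    return np.maximum(acc, 0)

x = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
w = np.random.randint(-128, 128, size=(256, 256), dtype=np.int8)
y = tpu_v1_style_layer(x, w)
```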

General details:

IO:

References:

Graphcore IPU

DISCLAIMER: I work at Graphcore, and all of the information given here is lifted directly from the linked references below.

The Graphcore IPU architecture is highly parallel: a large collection of simple processors with small local memories, connected by a high-bandwidth all-to-all ‘exchange’ interconnect. The architecture operates under a bulk-synchronous parallel (BSP) model, whereby execution of a program proceeds as a sequence of compute and exchange phases. Synchronisation is used to ensure that all processors are ready to start the exchange. The BSP model is a powerful programming abstraction because it precludes concurrency hazards, and BSP execution allows the compute and exchange phases each to make full use of the chip’s power budget. Larger systems can be built by connecting IPU chips via their ten inter-IPU links.
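The following is a minimal sketch of the BSP pattern described above, using Python threads and a barrier. It is illustrative only; it bears no relation to Graphcore's Poplar programming model beyond the compute/sync/exchange structure.

```python
import threading

# Minimal BSP sketch: each worker computes on local state, all workers
# synchronise, then they exchange data, and synchronise again before the
# next compute phase.
NUM_WORKERS = 4
NUM_SUPERSTEPS = 3
barrier = threading.Barrier(NUM_WORKERS)
local_state = list(range(NUM_WORKERS))
mailbox = [0] * NUM_WORKERS

def worker(tid):
    for _ in range(NUM_SUPERSTEPS):
        # Compute phase: operate on local memory only.
        local_state[tid] += mailbox[tid]
        # Synchronise: everyone must be ready before exchange starts.
        barrier.wait()
        # Exchange phase: send a value to the next worker's mailbox.
        mailbox[(tid + 1) % NUM_WORKERS] = local_state[tid]
        # Synchronise again before the next compute phase.
        barrier.wait()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(local_state)
```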

General details:

IO:

Each core:

References:

Habana Labs Gaudi

Habana’s Gaudi AI training processor shares similarities with contemporary GPUs, particularly its wide SIMD parallelism and HBM2 memory. The chip integrates ten 100G Ethernet links supporting remote direct memory access (RDMA). This IO capability allows large systems to be built with commodity networking equipment, in contrast to proprietary interconnects such as Nvidia’s NVLink or OpenCAPI.
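A quick back-of-envelope check of the IO figure in the summary table (my own arithmetic, assuming the 250 GB/s counts both directions of the ten links):

```python
# Back-of-envelope check of the table's IO figure, assuming 250 GB/s counts
# both directions of the ten 100G Ethernet links.
links = 10
gbit_per_link = 100

per_direction_gb_per_s = links * gbit_per_link / 8  # 125 GB/s each way
total_gb_per_s = 2 * per_direction_gb_per_s         # 250 GB/s bidirectional
print(per_direction_gb_per_s, total_gb_per_s)
```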

General details:

TPC core:

IO:

References:

Huawei Ascend 910

Huawei’s Ascend 910 also bears similarities to the latest GPUs, with wide SIMD arithmetic, a 3D ‘cube’ matrix unit comparable to Nvidia’s Tensor Cores, and an (assumed coherent) 32 MB shared L2 on-chip cache. The chip includes additional logic for 128 channels of H.264/H.265 video decoding. In their Hot Chips presentation, Huawei described overlapping the cube and vector operations to obtain high efficiency, and the challenge of the memory hierarchy, with the ratio of bandwidth to compute throughput dropping by 10x for the L1 cache (in the core), 100x for the L2 cache (shared between cores), and 2000x for external DRAM.
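A rough roofline-style estimate, using only the peak figures from the summary table, illustrates why this bandwidth-to-throughput ratio matters for keeping the cube units busy:

```python
# Roofline-style estimate from the summary-table figures for the Ascend 910:
# 256 TFLOPs FP16 peak and 1200 GB/s HBM2 bandwidth. The result is the
# arithmetic intensity (FLOPs per byte of DRAM traffic) needed to be
# compute-bound rather than memory-bound.
peak_flops_per_s = 256e12
dram_bytes_per_s = 1200e9

required_flops_per_byte = peak_flops_per_s / dram_bytes_per_s
print(f"~{required_flops_per_byte:.0f} FLOPs per DRAM byte")  # ~213
```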

General details:

Interconnect and IO:

Each DaVinci core:

References:

Intel NNP-T

This chip is Intel’s second attempt at an accelerator for machine learning, following the Xeon Phi. Like the Habana Gaudi chip, it integrates a small number of wide vector cores, with HBM2 integrated memory and similar 100 Gbit IO links.

General details:

IO:

TPC core:

Scaling:

References:

Nvidia Volta

Volta introduces Tensor Cores, HBM2 and NVLink 2.0, moving on from the Pascal architecture.
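The Tensor Cores perform a mixed-precision matrix multiply-accumulate on small tiles, with FP16 inputs and FP32 accumulation. The snippet below sketches the numerics only; it is NumPy, not CUDA/WMMA code, and the 4x4 tile size reflects the per-operation granularity Nvidia describes for Volta.

```python
import numpy as np

# Numerics of a Tensor Core operation: D = A @ B + C on 4x4 tiles, with FP16
# inputs (A, B) and FP32 accumulation (C, D). NumPy sketch only, not CUDA code.
A = np.random.randn(4, 4).astype(np.float16)
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype)  # float32
```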

General details:

IO:

References:

Nvidia Turing

Turing is an architectural revision of Volta, manufactured on the same TSMC 12 nm process, but with fewer CUDA and Tensor cores. It consequently has a smaller die and a lower power envelope. Apart from ML tasks, it is designed for real-time ray tracing, for which it also uses the Tensor Cores.

General details:

IO:

References:

Further reading

See this thread on Hacker News for discussion of this note.