At this year's GTC, Nvidia's premier conference for technical computing with graphics
processors, the company reserved the top keynote for its CEO, Jensen
Huang. Over the years, the GTC conference has grown from a segment of a
larger, mostly gaming-oriented and somewhat scattershot conference
called "nVision" into one of the key conferences mixing
academic and commercial high-performance computing.
Jensen's message was that GPU-accelerated machine learning is growing
to touch every aspect of computing. While it's becoming easier to use
neural nets, the technology still has a way to go to reach a broader
audience. It's a hard problem, but Nvidia likes to tackle hard problems.
The Nvidia strategy is to disperse machine learning into every
market. To accomplish this, the company is investing in the Deep Learning
Institute, a training program to spread the deep learning neural net
programming model to a new class of developers.
Much as Sun promoted Java with an extensive series of courses, Nvidia
wants to get all programmers to understand neural net programming. With
deep neural networks (DNNs) proliferating across many segments, and with
cloud support from all major cloud service suppliers, deep learning (DL)
can be everywhere -- accessible any way you want it, and integrated
into every framework.
DL also will come to the Edge; IoT will be so ubiquitous that we will
need software writing software, Jensen predicted. The future of
artificial intelligence is about the automation of automation.
Nvidia's conference is all about building a pervasive ecosystem around
its GPU architectures. The ecosystem influences the next GPU iteration
as well. With early GPUs for high-performance computing and
supercomputers, the market demanded more precise computation in the form
of double-precision floating-point (fp64) processing, and Nvidia was
the first to add fp64 units to its GPUs.
GPUs are the predominant accelerator for machine learning training,
but they also can be used to accelerate the inference (decision)
execution process. Inference doesn't require as much precision, but it
needs high throughput. To meet that need, Nvidia's Pascal architecture can
perform fast 16-bit floating-point (fp16) math.
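As a rough illustration of what that packed fp16 math looks like in CUDA, the sketch below uses the __hfma2 intrinsic, which performs two half-precision fused multiply-adds per instruction. It assumes a GPU with native fp16 arithmetic (such as the Pascal-based P100), and the kernel name and parameters are purely illustrative.

    #include <cuda_fp16.h>

    // Illustrative only: y[i] = a * x[i] + y[i], with two fp16 values
    // packed into each __half2, so each __hfma2 does two multiply-adds.
    __global__ void fp16_axpy(const __half2 *x, __half2 *y, __half2 a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = __hfma2(a, x[i], y[i]);
    }

Trading precision for density this way is what lets a Pascal GPU push roughly twice the arithmetic throughput in fp16 as in fp32 for inference workloads.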
Nvidia's newest architecture -- Volta -- addresses the need for faster neural net
processing by incorporating a processing unit dedicated to DNN tensors.
The Volta GPU already has
more cores and processing power than the fastest Pascal GPU, but in
addition, the tensor core pushes the DNN performance even further. The
first Volta chip, the V100, is designed for the highest performance.
The V100 packs a massive 21 billion transistors, built in semiconductor
foundry TSMC's 12nm FFN high-performance manufacturing process. The 12nm
process -- a shrink of the 16nm FF process -- allows the reuse of design models
from 16nm, which reduces design time.
Even with the shrink, at 815 mm² Nvidia pushed the size of the V100 die to the very limits of the optical reticle.
The V100 builds on Nvidia's work with the high-performance Pascal
P100 GPU, including the same mechanical layout, electrical connections, and
the same power requirements. This makes the V100 an easy upgrade from
the P100 in rack servers.
For traditional GPU processing, the V100 has 5,120 CUDA
(compute unified device architecture) cores. The chip is capable of 7.5
Tera FLOPS of fp64 math and 15 Tera FLOPS of fp32 math.
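Those figures follow straightforwardly from the core count and the clock: assuming the V100's boost clock of roughly 1.45 GHz (a spec not quoted in the keynote), 5,120 fp32 cores x 2 floating-point operations per clock (one fused multiply-add) x 1.45 GHz works out to about 15 Tera FLOPS, with the fp64 units delivering half that rate.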
Feeding data to the cores requires an enormous amount of memory
bandwidth. The V100 uses second-generation high-bandwidth memory (HBM2)
technology to feed 900 Gigabytes/sec of bandwidth to the chip from the
stacked memory on the same package.
While the V100 supports the traditional PCIe interface, the chip
expands that capability by delivering 300 GB/sec over six NVLink
interfaces for GPU-to-GPU or GPU-to-CPU connections
(presently, only IBM's POWER8 processor supports Nvidia's NVLink wire-based
interconnect).
However, the real change in Volta is the addition of the tensor math
unit. With this new unit, it's possible to perform a 4x4x4 matrix
operation in one clock cycle. The tensor unit takes in 16-bit
floating-point values and performs a matrix multiply and an
accumulate, all within that single cycle.
Internal computations in the tensor unit are performed with fp32
precision to ensure accuracy over many calculations. The V100 can
perform 120 Tera FLOPS of tensor math using 640 tensor cores. This will
make Volta very fast for deep neural net training and inference.
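For a sense of how this is exposed to programmers, here is a minimal sketch using CUDA's warp matrix (WMMA) API, which maps 16x16x16 fragments onto the hardware's 4x4x4 tensor operations. The kernel name is illustrative, the inputs are assumed to be 16x16 tiles, and it must be compiled for the Volta (sm_70) architecture.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Illustrative sketch: one warp computes a 16x16 tile of D = A*B + C
    // on the tensor cores. A and B are fp16; the accumulator is fp32.
    __global__ void tensor_tile(const half *A, const half *B, float *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);            // start the accumulator at zero
        wmma::load_matrix_sync(a, A, 16);          // leading dimension of 16
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);            // fused multiply-accumulate
        wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
    }

A single warp launched against a kernel like this fills one 16x16 output tile; production libraries tile and pipeline such operations across all 640 tensor cores.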
Because Nvidia has already built an extensive DNN framework around its
cuDNN libraries, software will be able to use the new tensor units right
out of the gate with a new set of libraries.
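In practice, that means frameworks built on cuDNN can opt into the tensor cores with a single setting. The fragment below is a hedged sketch assuming cuDNN 7 or later on a Volta part; the function name is illustrative and all of the surrounding tensor and filter descriptor setup is omitted.

    #include <cudnn.h>

    // Illustrative: ask cuDNN to route this convolution through the
    // tensor cores when data types and layouts allow it (fp16 data,
    // fp32 accumulation). convDesc is a previously created descriptor;
    // the returned status is ignored here for brevity.
    void use_tensor_cores(cudnnConvolutionDescriptor_t convDesc)
    {
        cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);
    }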
Nvidia will extend its support for DNN inference with TensorRT --
which takes trained neural nets and compiles the models for real-time
execution. The V100 already has a home waiting for it in Oak Ridge
National Laboratory's Summit supercomputer.
Bringing DL to a wider market also drove Nvidia to build a new computer
for autonomous driving. The Xavier processor is the next generation of
processor powering the company's Drive PX platform.
Toyota has chosen this new platform as the basis for its future
production autonomous cars. Nvidia couldn't reveal details of
when we'll see Toyota cars using Xavier on the road, but there will be
various levels of autonomy, including copiloting for commuting and
"guardian angel" accident avoidance.
Unique to the Xavier processor is the DLA, a deep learning
accelerator that offers 10 Tera operations of performance. The custom
DLA will improve power efficiency and speed for specialized deep learning inference functions.
To spread the DLA's impact, Nvidia will open-source the instruction set and
RTL for any third party to integrate. In addition to the DLA, the
Xavier system on chip will have Nvidia's custom 64-bit ARM cores and a
GPU based on the Volta architecture.
Nvidia continues to execute on its high-performance computing roadmap
and is starting to make major changes to its chip architectures to
support deep learning. With Volta, Nvidia has made the most flexible and
robust platform for deep learning, and it will become the standard
against which all other deep learning platforms are judged.