October 20, 2021
Today, almost all of Google's services, such as Search, Maps, Photos, and Translate, involve AI. These AI services rely on Google's TPUs for training and inference. Google has long deployed large numbers of TPUs in its data centers to accelerate AI model training and inference, at first only for its own use, and later also offering them to third parties as a cloud computing service and selling them as products.
At this year's online Taiwan Artificial Intelligence Annual Conference, Cliff Young, a software engineer from Google's research department, delivered the keynote speech on the first day. Cliff Young is not only a core member of the Google Brain team but also a main designer of Google's TPU chips, having designed and built the TPUs that are deployed in Google's data centers as AI hardware accelerators for training and inference of various AI models. Before joining Google, he was responsible for the design and construction of laboratory supercomputers at D. E. Shaw Research and Bell Labs. In his speech, he not only explained first-hand why Google decided to develop the TPU on its own, but also shared his latest observations on how the deep learning revolution will affect the future development of AI.
Cliff Young said that ever since deep learning neural networks began to shine in speech recognition in 2009, the technology has been applied to a new field almost every year, and each time deep learning has brought breakthrough progress: from AI image recognition and AI playing chess and Go to AI interpretation of retinopathy, language translation, robotic picking, and more. "This is something we had never thought of before."
Precisely because deep learning has brought major changes to the way humans perform tasks in different fields, he drew on the model of scientific revolutions proposed by the well-known American philosopher of science Thomas Kuhn and argued that deep learning is itself a scientific revolution: a paradigm shift, not just normal science.
In his book "The Structure of Scientific Revolutions", Thomas Kuhn proposed two models of scientific development. The first is the normal science model, in which new facts are understood through experiments and proofs. When a new science emerges that the old science cannot accommodate, the two come into conflict and a second model takes over: the scientific revolution. Under this model, the new science completely overturns the practices of the old one. "I think the deep learning revolution is just such a shift, replacing traditional computer science," Cliff Young said.
Furthermore, he pointed out that deep learning is a data-driven decision-making process. Unlike traditional stored value or heuristic decision-making methods, deep learning algorithms use observable data to give humans better ways to make decisions. User recommendation is one example: a system can recommend suitable products or return the best search results based on user profiles or online behavior.
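The contrast between a heuristic rule and a data-driven decision can be illustrated with a minimal sketch. This is not a deep learning model and is not taken from the speech; it only counts which products appear together in a small, made-up purchase history. The names `baskets`, `heuristic_recommend`, and `data_driven_recommend` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical purchase history: each inner list is one user's basket.
baskets = [
    ["laptop", "mouse"],
    ["laptop", "mouse", "keyboard"],
    ["laptop", "keyboard"],
    ["phone", "case"],
    ["laptop", "mouse"],
]


def heuristic_recommend(item):
    """Heuristic approach: a fixed lookup table written by a developer."""
    rules = {"laptop": "keyboard", "phone": "case"}
    return rules.get(item)


def data_driven_recommend(item, history):
    """Data-driven approach: recommend what was most often bought with `item`."""
    co_counts = Counter()
    for basket in history:
        if item in basket:
            # Count every other product that shares a basket with `item`.
            co_counts.update(other for other in basket if other != item)
    if not co_counts:
        return None  # No observations for this item.
    return co_counts.most_common(1)[0][0]


print(heuristic_recommend("laptop"))             # fixed rule, ignores the data
print(data_driven_recommend("laptop", baskets))  # follows observed behavior
```

In this toy data the hand-written rule and the observed behavior disagree for "laptop", which is the point of the comparison: the data-driven version changes its answer when the observations change, while the rule does not. A deep learning recommender replaces the simple counting step with a trained model, but the decision is driven by data in the same way.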
But he also admitted that, unlike mathematical principles, which can be explained, the way deep learning models work is still difficult to explain. As a result, scientists cannot yet use an understanding of why these models work to find better ways of improving their efficiency. Judging from the experience of the Industrial Revolution, he said, a full explanation of how deep learning operates may have to wait for the emergence of something like synthetic neural dynamics, which could take decades. That is why he also said: "In deep learning research, asking how is more important than asking why."
Cliff Young reviewed the course of the machine learning revolution and identified the 2012 AlexNet neural network architecture as a watershed. AlexNet, proposed by Alex Krizhevsky et al., used GPUs to build a deep learning model, set a new record with 85% accuracy, and won that year's ImageNet image classification competition in one fell swoop.
The competition result later drew great attention from Google, which saw great potential in deep learning and began to invest in research. After investing, however, the company found that the performance of deep learning models in image recognition and classification depends heavily on the floating-point computing capability of GPUs. Learning and training AI models require a large amount of computing resources, and the cost of using GPUs for model training is very high. Google therefore resolved to develop its own dedicated processor chip for deep learning, the TPU (Tensor Processing Unit).
After three years of deep learning research, Google developed the first-generation TPU processor in 2015 and began deploying it in its own data centers to accelerate its deep learning models.
Google revealed the TPU for the first time at the Google I/O conference in 2016. Compared with the CPUs and GPUs of the time, Google's TPU not only provided 30 to 100 times higher floating-point computing performance per second, but also improved overall computing performance by as much as 15 to 30 times and the performance/power ratio by nearly 30 to 80 times. Cliff Young said that the TPU was probably the first device in the world to implement a matrix architecture design with high memory capacity.
Back then, AlphaGo, Google's AI computer Go program, defeated South Korean Go champion Lee Sedol. The hero behind the scenes was a server cabinet running on TPUs. The Google DeepMind team used 48 TPUs for AlphaGo's inference when it competed against human players.
So far, Google's TPU has gone through four generations of development. The first-generation TPU could only be used for inference. The second generation added the ability to train deep learning models, which also increased the demand for network throughput. As computing power grew, heat dissipation became a concern, so the third-generation TPU introduced liquid cooling, which in turn allowed higher TPU density. The latest, fourth generation comes in two versions: TPU v4i, which does not use liquid cooling, and TPU v4, which uses distributed liquid cooling.
In recent years, deep learning hardware accelerators have become more and more popular, and Cliff Young believes that AI training hardware and AI inference hardware will develop in different directions. He predicts that inference hardware will become more diverse in design, with different inference solutions developed to meet the needs of different scenarios, ranging from ultra-low-power devices that consume only microwatts to high-performance computing (HPC) and supercomputer applications.
For AI training hardware, he said that converging hardware architectures will become the mainstream. Many newly launched AI training chips already have a lot in common: they use a high-density compute die with integrated HBM (high-bandwidth memory), and they build a high-performance interconnect network for moving training data. The TPU, for example, uses ICI (Inter-Core Interconnect) to connect to other TPUs at high speed, and Nvidia has its NVLink high-speed interconnect interface. Although these technologies come from different teams, they have one thing in common: they are all looking for solutions to the same problems. He said that bringing these technologies together offers an opportunity to find good solutions.
Many TPU Pods are currently deployed in Google's data center
On the other hand, he also observed that the global AI race has reached a white-hot stage in recent years. Deep learning has driven rapid progress in natural language models, but the AI models that need to be trained have also grown larger and larger, seemingly without end. To train the GPT-3 text generator model, which has 175 billion parameters, OpenAI built a computing cluster of 10,000 GPUs, and the training consumed 3,640 petaflop/s-days of compute.
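The GPT-3 training figure is normally expressed in petaflop/s-days, where one petaflop/s-day is the work done by sustaining 10^15 floating-point operations per second for one full day. The short sketch below converts the 3,640 figure into a total operation count and into wall-clock days for a given sustained throughput. The 100 petaflop/s value is a hypothetical number chosen only to show the arithmetic; it is not a measurement of any real cluster, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
PETA = 10**15                   # 1 petaflop/s = 10^15 operations per second


def pfs_days_to_flops(pfs_days):
    """Convert petaflop/s-days into a total count of floating-point operations."""
    return pfs_days * PETA * SECONDS_PER_DAY


def days_to_train(pfs_days, sustained_petaflops):
    """Wall-clock days needed at a given sustained throughput (in petaflop/s)."""
    return pfs_days / sustained_petaflops


gpt3_pfs_days = 3640  # figure quoted for GPT-3 training

total_flops = pfs_days_to_flops(gpt3_pfs_days)
print(f"Total operations: {total_flops:.3e}")  # about 3.145e+23

# Hypothetical cluster that sustains 100 petaflop/s (assumption for illustration).
print(f"Days at 100 PF/s: {days_to_train(gpt3_pfs_days, 100):.1f}")  # 36.4
```

Read this way, 3,640 is the number of days the job would take on a machine that sustains exactly one petaflop/s, which is why larger clusters are needed to bring training time down to weeks.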
To train very large AI models like GPT-3, Google also builds TPU Pods, supercomputer clusters made up of many interconnected TPUs, and places them in its own data centers to accelerate AI model training. Over the past few years, the Google TPU Pod has grown from 256 TPUs to 1,024, and a Pod now has as many as 4,096 computing nodes. The figure above shows an example of the TPU Pods used in Google's data centers. The TPU Pod cluster is built on two levels, each holding multiple rack cabinets. Each cabinet houses dozens of TPUs, including TPU v2 and TPU v3, which are connected to the other TPUs at high speed through network cabling.
But to keep up with the development of deep learning, Cliff Young believes it is not enough to add more machines for training; the existing software and hardware architecture design must change as well. He proposed the concept of materials-to-application codesign, arguing that the design of future deep learning architectures needs to incorporate codesign at every level, from physics all the way up to the application. He sees this as a way to break through the bottleneck of Moore's Law and find a new way forward for deep learning.
He further explained that in traditional codesign, hardware and software communicate through only a thin layer, the ISA (instruction set architecture). Codesign based on a domain-specific architecture (DSA), by contrast, spans many different software layers, architecture layers, and interfaces. The software side includes libraries, compilers, applications, models, algorithms, and numerics, while the hardware side includes physical design, semiconductor materials, architecture, and microarchitecture. This kind of software-hardware codesign can be applied to the design and optimization of future deep learning architectures. In memory technology, for example, designs could greatly reduce the number of bit overwrites during model training or make use of memory with slower read speeds.