November 19, 2020
Driven by the rapid development of artificial intelligence and machine learning chips aimed at a wide variety of end markets and systems, the number of memory types and architectures designers can choose from has exploded.
In these systems, model parameter counts can range from 10 billion to 100 billion and vary widely between chips and applications. Neural network training and inference are among the most complex workloads today, which makes it difficult to find the optimal memory solution. These systems consume enormous compute resources (dominated by multiply-accumulate operations) and enormous memory bandwidth, and a slowdown in any part of the system can drag down the whole.
Steve Roddy, vice president of product marketing for Arm's Machine Learning Group, said: "A range of techniques has been deployed to reduce the complexity of neural network models and their memory requirements, for example quantization, network simplification, pruning, clustering, and model compression, all of which shrink the model as much as possible. At runtime, intelligent scheduling that reuses intermediate values across layers can also reduce memory traffic, speeding up inference."
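The quantization Roddy mentions can be sketched in a few lines. This is a minimal, framework-free illustration of post-training 8-bit quantization with a per-tensor scale; the function names and the symmetric-scaling scheme are assumptions for illustration, not any specific library's API.

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights onto int8 using a single symmetric scale."""
    scale = np.abs(weights).max() / 127.0   # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes, q.nbytes)  # 4000 1000: a 4x reduction in memory footprint
```

The 4x footprint reduction is exactly the kind of memory-traffic saving described above; the cost is a small, bounded rounding error on each weight.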
This puts tremendous pressure on memory developers to deliver as much bandwidth as possible at the lowest power, area, and cost, and the trend shows no signs of abating. Neural network models grow year after year, and the datasets needed to train them are growing too.
"The size of these models and the size of the training set are increasing at a rate of about an order of magnitude every year," said Steven Woo, an outstanding inventor and researcher from Rambus. "At the beginning of this year, when the latest natural language processing model came out, it had about 17 billion parameters. That number is big enough, but an updated version appeared this summer and the number of parameters rose to 175 billion. In other words, in about seven months, the number of parameters has increased by a full 10 times."
The neural network models of the 1980s and early 1990s had roughly 100 to 1,000 parameters. "If I have a larger model, I need more samples to train it, because every parameter must be adjusted," Woo said. "For the impatient in this field, more data means you don't want to wait that long on training. The only way out is more bandwidth. You have to be able to push data into the system and pull it out faster. Bandwidth is the top priority."
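The parameter counts Woo quotes translate directly into memory pressure. A back-of-envelope footprint calculation, assuming 2 bytes per parameter (fp16) and ignoring gradients and optimizer state, which multiply the total further during training:

```python
def footprint_gb(params, bytes_per_param=2):
    """Raw weight storage in GB, assuming fp16 (2 bytes per parameter)."""
    return params * bytes_per_param / 1e9

print(footprint_gb(17e9))   # 34.0  GB just for the weights
print(footprint_gb(175e9))  # 350.0 GB, far beyond any single memory device
```

Even before training traffic is counted, the 175-billion-parameter model's weights alone exceed the capacity of any single DRAM device, which is why bandwidth to many devices becomes the top priority.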
Another issue is energy. Woo said: "If all you had to do was double the power consumption to double the performance, life would be good. But that's not how it works. You actually have to care about power very much, because your wall outlet can only deliver so much. The fact is that people do want to increase performance by X times, but at the same time they want 2X the energy efficiency. That's what makes things difficult."
This trade-off is even harder in AI inference applications. "Today, the gap between AI/ML training and inference is getting bigger and bigger," said Marc Greenberg, director of product marketing for Cadence's IP Group. "Training requires maximum memory bandwidth and is usually performed on powerful server-class machines or very high-end GPU cards. In training, we find that the high end uses HBM memory while the lower end uses GDDR6. HBM is particularly good at delivering the highest bandwidth at the lowest energy per bit. HBM2/2E runs at data rates up to 3.2/3.6Gbps per pin, providing roughly 410/460GB/s of bandwidth between the AI/ML processor and each memory stack, and the forthcoming HBM3 standard is expected to provide more."
Of course, this performance comes at a price. "As a high-end solution, HBM carries a matching high price, so it is understandable that HBM may remain confined to server rooms and other high-end applications," Greenberg noted. "GDDR6 technology can help reduce costs. A single device can provide 512Gb/s at a 16Gbps data rate, with faster data rates to come. It is also common to place multiple devices in parallel; some graphics cards, for example, run ten or more GDDR6 components in parallel to reach 5Tb/s or even higher aggregate bandwidth."
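The bandwidth figures quoted above follow directly from interface width times per-pin data rate. A quick sanity check (the pin counts are the standard interface widths: 1024 bits per HBM stack, 32 bits per GDDR6 device):

```python
def bandwidth_gb_s(pins, gbps_per_pin):
    """Peak bandwidth in GB/s for a memory interface."""
    return pins * gbps_per_pin / 8  # convert Gb/s to GB/s

hbm2e = bandwidth_gb_s(1024, 3.6)  # one HBM2E stack, 1024-bit interface
gddr6 = bandwidth_gb_s(32, 16.0)   # one x32 GDDR6 device at 16Gbps
print(hbm2e)           # 460.8 GB/s per stack
print(gddr6)           # 64.0 GB/s per device, i.e. 512Gb/s
print(10 * gddr6 * 8)  # ten devices in parallel: 5120.0 Gb/s, ~5Tb/s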
Inference technology is still evolving, and nowhere more visibly than in edge computing. Greenberg said: "For AI inference, what we see in new designs is mainly GDDR6 and LPDDR5 memory. They provide more moderate bandwidth at a more appropriate cost, making it possible to run AI at the edge of the cloud in real time without sending the data back to a server."
Many of the AI machines now under development use very regular, carefully planned layouts and structures.
He said: "If you think back to the SoC design era, you will find that there are actually a lot of randomness in chip design. The heterogeneity of these chips is very obvious. They deploy many different functions, and many of them are heterogeneous functions. This It makes the chip look like a locker where different blocks are mixed together. However, when you look at the AI chip, you will see a very regular structure, because this method can guarantee the entire chip A large amount of data is managed in a very parallel data stream. It is different from the architecture we do in SoCs and even many CPUs. Its architecture design is mainly built around how to transmit data through the chip."
All of this directly affects the choice of memory, especially DRAM, which a few years ago was predicted to be on its way out. In fact, the opposite has happened. There are more options today than ever before, and each comes at its own price.
Vadhiraj Sankaranarayanan, DDR product technical marketing manager at Synopsys, said: "For example, we are in the midst of transitioning the DDR standard from DDR4 to DDR5. Customers who come in with DDR4 requirements often want DDR5 support as well, because their products have long service lives. It is similar with LPDDR5. Besides higher performance, these new standards also have power advantages: because they can operate at lower voltages, they reduce power consumption. They also have advantages in RAS (reliability, availability, and serviceability). Because of its high speed, the DRAM itself will carry a function that can correct single-bit errors occurring anywhere in the subsystem."
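The single-bit correction Sankaranarayanan describes is a form of error-correcting code. As a conceptual illustration only, here is a toy Hamming(7,4) code in Python: it protects 4 data bits with 3 parity bits and corrects any single flipped bit. Real on-die ECC uses much wider codewords, but the principle, a syndrome that points at the error position, is the same.

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    p1 = d[0] ^ d[1] ^ d[3]   # parity over codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]   # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]   # parity over positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_correct(c):
    """Fix at most one flipped bit in codeword c, return the 4 data bits."""
    s = 0
    for pos, bit in enumerate(c, start=1):
        if bit:
            s ^= pos          # XOR of set positions; 0 for a valid codeword
    if s:                     # nonzero syndrome is the 1-based error position
        c[s - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                  # a single-bit error strikes in the array
assert hamming74_correct(code) == word
```

On-die ECC like this lets the DRAM silently repair isolated cell errors before data ever leaves the device, which matters more as cells shrink and hold fewer electrons.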
So many memory options are necessary because memory configurations vary widely across today's AI/ML applications. Sankaranarayanan said: "We have seen design teams use LPDDR in addition to HBM, but it really depends on the bandwidth requirements. There are also cost factors to consider. With HBM, multiple DRAM dies are stacked together using through-silicon vias, and an interposer is used to put the DRAM and the SoC into the same package. The multiple packaging steps this requires make HBM expensive today. However, as AI-related applications and demand grow, HBM may become good value for the money in the near future."
Power comes first
Not surprisingly, power management is the primary consideration in AI/ML applications. This is true for data centers and edge devices.
In an SoC, the power attributable to memory can be divided into three parts.
Rambus's Woo said: "The first is the power consumed getting the bits out of the DRAM core. You can't cheat on that; you have to read the data bits from the DRAM core for it to do its job. The second is the power associated with moving the data, in the circuits at both ends of the data lines. The third is in the SoC PHY and the interface on the DRAM. It turns out that the memory core itself consumes only about a third of the power, and the other two-thirds is spent moving data back and forth between the DRAM and the SoC. That is a bit scary, because it means reading data from the DRAM core, the thing you actually have to do, is not the dominant factor in power. If you are trying to improve power efficiency, stacking these devices together can save a lot of that power. HBM devices do this. If you stack the SoC and the DRAM, the power associated with data communication can drop many times over, even an order of magnitude. That is where you might save power."
There is no free lunch here. Woo said: "Once you do that, you become limited by the power of the DRAM core itself, and you have to think about how to reduce the core's power to shrink the total further."
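Woo's point can be made numerically. The model below uses only the rough proportions quoted above: the core accounts for one-third of the power, data movement for two-thirds, and stacking cuts movement power by roughly an order of magnitude. The specific wattages are illustrative placeholders, not measurements.

```python
def total_power(core_w, movement_w, movement_reduction=1.0):
    """Memory subsystem power: fixed core power plus scaled data movement."""
    return core_w + movement_w / movement_reduction

planar  = total_power(1.0, 2.0)                          # 1/3 core, 2/3 movement
stacked = total_power(1.0, 2.0, movement_reduction=10.0) # stacking cuts movement 10x
print(planar, stacked)   # 3.0 1.2
print(stacked / planar)  # 0.4: total power drops to ~40%
```

Notice what happens after stacking: data movement nearly vanishes, and the DRAM core, previously only a third of the total, becomes the dominant consumer. That is exactly the "no free lunch" Woo describes.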
This is an ongoing area of research, but no clear technical path to a solution has emerged. As more and more bits are packed onto a chip, each bit cell becomes smaller and can hold fewer electrons. That makes it harder to determine whether a bit is a 1 or a 0, and the cells may hold their desired state for less time, requiring more frequent refresh.
New materials and new cell designs may help. Another option is managing the power of the PHY, but everything here is interdependent, so the challenges facing the PHY are genuinely hard.
Woo said: "As the speed increases, more work needs to be done to ensure the correct transmission of data. There is a tug of war here. This is similar to auctioneers, who have to speak loudly. The same is true on PHY. In order to continue to distinguish the signal, you must have the appropriate amplitude, so one of the challenges is how to set the correct amplitude, avoid the ambiguity of the signal, to ensure that the other party receives exactly what you sent. In order to clearly indicate Symbols that communicate back and forth on data lines require a lot of work. There are other techniques that try to reduce the amplitude, but they are all compromises. Usually, people don’t want to change their infrastructure. If other conditions remain the same, people will choose Incremental improvement rather than revolutionary improvement. This is where the challenge lies."
On-chip memory and off-chip memory
Today, another major trade-off in AI/ML SoC architecture is where to put the memory. Although AI chips often have on-chip memory, off-chip memory is essential for AI training.
Cadence's Greenberg said: "The main question is how much data you want to store for the neural network. Each neuron needs a certain amount of storage. Everyone wants to use on-chip memory; wherever you can use it, you definitely want to. It is super fast and its power consumption is super low, but it is expensive. On a fixed budget, every square millimeter of memory you put on the main chip means one less square millimeter for logic and other functions."
On-chip memory is so expensive because it is essentially manufactured in a logic process. He said: "If I am using a 7nm or 5nm logic process with 16 metal layers, that memory will be very expensive. If you can build the memory on a discrete chip instead, you can optimize the memory process for cost. Not only are you not forced into that many metal layers, but the cost per square millimeter of a discrete memory chip is far lower than that of a 7nm or 5nm logic chip."
Most AI/ML engineering teams struggle with the choice between on-chip and off-chip memory, because these designs are still early in their life cycles. Greenberg said: "Everyone wants to use on-chip memory at the start, and there is really no standard to consult. The layout actually differs in most AI chips; the industry has not yet settled on the best AI architecture. So we are basically still in the experimental stage of AI chip architecture, moving toward something most people may eventually converge on. Today's AI design is still wide open, and we see a lot of innovation. So how do you recommend a type of memory? It comes back to the key parameters everyone looks at. How much memory do you need? How many gigabytes of data do you need to store? How fast does it need to be? How much PCB area can you spare? How much do you want to spend? Everyone's path to the optimal solution will be a little different."
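Greenberg's selection questions can be framed as a crude filter over candidate memory configurations. Every number below is a made-up placeholder, not a vendor spec; the point is only the shape of the decision, matching capacity, bandwidth, and cost requirements against options.

```python
# Hypothetical memory options: capacity (GB), bandwidth (GB/s), relative cost.
options = [
    {"name": "on-chip SRAM", "gb": 0.1, "gb_s": 2000, "rel_cost": 10},
    {"name": "HBM2E stack",  "gb": 16,  "gb_s": 460,  "rel_cost": 5},
    {"name": "GDDR6 x4",     "gb": 8,   "gb_s": 256,  "rel_cost": 3},
    {"name": "LPDDR5 x2",    "gb": 16,  "gb_s": 100,  "rel_cost": 1},
]

def shortlist(opts, need_gb, need_gb_s, budget):
    """Keep only options meeting capacity, bandwidth, and cost requirements."""
    return [o["name"] for o in opts
            if o["gb"] >= need_gb
            and o["gb_s"] >= need_gb_s
            and o["rel_cost"] <= budget]

print(shortlist(options, need_gb=8, need_gb_s=200, budget=5))
# ['HBM2E stack', 'GDDR6 x4']
```

Note how on-chip SRAM wins on bandwidth but fails on capacity, and LPDDR5 wins on cost but fails on bandwidth; as Greenberg says, the optimal answer shifts with every change in the requirements.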
These decisions will affect all aspects of AI/ML chips, including dedicated accelerators. The main choices there depend on performance, power, and area, and the boundaries between cloud computing and edge computing chips are clear.
Suhas Mitra, director of product marketing for Cadence's Tensilica group, said: "Cloud computing and edge computing are very different. They have similarities, but the differences are bigger. If you are designing processors for the data center cloud, then how you store data, what memory hierarchy you choose, where you place the memory, and the power and area involved all matter enormously."
For edge computing, the trade-offs get even more complex, and a fourth variable, bandwidth, joins the traditional PPA (power, performance, area) formula. Mitra said: "The discussion should really be about PPAB, because we are constantly weighing and adjusting these four factors. In AI/ML processor or accelerator design, how you resolve the power, performance, area, and bandwidth trade-offs depends largely on the nature of the workload. Fundamentally, when you talk about edge computing and think about how much performance you can get in a limited area, you also have to think about energy efficiency: given this much power, how much performance can I get? We watch these metrics constantly."
He pointed out that this is why people spend so much time on the memory interface. For processor and accelerator designers, these considerations take a different form, one tied directly to the AI workload. "How do I make sure that when my working set is small, I can do the computation very efficiently? The reason is that you cannot sacrifice too much area or too much power for compute. What configuration best serves this workload? You can look at different workloads and try to figure out what it should be: how many frames per second, how many frames per second per watt, and how many frames per second per square millimeter."
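The metrics Mitra lists are simple ratios once you have measured throughput, power, and area. Here is what computing them might look like for a hypothetical edge accelerator; the input numbers are invented for illustration.

```python
def efficiency(frames_per_s, watts, area_mm2):
    """Compute the PPAB-style metrics for an accelerator design point."""
    return {
        "fps": frames_per_s,
        "fps_per_watt": frames_per_s / watts,   # energy efficiency
        "fps_per_mm2": frames_per_s / area_mm2, # area efficiency
    }

edge = efficiency(frames_per_s=30, watts=2.0, area_mm2=10.0)
print(edge)  # {'fps': 30, 'fps_per_watt': 15.0, 'fps_per_mm2': 3.0}
```

Comparing candidate designs on these normalized ratios, rather than raw frames per second, is what lets an edge team trade a little peak performance for a lot of power or area.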
AI architectures are still evolving rapidly. Everyone is guessing when, or even whether, they will stabilize, which makes it harder to judge whether your choices are correct and how long they will hold.
"Are you on the right path? The question is clear, but there are many different answers." Mitra said. "In traditional processor design, if I design it in this way, it looks like this. Therefore, everyone designs the processor IP, and people also design some variants, such as VLIW and superscalar. . But, there will never be only one design that wins in the end. You will find that there are many designs that can win. This is almost like saying that you are given 40 options, not a solution. Looking ahead, you will see People will make more of these architectural choices because AI has many different meanings for different verticals."