The last two years have been extremely busy in the inferencing chip business. For a while, it seemed like every other week another company introduced a new and better solution. While all this innovation was great, the problem was that most companies didn’t know what to make of the various solutions because they could not tell which one performed better than another. With no set of established benchmarks in this new market, they either had to get up to speed really quickly on inference chips, or they had to believe the performance figures provided by the various vendors.
Most vendors provided some type of performance figure and usually it was whatever benchmark made them look good. Some vendors talked about TOPS and TOPS/Watt without specifying models, batch sizes or process/voltage/temperature conditions. Others used the ResNet-50 benchmark, which is a much simpler model than most people need, so its value in evaluating inference options is questionable.
We’ve come a long way from those early days. Companies have slowly figured out that what really matters when measuring the performance of inference chips is 1) high MAC utilization, 2) low power and 3) keeping everything small.
We know how to measure — what’s next?
Now that we have a fairly good idea of how to measure the performance of one inference chip against another, companies are asking what the advantages (or disadvantages) are of using multiple inference chips together in the same design. The simple answer is that using multiple inference chips, when the inference chip is designed the right way, can deliver linear increases in performance. The analogy of a highway is not far off when we look at using multiple inference chips. Does a company want the performance of a one-lane highway or a four-lane highway?
Clearly, every company wants a four-lane highway, so the question now becomes “how do we deliver this four-lane highway without creating traffic and bottlenecks?” The answer relies on choosing the right inferencing chip. To explain, let’s take a look at a neural network model.
Neural networks are broken down into layers. ResNet-50 has 50 layers, YOLOv3 has over 100, and each layer takes in an activation from the previous layer. Thus, the output of layer N is an activation that goes into layer N+1. Layer N+1 waits for that activation to come in, does its computation, and outputs an activation that goes to layer N+2. That continues through all the layers until you finally get a result. Keep in mind that the initial input in this example is an image or whatever data is being processed by the model.
When multiple chips make a difference
The reality is that if you have a chip that has a certain level of performance, there is always going to be a customer that wants twice as much performance or four times more performance. If you analyze the neural network model, it is possible to achieve that in some cases. You just need to look at how you split the model between either two chips or four chips.
This has been a problem with parallel processing over the years because it’s been difficult to figure out how to partition the work so that each added processor adds to performance instead of subtracting from it.
Unlike parallel processing in general-purpose computing, the nice thing with inference chips is that customers typically know in advance if they want to use two chips, so the split doesn’t have to be figured out on the fly; the compiler does it at compile time. With neural network models, everything is totally predictable, so we can analyze and figure out exactly how to split the model and whether it will run well on two chips.
To make sure the model can run on two or more chips, it is important to look at both the activation size and the number of MACs layer by layer. What typically happens is that the biggest activations are in the earliest layers. That means the activation sizes generally go down as you move deeper into the model.
It’s also important to look at the number of MACs and how many MACs are done in each cycle. In most models, the number of MACs done in each cycle generally correlates with the activation sizes. This is important because if you have two chips and you want to run at maximum frequency, you need to give equal workloads to each chip. If one chip is doing most of the model and the other chip is doing only a little of the model, you are going to be limited by the throughput of the first chip.
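To make the point about equal workloads concrete, here is a minimal sketch of a pipeline whose frame rate is set by its busiest chip. The function name, the 100 GMAC model and the 1 TMAC/s sustained rate are all hypothetical numbers chosen for illustration, and the sketch deliberately ignores the cost of passing activations between chips.

```python
def pipeline_fps(macs_per_chip, macs_per_second):
    """Frames per second of a chip pipeline, ignoring inter-chip transfers.

    Each chip is one pipeline stage, so the chip with the most MACs to do
    per frame sets the rate for the whole pipeline.
    """
    stage_seconds = [macs / macs_per_second for macs in macs_per_chip]
    return 1.0 / max(stage_seconds)


# Hypothetical figures: a 100 GMAC model, chips sustaining 1 TMAC/s each.
CHIP_RATE = 1e12
single_chip = pipeline_fps([100e9], CHIP_RATE)
balanced = pipeline_fps([50e9, 50e9], CHIP_RATE)    # even split
lopsided = pipeline_fps([90e9, 10e9], CHIP_RATE)    # first chip does 90%

print(f"one chip: {single_chip:.1f} fps")
print(f"two chips, 50/50 split: {balanced:.1f} fps")
print(f"two chips, 90/10 split: {lopsided:.1f} fps")
```

With these made-up numbers, the even split doubles the single-chip rate from 10 to about 20 frames per second, while the 90/10 split manages only about 11, barely better than one chip.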
How you split the model between the two chips is also important. You need to look at the number of MACs because that determines the distribution of the workload. You also have to look at what gets passed between the chips. You need to slice the model at a place where the activation that you pass is as small as possible so that the communications bandwidth required and the latency of the transmission are minimal. If you slice the model at a point where the activation is very large, the transmission of the activation can become the bottleneck that limits the performance of the two-chip solution.
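One way to picture this trade-off is a small search: keep only the cut points that leave the two chips roughly balanced, then pick the one that passes the smallest activation. The sketch below is an illustration under assumptions, not a description of any vendor's compiler; the function name, the 10% balance tolerance and the six-layer profile are all hypothetical.

```python
from itertools import accumulate


def best_two_chip_cut(layer_macs, activation_mb, tolerance=0.10):
    """Return i such that layers 0..i run on chip 1 and the rest on chip 2.

    Only cuts that put chip 1 within `tolerance` of half the total MACs are
    considered; among those, the smallest activation to pass wins.
    """
    total = sum(layer_macs)
    cumulative = list(accumulate(layer_macs))
    # The last layer is excluded: cutting after it leaves chip 2 with no work.
    candidates = [
        i for i in range(len(layer_macs) - 1)
        if abs(cumulative[i] / total - 0.5) <= tolerance
    ]
    if not candidates:
        raise ValueError("no cut point within the balance tolerance")
    return min(candidates, key=lambda i: activation_mb[i])


# Hypothetical six-layer profile (MACs in arbitrary units, activations in MB).
layer_macs = [10, 20, 15, 10, 25, 20]
activation_mb = [8, 4, 4, 1, 2, 1]
cut = best_two_chip_cut(layer_macs, activation_mb)
print(f"cut after layer {cut}, passing {activation_mb[cut]} MB per frame")
```

In this invented profile, cutting after layer 2 or after layer 3 both give a split close to 50/50, but the second option passes 1MB per frame instead of 4MB, so it is the one chosen.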
The chart below shows, for YOLOv3 with Winograd and 2-megapixel images, the activation output size and the cumulative MAC operations layer by layer (only convolution layers are plotted). To balance the workload between two chips, the model is cut at approximately 50% of cumulative MAC operations; at this point the activations to pass from one chip to the other are 1MB or 2MB. To split between four chips, the cuts are at approximately 25%, 50% and 75%. Notice that the activation sizes are largest at the start, so the 25% cut point has 4MB or 8MB activations to pass.
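The way cut points are read off a cumulative-MAC curve can be sketched in a few lines. The eight-layer profile below is invented for illustration and is not the YOLOv3 data in the chart; the function name is hypothetical as well.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_cuts(layer_macs, n_chips):
    """Indices of the layers after which to cut for an even MAC split.

    For each boundary k/n_chips, find the first layer whose cumulative MAC
    count reaches that fraction of the total.
    """
    cumulative = list(accumulate(layer_macs))
    total = cumulative[-1]
    return [
        bisect_left(cumulative, total * k / n_chips)
        for k in range(1, n_chips)
    ]


# Invented profile: per-layer MACs (arbitrary units) and output sizes in MB.
layer_macs = [10, 15, 10, 15, 10, 15, 15, 10]
activation_mb = [8, 8, 4, 2, 2, 1, 1, 1]

two_chip_cuts = balanced_cuts(layer_macs, 2)
four_chip_cuts = balanced_cuts(layer_macs, 4)
four_chip_traffic_mb = [activation_mb[i] for i in four_chip_cuts]

print("two chips: cut after layer", two_chip_cuts)
print("four chips: cut after layers", four_chip_cuts)
print("MB passed at each four-chip cut:", four_chip_traffic_mb)
```

Even in this toy profile, the first of the four-chip cuts lands where the activations are still large, which is the same pattern the chart shows at the 25% point.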
Activation output size (blue bars) and cumulative MAC operations layer by layer (red line) for YOLOv3/Winograd/2Mpixel images, showing how the workload is split between multiple chips (Image: Flex Logix)
Fortunately, performance tools are now available to ensure high throughput. In fact, the same tool that models the performance of a single chip can be generalized to model the performance of two chips. While the performance of any given layer is exactly the same, the issue is how the transmission of data affects performance. The modeling tool needs to factor this in because if the available bandwidth is not enough, that bandwidth will limit the throughput.
If you are using four chips, you will need more bandwidth because the activations in the first quarter of the model tend to be bigger than the activations in the later parts of the model. Thus, the more communications resources you invest in, the more chips you can pipeline together, but that is an overhead cost that every chip has to bear even when it is used standalone.
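A very rough version of such a model treats the chip-to-chip link as one more pipeline stage. The sketch below assumes compute and transfer overlap fully, which a real tool would have to verify, and every name and number in it is hypothetical.

```python
def two_chip_fps(macs_chip1, macs_chip2, macs_per_second,
                 activation_bytes, link_bytes_per_second):
    """Frames per second for two pipelined chips joined by one link.

    Assumes compute and transfer overlap, so the slowest of the three
    stages (chip 1, the link, chip 2) sets the frame rate.
    """
    compute1 = macs_chip1 / macs_per_second
    compute2 = macs_chip2 / macs_per_second
    transfer = activation_bytes / link_bytes_per_second
    return 1.0 / max(compute1, compute2, transfer)


# Hypothetical figures: 50 GMAC per chip at 1 TMAC/s, 2 MB passed per frame.
fast_link = two_chip_fps(50e9, 50e9, 1e12, 2e6, 1e9)    # 1 GB/s link
slow_link = two_chip_fps(50e9, 50e9, 1e12, 2e6, 20e6)   # 20 MB/s link

print(f"1 GB/s link: {fast_link:.1f} fps")
print(f"20 MB/s link: {slow_link:.1f} fps")
```

With the fast link the transfer takes 2 ms against 50 ms of compute, so the chips set the pace at about 20 frames per second. With the slow link the transfer takes 100 ms and the pair drops to about 10 frames per second, no better than a single chip running the whole 100 GMAC model.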
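The bandwidth argument comes down to simple arithmetic: each frame pushes one activation across every cut, so the busiest link has to carry the largest cut activation at the target frame rate. The function name, the 30 frames per second target and the activation sizes below are hypothetical values picked to be in the same range as the figures quoted earlier.

```python
def required_link_bandwidth(cut_activation_bytes, target_fps):
    """Bytes per second the busiest chip-to-chip link must sustain.

    If every link is built the same, it must be sized for the largest
    activation crossing any cut, once per frame.
    """
    return max(cut_activation_bytes) * target_fps


MB = 1_000_000
TARGET_FPS = 30  # hypothetical requirement

two_chip = required_link_bandwidth([2 * MB], TARGET_FPS)
four_chip = required_link_bandwidth([8 * MB, 2 * MB, 1 * MB], TARGET_FPS)

print(f"two chips: {two_chip / MB:.0f} MB/s per link")
print(f"four chips: {four_chip / MB:.0f} MB/s per link")
```

Under these assumptions the four-chip pipeline needs links four times as fast as the two-chip one (240MB/s against 60MB/s), purely because its first cut falls where the activations are largest.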
Using multiple inferencing chips can deliver significant improvements in performance, but only when the neural network is designed correctly as described above. If we look back to the analogy of the highway, there are many opportunities to let traffic build up by using the wrong chip and the wrong neural network model. If you start with the right chip, you are on the right track. Just remember that throughput, not TOPS or ResNet-50 benchmarks, is what matters most. Then once you select the right inference chip, you can design an equally powerful neural network model that provides maximum performance for your application needs.
— Geoff Tate is the CEO of Flex Logix
>> This article was originally published on our sister site, EE Times.