Cerebras has shown off the capabilities of its second–generation wafer–scale engine, announcing it has set the record for the largest AI model ever trained on a single device.
For the first time, a natural language processing network with 20 billion parameters, GPT–NeoX 20B, was trained on a single device. Here’s why that matters.
Why do we need to train models this big?
A new type of neural network, the transformer, is taking over. Today, transformers are mainly used for natural language processing (NLP), where their attention mechanism can help spot the relationship between words in a sentence, but they are spreading to other AI applications, including vision. In general, the bigger a transformer is, the more accurate it tends to be. Language models now routinely have billions of parameters, and they are growing rapidly with no sign of slowing down.
One key area for huge transformers is medical research, in applications such as epigenomics, where they are used to model the “language” of genes: DNA sequences.
Why does it matter that this was done on a single device?
Huge models today are mostly trained on multi–processor systems, usually built from GPUs. Cerebras says its customers have found partitioning huge models across hundreds of processors to be a time–consuming process. The partitioning is unique to each model and each specific multi–processor system: it depends on the model’s properties, the characteristics of each processor (i.e., what kind of processor it is and how much memory it has), and the characteristics of the I/O network. This work is not portable to other models or systems.
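A rough memory estimate shows why a model of this size usually has to be partitioned in the first place. The sketch below is a back-of-envelope calculation, not a Cerebras figure. It assumes mixed-precision training with the Adam optimizer at roughly 16 bytes per parameter (a commonly used rule of thumb covering weights, gradients and optimizer state), ignores activation memory entirely, and compares against a hypothetical accelerator with 80 GB of memory.

```python
import math


def training_memory_gb(num_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory for weights, gradients and optimizer state, in GB (10^9 bytes).

    bytes_per_param=16 is an assumed rule of thumb for mixed-precision Adam;
    activation memory is not included.
    """
    return num_params * bytes_per_param / 1e9


def min_devices(num_params: float, device_memory_gb: float,
                bytes_per_param: int = 16) -> int:
    """Smallest number of devices whose combined memory holds the training state."""
    needed = training_memory_gb(num_params, bytes_per_param)
    return math.ceil(needed / device_memory_gb)


params_20b = 20e9          # 20 billion parameters, as in GPT-NeoX 20B
device_memory_gb = 80.0    # hypothetical accelerator, for comparison only

print(training_memory_gb(params_20b))             # 320.0
print(min_devices(params_20b, device_memory_gb))  # 4
```

Under these assumptions the training state alone outgrows a single conventional accelerator several times over, before activations or communication buffers are counted, which is why the model has to be split.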
Typically for multi–processor systems, there are three types of parallelism at play:
- If the model fits on a single processor, it can be duplicated onto other processors and each one trained with subsets of the data — this is called data parallelism, which is relatively straightforward.
- If the model doesn’t fit on one processor, the model can be split between processors with one or more layers running on each — this is called pipelined model parallelism. However, the layers need to run sequentially, so the user has to manually evaluate how much memory and I/O will be required for each layer to make sure there are no bottlenecks. It’s more complicated than data parallelism.
- If a layer of the model is so huge that it doesn’t fit on one processor, things get more complicated still. Tensor model parallelism must be used to split layers across processors, adding another dimension of complexity that also strains memory and I/O bandwidth.
Huge models, such as the GPT–NeoX 20B in Cerebras’ announcement, require all three types of parallelism for training.
A breakdown of the types of parallelism used to train huge models today (Source: Cerebras)
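The three schemes described above can be illustrated on a toy two-layer network. This is a minimal NumPy sketch of the forward pass only, with the "processors" simulated as array slices on one machine. The layer sizes, shard counts and function names are assumptions made for illustration, and a real system would also have to synchronize gradients and move data over the I/O network.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))    # batch of 8 samples, 4 features each
w1 = rng.normal(size=(4, 6))   # layer 1 weights
w2 = rng.normal(size=(6, 3))   # layer 2 weights


def forward(batch, w1, w2):
    """Reference: the whole model on one processor (linear, ReLU, linear)."""
    return np.maximum(batch @ w1, 0.0) @ w2


reference = forward(x, w1, w2)

# Data parallelism: every worker holds a full copy of the model
# and processes its own slice of the batch.
data_parallel = np.concatenate(
    [forward(shard, w1, w2) for shard in np.split(x, 2)]
)


# Pipelined model parallelism: worker A holds layer 1, worker B holds layer 2.
# Micro-batches flow through the stages in sequence.
def stage_a(batch):
    return np.maximum(batch @ w1, 0.0)


def stage_b(hidden):
    return hidden @ w2


pipelined = np.concatenate(
    [stage_b(stage_a(micro)) for micro in np.split(x, 4)]
)

# Tensor model parallelism: each layer is itself split across two workers.
# Layer 1 is split by columns, layer 2 by the matching rows, and the
# partial outputs are summed (the role an all-reduce plays in practice).
w1_parts = np.split(w1, 2, axis=1)
w2_parts = np.split(w2, 2, axis=0)
partials = [np.maximum(x @ a, 0.0) @ b for a, b in zip(w1_parts, w2_parts)]
tensor_parallel = partials[0] + partials[1]

print(np.allclose(data_parallel, reference))    # True
print(np.allclose(pipelined, reference))        # True
print(np.allclose(tensor_parallel, reference))  # True
```

All three produce the same output as the single-processor reference. What differs is where the weights live and how much data has to move between workers, which is the part that has to be tuned by hand for each model and system.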
Cerebras’ CS–2 avoids the need to parallelize the model, partly because of its processor’s sheer size — it is effectively one huge 850,000–core processor on a single wafer–sized chip big enough for even the biggest network layers — and partly because Cerebras has disaggregated memory from compute. More memory can be added to support more parameters without needing to add more compute, keeping the architecture of the compute part of the system the same.
Without the need for parallelism, there is no need to spend time and resources manually partitioning models to run on multi–processor systems. Further, without the bespoke part of the process, models become portable. Changing between GPT models with different parameter counts involves changing merely four variables in one file. Similarly, changing between GPT–J and GPT–Neo took only a few keystrokes. According to Cerebras, this can save months of engineering time.
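To give a sense of how few numbers separate one GPT-style model from another, here is a hypothetical configuration sketch. The article does not say which four variables Cerebras means, so the field names and the choice of fields below are assumptions and do not represent Cerebras' file format. The layer counts, widths, head counts and vocabulary sizes are the commonly cited ones for GPT-J 6B and GPT-NeoX 20B, and the parameter count uses the standard approximation of 12 × hidden_size² per layer plus token embeddings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GPTConfig:
    """Hypothetical model config; field names are illustrative only."""
    num_layers: int
    hidden_size: int
    num_heads: int
    vocab_size: int

    def approx_params(self) -> int:
        # Per layer: about 4*d^2 for attention plus 8*d^2 for the MLP.
        # Biases, layer norms and any untied output head are ignored.
        per_layer = 12 * self.hidden_size ** 2
        embeddings = self.vocab_size * self.hidden_size
        return self.num_layers * per_layer + embeddings


gpt_j_6b = GPTConfig(num_layers=28, hidden_size=4096,
                     num_heads=16, vocab_size=50400)

# Moving to the larger model means changing four values, nothing else.
gpt_neox_20b = replace(gpt_j_6b, num_layers=44, hidden_size=6144,
                       num_heads=64, vocab_size=50432)

print(f"{gpt_j_6b.approx_params() / 1e9:.1f}B")      # 5.8B
print(f"{gpt_neox_20b.approx_params() / 1e9:.1f}B")  # 20.2B
```

On a multi–processor system, the same change in size would typically also require the partitioning to be reworked. That is the engineering effort Cerebras says its approach avoids.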
What are the implications for the wider industry?
NLP models have grown so large that, in practice, only a handful of companies have adequate resources — in terms of both the cost of compute and engineering time — to train them.
Cerebras hopes that by making its CS–2 system available in the cloud, as well as by helping customers reduce the amount of engineering time and resources needed, it can open up huge model training to many more companies, even those without huge system engineering teams. This includes accelerating scientific and medical research as well as NLP.
A single CS–2 can train models with hundreds of billions or even trillions of parameters, so there is plenty of scope for tomorrow’s huge networks as well as today’s.
Does Cerebras have real–world examples?
Biopharmaceutical company AbbVie is using a CS–2 for its biomedical NLP transformer training, which powers the company’s translation service to make vast libraries of biomedical literature searchable across 180 languages.
“A common challenge we experience with programming and training BERT LARGE models is providing sufficient GPU cluster resources for sufficient periods of time,” said Brian Martin, head of AI at AbbVie, in a statement. “The CS–2 system will provide wall–clock improvements that alleviate much of this challenge, while providing a simpler programming model that accelerates our delivery by enabling our teams to iterate more quickly and test more ideas.”
GlaxoSmithKline used the first–generation Cerebras system, the CS–1, for its epigenomics research. The system enabled training a network with a dataset that otherwise would have been prohibitively large.
“GSK generates extremely large datasets through its genomic and genetic research, and these datasets require new equipment to conduct machine learning,” said Kim Branson, SVP of Artificial Intelligence and Machine Learning at GSK, in a statement. “The Cerebras CS–2 is a critical component that allows GSK to train language models using biological datasets at a scale and size previously unattainable. These foundational models form the basis of many of our AI systems and play a vital role in the discovery of transformational medicines.”
Other Cerebras users include TotalEnergies, which uses a CS–2 to speed up simulations of batteries, biofuels, wind flow, drilling, and CO₂ storage; the National Energy Technology Laboratory, which accelerates physics–based computational fluid dynamics with a CS–2; and Argonne National Laboratory, which has been using a CS–1 for Covid–19 and cancer drug research.
>> This article was originally published on our sister site, EE Times.