Inside BLOOM: How Thousands of AI Researchers Created an Open Source ChatGPT Alternative
An open-source LLM shows that tech incumbents are not the only companies able to create massive models.

When we think about large language model (LLM) alternatives to ChatGPT, we tend to think of projects from large AI labs or ultra-well-financed startups. But what happens when a large number of AI researchers collaborate to make LLMs available to mainstream researchers? The result is BLOOM, an open-source, 176-billion-parameter LLM able to master tasks in 46 natural languages and 13 programming languages.

The development of BLOOM was coordinated by BigScience, a vibrant open research collaboration with a mission to publicly release an LLM. The project came to life after being awarded a computing grant by GENCI on its Jean Zay supercomputer at IDRIS/CNRS. Founded by Hugging Face and the French NLP community, the project soon attracted a diverse international collaboration aiming to support linguistic, geographical, and scientific diversity. Over 1,200 participants from 38 countries, including experts in machine learning, computer science, linguistics, statistics, socio-cultural anthropology, philosophy, law, and other fields, registered with BigScience and were given access to its communication channels.

The BigScience effort was structured into 30 working groups, reflecting the research questions it tackled. Each working group comprised several participants with various levels of involvement, plus chairs tasked with self-organizing around specific aspects of the overall project. Participants were encouraged to join multiple working groups to share experiences and information, leading to a dynamic and collaborative environment. The majority of the working groups focused on tasks directly linked to the development of BLOOM.
BLOOM

The BLOOM architecture is based on a causal decoder-only Transformer. This type of architecture is standard for LLMs above 100B parameters, as it has shown the best performance at that scale. Beyond the choice of architecture, BLOOM introduced a couple of key modifications to standard causal-decoder models.

I. ALiBi Positional Embeddings: Instead of adding positional information at the embedding layer, ALiBi directly attenuates the attention scores based on the distance between keys and queries. The initial motivation for ALiBi was its ability to extrapolate to longer sequences, but the researchers found that it also led to smoother training and improved downstream performance, outperforming both learned and rotary embeddings.

II. Embedding LayerNorm: In preliminary experiments on a 104B-parameter model, the team tried an additional layer normalization immediately after the embedding layer and found that it significantly improved training stability. Based on this finding, BigScience decided to train BLOOM with an additional layer normalization after the embedding layer to avoid training instabilities. It is worth noting that the preliminary experiments were conducted in float16, while the final training was in bfloat16. Since then, float16 has been identified as the cause of many observed instabilities in training LLMs, and it is possible that bfloat16 alleviates the need for the embedding LayerNorm.

BLOOM was trained on the ROOTS corpus, which comprises 498 Hugging Face datasets covering 46 natural languages and 13 programming languages. The training process includes data sourcing and processing stages. From the infrastructure standpoint, BLOOM was brought to life through Megatron-DeepSpeed, a cutting-edge framework for large-scale distributed training.
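The ALiBi idea above can be sketched in a few lines of NumPy: each attention head gets a fixed slope, and the attention logits receive a penalty that grows linearly with the query-key distance. This is a minimal illustration of the bias computation, not BLOOM's actual implementation.

```python
import numpy as np

def alibi_slopes(n_heads):
    # Per-head slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    start = 2 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    # Signed distance j - i between key position j and query position i;
    # for past keys (j < i) this is negative, so the bias is a penalty
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]
    # Broadcast each head's slope over the (seq_len, seq_len) distance matrix;
    # the result is added to the attention logits before the softmax
    return alibi_slopes(n_heads)[:, None, None] * dist[None, :, :]
```

In a causal model only the lower triangle (j ≤ i) of this bias matrix is ever used, and no positional embeddings are added to the token embeddings at all, which is what allows extrapolation to sequences longer than those seen in training.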
This framework is a fusion of two parts: Megatron-LM, which provides the Transformer implementation, tensor parallelism, and data-loading primitives, and DeepSpeed, which brings the ZeRO optimizer, model pipelining, and general distributed-training components. Megatron-DeepSpeed enables efficient training with 3D parallelism, a combination of three complementary approaches to distributed deep learning:

· Data parallelism (DP): replicates the model multiple times and places each replica on a different device, where it is fed a slice of the data. Processing happens in parallel, and all model replicas are synchronized at the end of each training step.

· Tensor parallelism (TP): partitions individual layers of the model across multiple devices. Instead of having a whole activation or gradient tensor reside on a single GPU, shards of the tensor are placed on separate GPUs. This is sometimes called horizontal or intra-layer model parallelism.

· Pipeline parallelism (PP): splits the model's layers across multiple GPUs so that only a fraction of the layers is placed on each GPU. This technique is sometimes called vertical parallelism.

Finally, the Zero Redundancy Optimizer (ZeRO) allows different processes to hold only a fraction of the data (parameters, gradients, and optimizer states) required for a training step. The team used ZeRO stage 1, meaning that only the optimizer states are sharded in this manner. With the combination of these four components, BLOOM was able to scale to hundreds of GPUs with extremely high GPU utilization, achieving 156 TFLOPs per GPU in the fastest configuration with A100 GPUs and hitting the team's performance objective.

A Special LLM

BLOOM is a very special model in the LLM space.
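The ZeRO stage 1 partitioning described above can be illustrated with a toy bookkeeping sketch (this is not DeepSpeed's actual implementation, and it assumes an Adam-style optimizer with two fp32 moment buffers, i.e. 8 bytes of state per parameter):

```python
import numpy as np

def zero1_shards(n_params, world_size, bytes_per_param_state=8):
    # ZeRO stage 1: parameters and gradients stay replicated on every rank,
    # but the optimizer states (e.g. Adam's two fp32 moment buffers,
    # 2 * 4 = 8 bytes per parameter) are partitioned across the ranks.
    bounds = np.linspace(0, n_params, world_size + 1, dtype=np.int64)
    return [
        {
            "rank": r,
            "param_range": (int(bounds[r]), int(bounds[r + 1])),
            "state_bytes": int(bounds[r + 1] - bounds[r]) * bytes_per_param_state,
        }
        for r in range(world_size)
    ]
```

Each rank's optimizer-state memory shrinks roughly by a factor of the world size, which is what makes Adam's state affordable at the 176B scale, while parameters and gradients are still handled by the DP/TP/PP dimensions.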
It shows that LLMs are not the exclusive domain of large AI labs, and that when a large community of AI researchers comes together, magical things can happen.