Inside BLOOM: How Thousands of AI Researchers Created an Open Source ChatGPT Alternative
An open-source LLM shows that tech incumbents are not the only companies able to create massive models.

When we think about large language model (LLM) alternatives to ChatGPT, we tend to think of projects from large AI labs or ultra-well-financed startups. But what happens when a large number of AI researchers collaborate to make LLMs available to mainstream researchers? The result is BLOOM, an open source 176-billion-parameter LLM able to master tasks in 46 languages and 13 programming languages.

The development of BLOOM was coordinated by BigScience, a vibrant open research collaboration with a mission to publicly release an LLM. The project was brought to life after being awarded a computing grant by GENCI on its Jean Zay supercomputer at IDRIS/CNRS. Founded by Hugging Face and the French NLP community, the project soon grew into a diverse international collaboration with the goal of supporting linguistic, geographical, and scientific diversity. Over 1,200 participants from 38 countries, including experts in machine learning, computer science, linguistics, statistics, socio-cultural anthropology, philosophy, law, and other fields, registered with BigScience and were given access to its communication channels.

The BigScience effort was structured into 30 working groups, reflecting the research questions being tackled. Each group comprised participants with various levels of involvement, plus chairs tasked with self-organizing around a specific aspect of the overall project. Participants were encouraged to join multiple working groups to share experience and information, leading to a dynamic and collaborative environment. The majority of the working groups focused on tasks directly linked to the development of BLOOM.
BLOOM

The BLOOM architecture is based on a causal-decoder transformer model. This type of architecture is fairly standard for LLMs above 100B parameters, as it has shown the best performance at that scale. Beyond the choice of architecture, BLOOM introduced a couple of key innovations to standard causal-decoder models.

I. ALiBi Positional Embeddings: Instead of adding positional information to the embedding layer, ALiBi directly attenuates the attention scores based on the distance between keys and queries. The initial motivation for ALiBi was its ability to extrapolate to longer sequences, but the researchers found that it also led to smoother training and improved downstream performance, outperforming both learned and rotary embeddings.

II. Embedding LayerNorm: In preliminary experiments on a 104B-parameter model, the team tried an additional layer normalization immediately after the embedding layer and found that it significantly improved training stability. Consequently, BigScience decided to train BLOOM with this extra layer normalization to avoid training instabilities. It is worth noting that the preliminary experiments were conducted in float16, while the final training was in bfloat16. Since then, float16 has been identified as the cause of many observed instabilities in training LLMs, and it is possible that bfloat16 alleviates the need for the embedding LayerNorm.

BLOOM was trained on the ROOTS corpus, which combines 498 Hugging Face datasets covering 46 natural languages and 13 programming languages. The training pipeline included dedicated data sourcing and data processing stages. From the infrastructure standpoint, BLOOM was brought to life through Megatron-DeepSpeed, a cutting-edge framework for large-scale distributed training.
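To make the ALiBi idea concrete, here is a minimal numpy sketch of the bias it adds to attention scores: zero on the diagonal and increasingly negative as a key gets farther from the query, with a per-head slope drawn from a geometric sequence. This is an illustrative reconstruction of the published scheme, not BLOOM's actual implementation.

```python
import numpy as np

def alibi_slopes(n_heads):
    # Per-head slopes form a geometric sequence starting at 2^(-8/n_heads),
    # the scheme the ALiBi paper describes for power-of-two head counts.
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    # Linear bias added to attention scores before the softmax.
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]         # (q, k): j - i
    slopes = alibi_slopes(n_heads)[:, None, None]  # (heads, 1, 1)
    return slopes * distance                       # (heads, q, k)

bias = alibi_bias(n_heads=8, seq_len=5)
# For causal positions j <= i, attention becomes:
#   scores = q @ k.T / sqrt(d_head) + bias   (then masked softmax)
```

Because the bias grows linearly with distance, it applies unchanged to sequences longer than those seen in training, which is what gives ALiBi its extrapolation ability.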
This framework is a fusion of two parts: Megatron-LM, which provides the Transformer implementation, tensor parallelism, and data loading primitives, and DeepSpeed, which brings the ZeRO optimizer, model pipelining, and general distributed training components. Megatron-DeepSpeed enables efficient training with 3D parallelism, a combination of three complementary approaches to distributed deep learning:

· Data parallelism (DP): replicates the model multiple times and places each replica on a different device, where it is fed a slice of the data. Processing happens in parallel, and all model replicas are synchronized at the end of each training step.

· Tensor parallelism (TP): partitions individual layers of the model across multiple devices. Instead of having a whole activation or gradient tensor reside on a single GPU, shards of that tensor are placed on separate GPUs. This is sometimes called horizontal or intra-layer model parallelism.

· Pipeline parallelism (PP): splits the model's layers across multiple GPUs so that only a fraction of the layers is placed on each GPU. This technique is sometimes called vertical parallelism.

Finally, the Zero Redundancy Optimizer (ZeRO) allows different processes to hold only a fraction of the data (parameters, gradients, and optimizer states) required for a training step. The team used ZeRO stage 1, meaning that only the optimizer states are sharded in this manner. With the combination of these four components, BLOOM was able to scale to hundreds of GPUs with extremely high GPU utilization, reaching 156 TFLOPs per GPU in its fastest configuration with A100 GPUs.

A Special LLM

BLOOM is a very special model in the LLM space.
It shows that LLMs are not the exclusive domain of large AI labs, and that when a large community of AI researchers comes together, magical things can happen!
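As a closing aside on the parallelism discussion above, the core idea behind tensor parallelism fits in a few lines of numpy: split a layer's weight matrix column-wise across "devices", compute each output shard independently, then gather. This is an illustrative sketch only; BLOOM relies on Megatron-LM's GPU implementation of the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations: (batch, d_model)
w = rng.standard_normal((8, 6))   # full layer weight: (d_model, d_out)

# Each "GPU" holds only half of the weight matrix.
w_shard_0, w_shard_1 = np.split(w, 2, axis=1)

y_shard_0 = x @ w_shard_0         # computed in parallel ...
y_shard_1 = x @ w_shard_1         # ... on separate devices

# All-gather the output shards back into the full activation.
y = np.concatenate([y_shard_0, y_shard_1], axis=1)

assert np.allclose(y, x @ w)      # matches the unsharded computation
```

Because each device stores and multiplies only its shard, a layer too large for one GPU's memory can still be computed exactly, at the cost of a communication step to gather the results.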