Inside BLOOM: How Thousands of AI Researchers Created an Open Source ChatGPT Alternative
An open-source LLM shows that tech incumbents are not the only organizations able to create massive models.

When we think about large language model (LLM) alternatives to ChatGPT, we tend to think about projects from large AI labs or ultra-well-financed startups. But what happens when a large number of AI researchers decide to collaborate to make LLMs available to mainstream researchers? The result is BLOOM, an open-source, 176-billion-parameter LLM that can master tasks in 46 natural languages and 13 programming languages.

The development of BLOOM was coordinated by BigScience, a vibrant open research collaboration whose mission was to publicly release an LLM. The project came to life after being awarded a computing grant by GENCI on its Jean Zay supercomputer at IDRIS/CNRS. Founded by Hugging Face and the French NLP community, it soon grew into a diverse international collaboration with the goal of supporting linguistic, geographical, and scientific diversity. Over 1,200 participants from 38 countries, including experts in machine learning, computer science, linguistics, statistics, socio-cultural anthropology, philosophy, law, and other fields, registered with BigScience and were given access to its communication channels.

The BigScience effort was structured into 30 working groups, mirroring the research questions it tackled. Each group comprised several participants with various levels of involvement, plus chairs tasked with self-organizing around specific aspects of the overall project. Participants were encouraged to join multiple working groups to share experience and information, leading to a dynamic and collaborative environment. The majority of the working groups focused on tasks directly linked to the development of BLOOM.

BLOOM

The BLOOM architecture is based on a causal decoder-only Transformer. This type of architecture is fairly standard for LLMs above 100B parameters, as it has shown the best performance at that scale. Beyond the choice of architecture, BLOOM introduced two key modifications to standard causal-decoder models, both sketched in code below.

I. ALiBi Positional Embeddings: Instead of adding positional information to the embedding layer, ALiBi directly attenuates the attention scores based on the distance between keys and queries. The initial motivation for ALiBi was its ability to extrapolate to longer sequences, but the researchers found that it also led to smoother training and improved downstream performance, outperforming both learned and rotary embeddings.

II. Embedding LayerNorm: In preliminary experiments on a 104B-parameter model, the team added a layer normalization immediately after the embedding layer and found that it significantly improved training stability. Although ablations suggested this extra normalization can penalize zero-shot generalization, BigScience nonetheless trained BLOOM with it to avoid training instabilities. It is worth noting that the preliminary experiments were conducted in float16, while the final training was in bfloat16. Since then, float16 has been identified as the cause of many observed instabilities in training LLMs, and it is possible that bfloat16 alleviates the need for the embedding LayerNorm.
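To make both tweaks concrete, here is a minimal PyTorch sketch (illustrative names and shapes, not BLOOM's actual implementation): an embedding module with the extra LayerNorm, and a function producing the ALiBi attention bias.

import math
import torch
import torch.nn as nn

class EmbeddingWithLayerNorm(nn.Module):
    """Token embedding followed by the extra LayerNorm BLOOM adds for
    training stability. There is no positional embedding here: with
    ALiBi, position information enters through the attention scores."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.norm(self.embed(token_ids))

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Static, head-specific linear penalty on attention scores,
    proportional to the key-query distance. Slopes follow the geometric
    sequence from the ALiBi paper (power-of-two head counts)."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(max=0)  # rel[i, j] = j - i for past keys, 0 otherwise
    return slopes[:, None, None] * rel[None, :, :]    # (heads, query, key), more negative with distance

def attention_scores(q: torch.Tensor, k: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """q, k: (heads, seq, head_dim); bias: (heads, seq, seq). The bias is
    added to the raw scores before the causal mask and softmax."""
    return q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1]) + bias

Note that the model learns nothing for positions: the bias is fixed, which is what lets ALiBi extrapolate to sequences longer than those seen in training.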
BLOOM was trained on the ROOTS corpus, which aggregates 498 Hugging Face datasets covering 46 natural languages and 13 programming languages. The training pipeline included dedicated data sourcing and processing stages.

From an infrastructure standpoint, BLOOM was brought to life through Megatron-DeepSpeed, a framework for large-scale distributed training. The framework fuses two parts: Megatron-LM, which provides the Transformer implementation, tensor parallelism, and data-loading primitives, and DeepSpeed, which contributes the ZeRO optimizer, model pipelining, and general distributed-training components. Megatron-DeepSpeed enables efficient training with 3D parallelism, a combination of three complementary approaches to distributed deep learning (a toy sketch of the resulting GPU grid appears at the end of this piece):

· Data parallelism (DP): replicates the model multiple times, placing each replica on a different device and feeding it a slice of the data. Processing happens in parallel, and all model replicas are synchronized at the end of each training step.

· Tensor parallelism (TP): partitions individual layers of the model across multiple devices. Instead of a whole activation or gradient tensor residing on a single GPU, shards of the tensor are placed on separate GPUs; this is also known as horizontal or intra-layer model parallelism.

· Pipeline parallelism (PP): splits the model's layers across multiple GPUs so that each GPU holds only a fraction of them. This technique is sometimes called vertical parallelism.

Finally, the Zero Redundancy Optimizer (ZeRO) lets each process hold only a fraction of the data (parameters, gradients, and optimizer states) required for a training step. The team used ZeRO stage 1, so only the optimizer states are sharded this way. Combining these four components, BLOOM scaled to hundreds of GPUs with very high utilization, reaching 156 TFLOPs per GPU in its fastest configuration on A100 GPUs and meeting the team's throughput objective.

A Special LLM

BLOOM is a very special model in the LLM space. It shows that LLMs are not the exclusive domain of large AI labs, and that when a large community of AI researchers comes together, remarkable things can happen.
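As promised above, here is a toy Python sketch of the 3D-parallelism arithmetic: mapping a flat GPU rank onto data-, pipeline-, and tensor-parallel coordinates. The rank ordering (tensor-parallel ranks innermost, so they share a node's fast interconnect) is a common convention and an assumption here, as are the exact parallelism degrees; this is not Megatron-DeepSpeed's actual topology code.

def rank_to_3d_coords(rank: int, tp: int, pp: int, dp: int):
    """Map a flat GPU rank to (dp_rank, pp_rank, tp_rank).

    dp_rank: which model replica this GPU belongs to (data parallelism)
    pp_rank: which pipeline stage (vertical slice of layers) it holds
    tp_rank: which intra-layer shard (horizontal slice) it holds
    """
    assert 0 <= rank < tp * pp * dp, "rank outside the tp x pp x dp grid"
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank

if __name__ == "__main__":
    # Hypothetical degrees: 4-way tensor x 12-way pipeline x 8-way data
    # parallelism fills 4 * 12 * 8 = 384 GPUs. With ZeRO stage 1, each of
    # the 8 data-parallel peers would additionally hold only 1/8 of the
    # optimizer states for the parameters it replicates.
    TP, PP, DP = 4, 12, 8
    for r in (0, 3, 4, 47, 48, 383):
        print(f"rank {r:3d} -> dp/pp/tp = {rank_to_3d_coords(r, TP, PP, DP)}")

Every GPU thus belongs to exactly one group of each kind: gradients are all-reduced within its DP group, activations flow between adjacent PP stages, and each layer's matrix multiplications are sharded across its TP group.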