The Sequence Chat: Hugging Face's Leandro von Werra on StarCoder and Code Generating LLMs
StarCoder is one of the most ambitious code generation foundation models released in recent times.

👤 Quick bio
I originally did a master's degree in physics focusing on astrophysics, but around that time I noticed the breakthroughs happening in ML, so I decided to switch the focus of my studies towards ML. After finishing my master's thesis in ML for precision medicine, I joined a start-up as a data scientist, where I worked on a wide range of industry projects. This is also where I met Lewis Tunstall, and as language models such as BERT and GPT-2 started taking off we decided to start working on a textbook about transformer models and the Hugging Face ecosystem. When we reached out to Hugging Face we met Thomas Wolf, the Chief Science Officer and co-founder at Hugging Face, who joined the project as a co-author. As the book came to an end, Lewis and I joined Hugging Face. Since then, I have worked as a Machine Learning Engineer on both the open-source and research teams, working on projects such as Evaluate, TRL, and CodeParrot, and recently co-leading the BigCode research collaboration.

🛠 ML Work
Codex and Copilot had a big impact on the community and developers and can lead to a significant productivity boost in professional workflows. However, those models are closed source, so you cannot freely adapt them to your use case or experiment with fine-tuning, and you need to send your data to an external API. In addition, there are big open questions around data governance: what data the models were trained on, which licenses were included, how to attribute sources, and what happens if you want to be excluded. There are several open models, but they lack Copilot's performance and also don't fully disclose how their datasets were created and filtered. The goal of BigCode, and subsequently StarCoder, was to address these issues and produce a high-performance code model with clear data governance structures. The project is a spiritual successor to BigScience and is run as an open research collaboration that any research or industry expert can join.
StarCoderBase is trained on 80+ programming languages for 1T tokens. Since a lot of developers work in Python, we continued training for about 35B tokens (~3% of the full training) on the Python subset to obtain StarCoder, which led to a significant performance boost. Surprisingly, it also led to a performance increase in some other languages such as R or Swift. On the other hand, we found that StarCoderBase can be prompted to act as a tech assistant quite effectively: by simply adding a few example conversations to the context (see the TA prompt), you can ask StarCoderBase to help you solve programming-related questions. StarChat (alpha) is even better at that since it was specifically fine-tuned on conversations and instructions.
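As a rough illustration (not from the interview), the sketch below shows how one might query StarCoder for a code completion with the Hugging Face transformers library; the checkpoint id bigcode/starcoder is the published model, while the prompt and generation settings are arbitrary examples.

```python
# Minimal sketch (assumed usage): asking StarCoder for a code completion.
# "bigcode/starcoder" is the published checkpoint; prompt/settings are examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # gated model: requires accepting the license on the Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")  # needs `accelerate`

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```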
The data curation probably made up 60-80% of the whole project. There were two main ingredients to creating a good pretraining dataset. First, we applied strong near-deduplication, where similar files are removed. It might sound counterintuitive, but strongly near-deduplicating the dataset first allows you to safely train for a few epochs without performance degradation. Second, for each file extension we examined at least 100 samples and derived heuristics to exclude low-quality files (e.g., data files or auto-generated files). In addition, we labelled a PII (personally identifiable information) dataset for code to train a PII detector. At that scale, even applying that PII model to the whole dataset required several hundred GPU hours. We also excluded code files from users who had opted out of the dataset. Finally, we trained the model on 512 A100 GPUs for 24 days. The training was extremely smooth: we had some restarts due to hardware failures, but those were mostly handled automatically. Training at that scale with modern tools such as Megatron and BF16 precision is very smooth.
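The actual BigCode deduplication pipeline is considerably more elaborate, but a toy sketch of MinHash-based near-deduplication (using the datasketch package, with an arbitrary similarity threshold) conveys the idea:

```python
# Toy sketch of MinHash-based near-deduplication (the real BigCode pipeline
# is more elaborate); requires the `datasketch` package.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace-delimited tokens."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

def near_deduplicate(files: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Return the file keys that survive near-deduplication."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for key, content in files.items():
        sig = minhash(content)
        if not lsh.query(sig):  # no sufficiently similar file seen yet
            lsh.insert(key, sig)
            kept.append(key)
    return kept
```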
Indeed, we also found that Jupyter notebooks are a treasure trove of interesting data with lots of tutorials and examples. We parsed the notebooks in two ways:
- We converted the notebooks to source code, where the markdown cells become code comments.
- We parsed the notebooks into a structured format where the cells become text-code-output-text chains separated by special tokens.
The second format also allows us to easily provide the whole notebook as context (incl. cell outputs) for code completion in Jupyter notebooks (see this Jupyter plugin).
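As a hedged sketch of the first parsing strategy (the exact BigCode scripts differ), a notebook can be flattened to plain source code with nbformat, turning markdown cells into comments:

```python
# Rough sketch (assumed, not the exact BigCode script): convert a Jupyter
# notebook into plain source code, turning markdown cells into comments.
import nbformat

def notebook_to_script(path: str) -> str:
    nb = nbformat.read(path, as_version=4)
    parts = []
    for cell in nb.cells:
        if cell.cell_type == "markdown":
            # Markdown cells become code comments.
            parts.append("\n".join(f"# {line}" for line in cell.source.splitlines()))
        elif cell.cell_type == "code":
            parts.append(cell.source)
    return "\n\n".join(parts)
```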
Indeed, in a sense StarCoder is a combination of the best available techniques, and most of the performance can probably be attributed to careful work on the dataset. The architecture goal was to make the model easy to use and deploy and to fulfill users' needs: fast inference, cheap generation, long contexts, and infilling using context from both sides. To achieve these goals, we trained a moderately sized but fast model ("just" 15B) with multi-query attention (MQA) to make generation cheap, implemented Flash Attention to train with a context window of 8,192 tokens, and used the Fill-in-the-Middle (FIM) objective in addition to the normal autoregressive language modeling objective.
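To make the Fill-in-the-Middle objective concrete, here is a small sketch of how a FIM prompt is typically assembled for StarCoder using its special tokens; the token names below follow the StarCoder model card, but verify them against the tokenizer before relying on this format:

```python
# Hedged sketch of fill-in-the-middle (FIM) prompting with StarCoder.
# Token names assumed from the StarCoder model card: <fim_prefix>,
# <fim_suffix>, <fim_middle>. The model generates the missing middle,
# conditioned on the code both before and after the gap.
prefix = "def print_hello():\n    "
suffix = "\n    return None\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
print(fim_prompt)

# Feed fim_prompt to the model exactly like a normal prompt, e.g.:
# inputs = tokenizer(fim_prompt, return_tensors="pt")
# outputs = model.generate(**inputs, max_new_tokens=32)
```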
💥 Miscellaneous – a set of rapid-fire questions

I am really excited about the application of ML to science, in fields such as health, chemistry, math, or physics. One application that excites me most is AlphaFold, which helps scientists speed up protein structure research at an impressive scale. Technologies like this that support scientists will help science progress even faster.
The most popular one is HumanEval, which tests code LLMs on a set of Python coding challenges. We also used MultiPL-E, which extends HumanEval to over a dozen other languages. However, HumanEval only consists of coding-interview-style challenges and as such does not capture the full range of programming tasks.
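HumanEval results are usually reported as pass@k. The snippet below is a sketch of the standard unbiased estimator from the Codex paper, where n generations are sampled per problem and c of them pass the unit tests:

```python
# Unbiased pass@k estimator from the Codex/HumanEval paper:
# pass@k = 1 - C(n - c, k) / C(n, k), averaged over problems,
# where n = samples generated per problem and c = samples that pass the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 pass; estimate pass@1 and pass@10.
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```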
One thing we learned from releases such as Stable Diffusion or LLaMA is the creativity and capability of the open-source community. Within weeks of those releases, the community built dozens of variants of the models as well as custom applications, more than any company or institution could come up with. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use cases and will enable countless downstream applications. While it is easier to keep control over closed-source API models, the lack of transparency makes it harder to build trust around such systems and denies researchers the possibility of making them safer.
There are lots of interesting avenues for future code LLMs! Evaluation is definitely in its infancy compared to natural language and will need to improve to better capture the user experience. In terms of generation capability, the models are getting very good at function-level completions but struggle with building longer, more complex structures, as well as editing a whole codebase to implement a new feature, for example. Additionally, they are not able to interactively debug code, where they execute a piece of code and, based on the error or behavior, improve the solution. Solving these challenges opens a lot of very exciting directions!