The Sequence Chat: Hugging Face's Leandro von Werra on StarCoder and Code Generating LLMs
StarCoder is one of the most ambitious code generation foundation models released in recent times.

👤 Quick bio
I originally did a master's degree in physics focusing on astrophysics, but around that time I noticed the breakthroughs happening in ML, so I decided to switch the focus of my studies towards ML. After finishing my master's thesis on ML for precision medicine, I joined a start-up as a data scientist, where I worked on a wide range of industry projects. This is also where I met Lewis Tunstall, and as language models such as BERT and GPT-2 started taking off, we decided to start working on a textbook about transformer models and the Hugging Face ecosystem. When we reached out to Hugging Face, we met Thomas Wolf, the Chief Science Officer and co-founder of Hugging Face, who joined the project as a co-author. As the book came to an end, Lewis and I joined Hugging Face. Since then, I have worked as a Machine Learning Engineer on both the open-source and research teams on projects such as Evaluate, TRL, and CodeParrot, and I have recently been co-leading the BigCode research collaboration.

🛠 ML Work
Codex and Copilot had a big impact on the community and on developers and can lead to a significant productivity boost in professional workflows. However, those models are closed source, so you cannot freely adapt them to your use case or experiment with fine-tuning, and you need to send your data to an external API. In addition, there are big open questions around data governance: what data was the model trained on, what licenses were included, how should sources be attributed, and what if you want to be excluded? There are several open models, but they lack Copilot's performance and also don't fully disclose how their datasets were created and filtered. The goal of BigCode, and subsequently StarCoder, was to address these issues and produce a high-performance code model with clear data governance structures. The project is a spiritual successor of BigScience and is run as an open research collaboration that any research or industry expert can join.
StarCoderBase is trained on 80+ programming languages for 1T tokens. Since a lot of developers work in Python, we continued training StarCoder for about 35B tokens (~3% of full training) on the Python subset, which led to a significant performance boost. Surprisingly, it also led to a performance increase in some other languages, such as R and Swift. On the other hand, we found that StarCoderBase can be prompted to act as a tech assistant: by simply adding a few example conversations to the context (see the TA prompt), you can ask StarCoderBase to help you solve programming-related questions. StarChat (alpha) is even better at that, since it was specifically fine-tuned on conversations and instructions.
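The TA-prompt idea above is few-shot prompting: prepend a handful of example exchanges so the base model continues in the assistant's voice. A minimal sketch of such a prompt builder follows; the example exchanges and the `Human:`/`Assistant:` turn markers here are illustrative placeholders, not the actual published TA prompt.

```python
def build_ta_prompt(question: str) -> str:
    """Build a tech-assistant-style prompt for a base code model by
    prepending a few example exchanges (few-shot prompting)."""
    # Hypothetical example exchanges; the real TA prompt is published by BigCode.
    examples = [
        ("How do I reverse a list in Python?",
         "You can use `my_list[::-1]` or `list(reversed(my_list))`."),
        ("What does `git rebase` do?",
         "It replays your commits on top of another base commit."),
    ]
    parts = [f"Human: {q}\nAssistant: {a}" for q, a in examples]
    # End with an open "Assistant:" turn so the model completes the answer.
    parts.append(f"Human: {question}\nAssistant:")
    return "\n\n".join(parts)

prompt = build_ta_prompt("How do I read a file line by line?")
```

The resulting string would be passed to the model as-is; generation is then stopped at the next `Human:` turn.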
The data curation probably made up 60-80% of the whole project. There were two main ingredients to creating a good pretraining dataset. First, we applied strong near-deduplication, where similar files are removed. It might sound counterintuitive, but strongly near-deduplicating the dataset first allows you to safely train for a few epochs without performance degradation. Second, for each file extension we examined at least 100 samples and derived heuristics to exclude low-quality files (e.g. data or auto-generated files). In addition, we labelled a PII dataset for code to train a PII detector; at that scale, even applying that PII model to the whole dataset required several hundred GPU hours. We also excluded code files from users who had opted out of the dataset. Finally, we trained the model on 512 A100 GPUs for 24 days. The training was extremely smooth: we had some restarts due to hardware failures, but those mostly happened automatically. Training at that scale with modern tools such as Megatron, and using BF16, is very smooth.
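Near-deduplication at this scale is typically done with MinHash signatures, which estimate the Jaccard similarity between files without comparing them pairwise in full. The following is a minimal sketch of the idea, not BigCode's actual pipeline; the shingle size and number of hash functions are arbitrary choices for illustration.

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingles of a source file."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(sh: set, num_hashes: int = 64) -> list:
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value observed over all shingles."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in sh))
    return sig

def est_jaccard(a: list, b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc1 = "def add(a, b):\n    return a + b\n"
doc2 = "def add(a, b):\n    return a + b  # sum\n"  # near-duplicate of doc1
doc3 = "class Tree:\n    pass\n"                    # unrelated file
sim_near = est_jaccard(minhash(shingles(doc1)), minhash(shingles(doc2)))
sim_far = est_jaccard(minhash(shingles(doc1)), minhash(shingles(doc3)))
```

In practice the signatures are bucketed with locality-sensitive hashing so only candidate pairs in the same bucket are ever compared, which is what makes this tractable over billions of files.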
Indeed, we also found that Jupyter notebooks are a treasure trove of interesting data, with lots of tutorials and examples. We parsed the notebooks in two ways:

- We converted the notebooks to source code, where the markdown cells become code comments.
- We parsed the notebooks into a structured format, where the cells become text-code-output-text chains separated by special tokens.

The structured format also allows us to easily provide the whole notebook as context (incl. cell outputs) for code completion in Jupyter notebooks (see this Jupyter plugin).
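The first parsing strategy can be sketched in a few lines, since notebooks are just JSON with a `cells` list. This is an illustrative simplification of the conversion (real notebooks also carry outputs, metadata, and attachments that a production parser has to handle).

```python
import json

def notebook_to_script(nb_json: str) -> str:
    """Linearize a Jupyter notebook: markdown cells become comments,
    code cells are kept verbatim (the first parsing strategy above)."""
    nb = json.loads(nb_json)
    lines = []
    for cell in nb.get("cells", []):
        src = "".join(cell.get("source", []))
        if cell.get("cell_type") == "markdown":
            lines.extend("# " + line for line in src.splitlines())
        elif cell.get("cell_type") == "code":
            lines.append(src)
    return "\n".join(lines)

nb = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["Compute a sum"]},
    {"cell_type": "code", "source": ["total = 1 + 2\n", "print(total)"]},
]})
script = notebook_to_script(nb)  # "# Compute a sum\ntotal = 1 + 2\nprint(total)"
```

The second, structured strategy works on the same JSON but keeps cell boundaries and outputs, emitting sentinel tokens between them instead of flattening everything into one script.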
Indeed, in a sense StarCoder is the combination of the best available techniques, and most of the performance can probably be attributed to careful work on the dataset. The architecture goal was to make the model easy to use and deploy and to fulfill users' needs: fast inference, cheap generation, long contexts, and infilling using context from both sides. To achieve these goals, we trained a moderately sized but fast model ("just" 15B) with multi-query attention (MQA) to scale generation, implemented Flash Attention to train with context windows of 8,192 tokens, and used the Fill-in-the-Middle objective in addition to the normal autoregressive language modeling objective.

💥 Miscellaneous – a set of rapid-fire questions
I am really excited about the application of ML to science, such as health, chemistry, math, or physics. The application that excites me most is AlphaFold, which helps scientists speed up protein structure prediction at an impressive scale. Technologies like this that support scientists will help science progress even faster.
The most popular one is HumanEval, which tests LLMs for code on a variety of coding challenges in Python. We also used MultiPL-E, which extends HumanEval to over a dozen other languages. However, HumanEval only consists of coding-interview-style challenges and as such does not capture the full range of programming tasks.
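HumanEval-style benchmarks are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. Rather than literally sampling k completions, the standard unbiased estimator (introduced with Codex) draws n >= k samples, counts the c that pass, and computes the probability combinatorially:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn (without replacement) from n generated samples
    passes, given that c of the n samples passed the tests."""
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(10, 3, 1)  # estimates pass@1 from 10 samples, 3 of which passed
```

Averaging this quantity over all problems in the benchmark gives the headline pass@k score.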
One thing we learned from releases such as Stable Diffusion or LLaMA is the creativity and capability of the open-source community. Within weeks of those releases, the community built dozens of variants of the models as well as custom applications – more than any single company or institution could come up with. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use cases and will enable countless downstream applications. While it is easier to keep control over closed-source API models, the lack of transparency makes it harder to build trust around such systems, and it denies researchers the opportunity to make them safer.
There are lots of interesting avenues for future code LLMs! Evaluation is definitely in its infancy compared to natural language and will need to improve to better capture the user experience. In terms of generation capability, the models are getting very good at function-level completions but struggle with building longer, more complex structures, as well as with editing a whole codebase to implement a new feature, for example. Additionally, they are not yet able to interactively debug code, where they execute a piece of code and improve the solution based on the error or behavior. Solving these challenges opens a lot of very exciting directions!