The Sequence Chat: Hugging Face's Leandro von Werra on StarCoder and Code Generating LLMs
StarCoder is one of the most ambitious code generation foundation models released in recent times.

👤 Quick bio
I originally did a master's degree in physics focusing on astrophysics, but around that time, I noticed the breakthroughs happening in ML, so I decided to switch the focus of my studies towards ML. After finishing my master's thesis in ML for precision medicine, I joined a start-up as a data scientist, where I worked on a wide range of industry projects. This is also where I met Lewis Tunstall, and as language models like BERT and GPT-2 started taking off, we decided to start working on a textbook about transformer models and the Hugging Face ecosystem. When reaching out to Hugging Face we met Thomas Wolf, the Chief Science Officer and co-founder at Hugging Face, who joined the project as a co-author. As the book came to an end, Lewis and I joined Hugging Face. Since then, I have worked as a Machine Learning Engineer on both the open-source and research teams, on projects such as Evaluate, TRL, and CodeParrot, and recently co-leading the BigCode research collaboration.

🛠 ML Work
Codex and Copilot had a big impact on the community and developers and can deliver a significant productivity boost in professional workflows. However, those models are closed source, so you cannot freely adapt them to your use case or experiment with fine-tuning, and you need to send your data to an external API. In addition, there are big open questions around data governance: what data was the model trained on, what licenses were included, how should sources be attributed, and what if you want to be excluded? There are several open models, but they lack Copilot's performance and also don't fully disclose how their datasets were created and filtered. The goal of BigCode, and subsequently StarCoder, was to address these issues and produce a high-performance code model with clear data governance structures. The project is a spiritual successor of BigScience and is run as an open research collaboration that any research or industry expert can join.
StarCoderBase is trained on 80+ programming languages for 1T tokens. Since a lot of developers work in Python, we continued to train StarCoder for about 35B tokens (~3% of full training) on the Python subset, which led to a significant performance boost. Surprisingly, it also led to a performance increase in some other languages, such as R and Swift. We also found that StarCoderBase can be prompted to act as a tech assistant: by simply adding a few example conversations to the context (see the TA prompt), you can ask StarCoderBase to help you solve programming-related questions. StarChat (alpha) is even better at that, since it was specifically fine-tuned on conversations and instructions.
The data curation probably made up 60-80% of the whole project. There were two main ingredients to creating a good pretraining dataset. First, we applied strong near-deduplication, where similar files are removed. It might sound counterintuitive, but strongly near-deduplicating the dataset first allows you to safely train for a few epochs without performance degradation. Second, for each file extension we examined at least 100 samples and derived heuristics to exclude low-quality files (e.g. data or auto-generated files). In addition, we labelled a PII dataset for code to train a PII detector. At that scale, even applying that PII model to the whole dataset required several hundred GPU hours. We also excluded code files from users who had opted out of the dataset. Finally, for the training we used 512 A100 GPUs for 24 days. The training was extremely smooth: we had some restarts due to hardware failures, but those mostly happened automatically. Training at that scale with modern tools such as Megatron and using BF16 is very smooth.
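To illustrate the near-deduplication idea, here is a minimal sketch using character shingles and Jaccard similarity. This is an assumption-laden toy, not the BigCode pipeline: the function names, shingle size, and similarity threshold are all chosen for the example, and the O(n²) pairwise comparison would not scale to a real pretraining corpus, where MinHash-style approximate methods are used instead.

```python
def shingles(text, k=5):
    # Character k-grams of a file's contents.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # Jaccard similarity between two shingle sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_dedup(files, threshold=0.7):
    # Keep each file only if it is not too similar to an already-kept file.
    # O(n^2) pairwise comparison; a production pipeline would use MinHash + LSH.
    kept = []
    for text in files:
        s = shingles(text)
        if all(jaccard(s, shingles(other)) < threshold for other in kept):
            kept.append(text)
    return kept
```

A trivially modified copy of a file shares almost all its shingles with the original, so it scores near 1.0 and is dropped, while genuinely different files score near 0 and survive.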
Indeed, we found that Jupyter notebooks are a treasure trove of interesting data with lots of tutorials and examples. We parsed the notebooks in two ways:

- We converted the notebooks to source code, where the markdown cells become code comments.
- We parsed the notebooks into a structured format, where the cells become text-code-output-text chains separated by special tokens. This also allows us to easily provide the whole notebook as context (incl. cell outputs) for code completion in Jupyter notebooks (see this Jupyter plugin).
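The first parsing strategy can be sketched as a small, hypothetical helper that linearizes an nbformat-4 notebook, turning markdown cells into `#` comments and keeping code cells verbatim. The function name and comment style are assumptions for illustration; the real pipeline also had to deal with raw cells, outputs, and malformed notebooks.

```python
import json

def notebook_to_source(nb_json: str) -> str:
    # Linearize a Jupyter notebook (nbformat 4 JSON): markdown cells
    # become '#' comments, code cells are kept verbatim.
    nb = json.loads(nb_json)
    parts = []
    for cell in nb.get("cells", []):
        src = "".join(cell.get("source", []))
        if cell["cell_type"] == "markdown":
            parts.append("\n".join("# " + line for line in src.splitlines()))
        elif cell["cell_type"] == "code":
            parts.append(src)
    return "\n\n".join(parts)
```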
Indeed, in a sense StarCoder is a combination of the best available techniques, and most of the performance can probably be attributed to careful work on the dataset. The architecture goal was to make the model easy to use and deploy and to fulfill users' needs: fast inference, cheap generation, long contexts, and infilling using context from both sides. To achieve these goals, we trained a moderately sized but fast model ("just" 15B parameters) with multi-query attention (MQA) to scale generation, implemented Flash Attention to train with context windows of 8,192 tokens, and used the Fill-in-the-Middle (FIM) objective in addition to the normal autoregressive language modeling objective.
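As a rough sketch of how Fill-in-the-Middle is used at inference time: the prompt is assembled so the model sees the code before and after the insertion point and generates the missing middle. The sentinel token names below follow StarCoder's published convention, but check the tokenizer's special tokens for the exact strings; the helper itself is hypothetical.

```python
def make_fim_prompt(prefix: str, suffix: str) -> str:
    # Assemble a fill-in-the-middle prompt in prefix-suffix-middle order:
    # the model generates the middle after seeing both sides of the gap.
    # Sentinel names assume StarCoder's convention (<fim_prefix> etc.).
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
```

For example, completing the body of a function would pass everything up to the cursor as `prefix` and the code after the cursor as `suffix`, then sample from the model until it emits an end-of-middle token.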
💥 Miscellaneous – a set of rapid-fire questions

I am really excited about the application of ML to science, such as health, chemistry, math, or physics. One application that excites me most is AlphaFold, which helps scientists speed up protein research at an impressive scale. Technologies like this that support scientists will help science progress even faster.
The most popular one is HumanEval, which tests code LLMs on a variety of coding challenges in Python. We also used MultiPL-E, which extends HumanEval to over a dozen other languages. However, HumanEval only consists of coding-interview-style challenges and as such does not capture the full range of programming tasks.
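HumanEval results are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the standard unbiased estimator (introduced with Codex) looks like this; the function name is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: n samples generated per problem,
    # c of which pass the tests. Equals 1 - C(n-c, k) / C(n, k).
    if n - c < k:
        return 1.0  # fewer than k failures: every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all problems in the benchmark gives the reported pass@k score.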
One thing we learned from releases such as Stable Diffusion or LLaMA is the creativity and capability of the open-source community. Within weeks of the release, the community built dozens of variants of the model as well as custom applications – more than any single company or institution could come up with. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use cases and will enable countless downstream applications. While it is easier to keep control over closed-source API models, the lack of transparency makes it harder to build trust around such systems, and it also denies researchers the opportunity to make them safer.
There are lots of interesting avenues for future code LLMs! Evaluation is definitely in its infancy compared to natural language and will need to improve to better capture the user experience. In terms of generation capability, the models are getting very good at function-level completions but struggle with building longer, more complex structures, as well as with editing a whole codebase to implement a new feature, for example. Additionally, they are not yet able to interactively debug code, where they execute a piece of code and improve the solution based on the error or behavior. Solving these challenges opens a lot of very exciting directions!