Last year, with the paper "Textbooks Are All You Need," Microsoft introduced the small language model (SLM) Phi and defied existing scaling laws, demonstrating that high-quality data alone can be enough to build a model that competes with much larger ones. In less than a year, at Build 2024, they introduced additional Phi-3 models, including Phi-3-small, Phi-3-medium, and Phi-3-vision. What a crazy speed! We invited Sébastien Bubeck and Ronen Eldan to discuss the intuition behind their approach, the surprising effect of a diverse dataset on a model, the challenges of using synthetic data, the future of SLMs, and more.

Hi Sébastien and Ronen, great to have you for this interview. "Small language model" (SLM) is a new term, but one that has already been widely adopted. What was the intuition behind the approach you took in "Textbooks Are All You Need"? Can you walk us through your thinking process during the research phase?

Sébastien Bubeck: On the heels of the "Sparks of AGI" paper, we decided that to "understand" what's happening in LLMs we had to try to build our own version. Of course, at the time we had no experience whatsoever in training those large transformers, we didn't have much data to get started with, and, as we had just learned in Sparks, it was probably going to be difficult to evaluate whatever LLM we trained (the zoo of academic benchmarks looked daunting at the time...). So what did we do? Well, we simply decided to narrow the scope as much as possible: we picked coding as a target, because there was an existing large dataset (The Stack), a simple and reasonable evaluation metric (HumanEval by OpenAI), and it had already been shown that small networks with ~1B parameters could do a decent job at this. So at that point we had a clear goal: get a HumanEval score as high as possible with an SLM and just the couple dozen GPUs we had access to. The latter constraint was key too: it meant we had to restrict our data in some ways. Then naturally came the idea of filtering The Stack to keep only "educational content" (as defined by GPT-4, which was doing the filtering!) as well as writing "synthetic textbooks" to further diversify the data we exposed our model to. This whole project took merely a month, and it was incredibly exhilarating because every week we would get +10% on HumanEval! After a month we reached 50% and decided to call it a victory at that point 😊. Then the next question became: can this approach be used beyond a narrow domain such as coding? That's when we embarked on common sense reasoning with phi-1.5, and then simply general cognitive ability with phi-2 and eventually phi-3!
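To make the filtering idea concrete, here is a minimal sketch of what LLM-as-judge quality filtering can look like. This is an illustration rather than the actual phi-1 pipeline: the judge prompt, the model name, and the threshold are all assumptions; the interview only tells us that GPT-4 defined "educational content."

```python
# Minimal sketch of LLM-as-judge quality filtering, in the spirit of
# "Textbooks Are All You Need". The prompt wording, model name, and
# threshold are illustrative assumptions, not the phi-1 pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Rate the educational value of the following code snippet for a "
    "student learning basic coding concepts. Answer with a single "
    "integer from 1 (no value) to 10 (textbook quality).\n\n{snippet}"
)

def educational_score(snippet: str) -> int:
    """Ask the judge model to grade one snippet's educational value."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4 judge mentioned above
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(snippet=snippet)}],
        temperature=0,
    )
    # Assumes the judge complies with the single-integer format.
    return int(response.choices[0].message.content.strip())

def filter_corpus(snippets: list[str], threshold: int = 7) -> list[str]:
    """Keep only the snippets the judge rates at or above the threshold."""
    return [s for s in snippets if educational_score(s) >= threshold]
```

At the scale of a corpus like The Stack, one would typically annotate only a sample this way and then train a small, cheap classifier on those labels, rather than calling the large model on every file.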
The innovative approach with Phi was to use high-quality, curated datasets (TinyStories, CodeTextbook) instead of massive web data. What were the challenges and advantages of this approach compared to the traditional large-scale data collection used for LLMs? Were there any unexpected findings in terms of model performance or behavior?

Ronen Eldan: The main challenge of curating a dataset from "thin air," rather than using an existing source, is how to make the content diverse. What I mean by this is: suppose you want to create a dataset that teaches the model commonsense facts. You can't just tell the model "give me a list of all commonsense facts there are" – this is not going to work, just as you can't reasonably expect a human to provide such a list. The language model will simply give you a repetitive list containing many instances of the most commonly thought-of commonsense facts, but it won't be able to find the more obscure ones. So it's not clear at all how to "span" all facts and, in general, how to get your dataset to span human knowledge. When we successfully created a diverse dataset, we were very surprised by its effect on the model: at a small scale (meaning that the model is efficient in terms of speed and cost), we got much more capable models than we had expected.
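A common way around the diversity problem Ronen describes (used, for example, in the TinyStories work) is to inject randomness into the generation prompt itself, so that each call is pushed into a different corner of the space. The sketch below is illustrative: the seed lists and the prompt template are assumptions, not the recipe behind the actual datasets.

```python
# Sketch of diversity-by-construction for synthetic data: rather than
# asking an LLM for "all commonsense facts" at once, each prompt is
# seeded with a random combination of constraints so that the outputs
# collectively cover the long tail. The seed lists are illustrative.
import random

TOPICS = ["cooking", "weather", "money", "friendship", "tools", "animals"]
AUDIENCES = ["a young child", "a teenager", "an adult learner"]
FORMATS = ["a short story", "a dialogue", "a how-to explanation"]

def make_prompt(rng: random.Random) -> str:
    """Sample one random constraint combination and build a prompt."""
    topic = rng.choice(TOPICS)
    audience = rng.choice(AUDIENCES)
    fmt = rng.choice(FORMATS)
    return (
        f"Write {fmt} for {audience} that teaches one commonsense "
        f"fact about {topic}. Keep it under 200 words."
    )

rng = random.Random(0)
for _ in range(3):
    # Each prompt would then be sent to an aligned LLM; the random
    # (topic, audience, format) triple is what forces coverage.
    print(make_prompt(rng))
```

Because the seeds combine multiplicatively (6 × 3 × 3 = 54 distinct templates here, before any sampling randomness inside the LLM), even a modest seed vocabulary forces coverage far beyond the handful of facts a model would volunteer on its own.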
The use of LLMs to generate synthetic training data for SLMs raises questions about potential biases and safety concerns inherited from the LLM. What measures should be taken to ensure the quality and safety of synthetic datasets?

Ronen Eldan: First of all, in terms of safety, the fact that every entry of the dataset is produced by a model that has already been aligned for safety is a huge benefit. In our experience, "organic" datasets (such as those coming from web pages) are empirically much worse in terms of safety. That being said, it's always important to test your model for safety once it's been trained – you should never trust what the model will do before you have actually inspected it. In terms of bias, this is indeed a huge challenge. Here, too, I speculate that synthetic datasets will be better than most other datasets, and there are some benchmarks to suggest that. But I think the community is still a bit behind in developing reliable benchmarks for testing bias, simply because this is a very challenging thing to do (just as testing bias in humans is challenging). There's quite a lot of work on that, and I think we're advancing slowly but surely towards safer and less biased models.

Phi-3 models were developed in accordance with the Microsoft Responsible AI Standard, a company-wide set of requirements based on six principles: accountability, transparency, fairness, reliability and safety, privacy and security, and inclusiveness. Phi-3 models underwent rigorous safety measurement and evaluation, red-teaming, sensitive-use review, and adherence to security guidance to ensure that these models are responsibly developed, tested, and deployed in alignment with Microsoft's standards and best practices.

Our approach to safety training and evaluation is detailed in our technical paper, and we outline recommended uses and limitations in the model cards. See the model card collection.

Recently, at Microsoft Build, your team introduced a whole new family of Phi-3 models. What has been achieved since June 2023, in less than a year (what a crazy speed!)?

Ronen Eldan: Indeed, we've surprised ourselves with how much progress we've made within a year. In April 2024, we debuted the Phi-3 family of models to the world with the release of Phi-3-mini in Azure. And at Build 2024 we introduced more Phi-3 models to our Azure AI model catalog, including Phi-3-small, Phi-3-medium, and Phi-3-vision, a multimodal model that brings together language and vision capabilities. But there's definitely still a gap between SLMs and LLMs.

We keep pushing the boundaries by developing new techniques to generate synthetic data, collecting high-quality data from external sources, and optimizing other aspects such as the model architecture. For example, one small idea about how you filter web data might make a big difference in model performance. I think we're still far from understanding the full potential of SLMs.

How do you envision SLMs like Phi-3 being integrated into everyday devices (smartphones, cameras, sensors, etc.)? What new possibilities and potential use cases do you see emerging from this integration, and are there any upcoming features or improvements you're particularly excited about? Or, more broadly, what is next for the Phi family?

Sébastien Bubeck: I personally cannot wait for SLMs like Phi-3 to be embedded everywhere. We're starting to see this already with Phi Silica, a derivative of Phi-3-mini designed specifically to run on the Copilot+ PCs we announced on May 20, right before Build 2024. Windows is the first platform to have a state-of-the-art SLM custom-built for the NPU and shipping inbox later this year.

Eventually, I would love to be able to talk to my watch when I'm running and have it do a few actions on my behalf (Phi-3 can easily do that). Or have an SLM on my mobile phone for when I go hiking and want to ask questions about the various things I'm seeing. The applications are endless here.

Microsoft banks on both large and small language models, both following and reshaping scaling laws. What is the reasoning behind this strategy? How does the company see these models coexisting? Do you see the industry shifting towards smaller, more efficient models?

Sébastien Bubeck: We think there is room for both small and large language models. When we care about high-stakes scenarios, be it in healthcare or perhaps even simply in your copilot trying to understand your GitHub repo, we are willing to spend more energy and time to get the best possible response. That's where you want to use frontier models such as GPT-4. But on the other hand, there are cases where you might be making millions of calls to the model, and what matters is the latency and the cost, because you might be willing to tolerate a few errors here and there at that scale. That's where SLMs will shine. Or it might be that for privacy and security reasons you really need all your computation to be done on-device, and again, SLMs are perfect for this. So in the future, I see both directions (SLMs and LLMs) as incredibly important. It is about the Pareto frontier of cost versus quality, and any particular application will fall at a different place on this frontier!
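In application code, this Pareto-frontier view often shows up as a small routing layer in front of the models. The sketch below is hypothetical: the model names, thresholds, and request fields are illustrative assumptions, not any real Microsoft API.

```python
# Hypothetical router over the cost/quality Pareto frontier described
# above: high-stakes requests go to a frontier LLM, while high-volume,
# latency-sensitive, or on-device requests go to an SLM. Model names,
# thresholds, and request fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    high_stakes: bool = False      # e.g. healthcare, code review
    needs_on_device: bool = False  # privacy/security constraint
    latency_budget_ms: int = 2000

def pick_model(req: Request) -> str:
    """Choose a point on the cost-versus-quality frontier."""
    if req.needs_on_device:
        return "phi-3-mini"    # computation never leaves the device
    if req.high_stakes:
        return "gpt-4"         # spend more time/energy for the best answer
    if req.latency_budget_ms < 500:
        return "phi-3-mini"    # millions of cheap, fast calls
    return "phi-3-small"       # a middle point on the frontier

print(pick_model(Request("summarize this log line", latency_budget_ms=200)))
# -> phi-3-mini
```

Real routers can also inspect the prompt itself (length, detected task type) or cascade: try the SLM first and escalate to the LLM only when confidence is low.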
Federated learning techniques like FLUTE show promise in collaboratively training models across decentralized devices while preserving data privacy. Is there any joint research adapting FLUTE or similar approaches for training and continually improving SLMs in federated settings?

Ronen Eldan: We're currently not looking into that – we've simply been moving too fast. But it's on our list...

Ronen, do you believe the way to AGI is to mimic the way children learn?

Ronen Eldan: Perhaps the most substantial difference between how models are trained and how children learn is the fact that children actually get continual feedback from the world. A language model only gets one long chunk of text, each example being just one instance of some plausible piece of text. A child, on the other hand, in addition to "passive" input from the world, also has the ability to "try" a certain behavior and get feedback on how the external world reacts to it (in this sense, it's a much more interactive sort of learning). A language model cannot test "what if" questions (it cannot try out a new approach to writing text and get feedback on whether or not it's good). However, the above is only valid for pretraining, and we already know that models can get better when you leverage feedback given by humans on model outputs. So in that sense, we already have models out in the world that are trying out different behaviors and evolving according to feedback.
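To make the contrast between these two learning signals concrete, here is a deliberately toy sketch. Everything in it is schematic: the corpus, the reward function, and the updates are stand-ins for real data, human preference labels, and gradient steps.

```python
# Toy contrast between the two learning signals described above.
# Nothing here trains a real model; it only shows the shape of the
# two loops. toy_reward stands in for human feedback (RLHF-style).

corpus = ["the cat sat on the mat", "water boils at one hundred degrees"]

# 1) Pretraining: purely passive. The model only ever observes
#    existing text and is scored on predicting the next token; it
#    never gets to "try" a behavior and watch the world react.
for text in corpus:
    tokens = text.split()
    for i in range(1, len(tokens)):
        context, target = tokens[:i], tokens[i]
        # loss = -log p(target | context); a gradient step would go here

# 2) Feedback: interactive. The model produces its own output, an
#    external judge reacts, and the reaction steers future behavior
#    (the "what if" loop a child has).
def toy_reward(output: str) -> float:
    """Stand-in for a human preference label on a model output."""
    return 1.0 if output.endswith(".") else -1.0

candidate = "Water boils at 100 degrees Celsius."  # a model's own attempt
reward = toy_reward(candidate)
# update the policy so that high-reward behaviors become more likely
print(f"reward for candidate: {reward}")
```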
What other research areas, apart from your everyday job, do you follow closely and think are essential for moving the AI industry forward?

Ronen Eldan: Any research that gives insight into the ability of language models to self-improve, and into the methodology that can be used to achieve that.

Thank you for reading! If you found it interesting, please share it or upgrade to Premium to support our effort 🤍