The Sequence Chat: Vipul Ved Prakash, CEO, Together on Decentralized, Open Source Foundation Models
Together has been behind some of the most interesting releases in open source foundation models.

👤 Quick bio
My background is in large-scale distributed systems and information retrieval, and most of my professional career has involved solving problems of text understanding. I created an open source anti-spam filter called Vipul's Razor and founded a company based on it (Cloudmark) in the early 2000s. We used locality sensitive hashing and probabilistic classification, and got fantastic scale and results. This resulted in a long-lasting fascination with learning from unstructured data. I later founded a company (Topsy) that built a social media search and analytics system where we used machine learning and graph methods for ranking, deduplication and sentiment analysis. Topsy was acquired by Apple, and I directed various efforts there, including Spotlight search, federated learning systems that employed differential privacy, as well as Siri's open-domain Q&A ability.

🛠 ML Work
I see foundation models as the terminal point of the first generation of human-computer interaction where we had to laboriously and precisely instruct computers to perform a task. Foundation models open up the possibility where we can simply describe our task and shift the burden of devising a solution to computers. In this framing, foundation models represent a very broad form of human-computer interface, perhaps occupying a position similar to compilers or microprocessors. A tremendous amount of economic and societal value of computing has come from open systems like the Internet, open programming languages and commodity microprocessors, so it seems important to us that there should be a strong open-source foundation model ecosystem.
It was challenging in several ways; it feels light years ago now, as OpenChatKit was created pre-LLaMA and pre-Alpaca. Back then, it was quite unclear what makes a great chat model. We were lucky to have bet on instruction data as the key ingredient. We also made a quite explicit decision not to use OpenAI data, in order to have something clean from the copyright side. This constrained us a bit, as many chat models now use distilled data. Instead, through a community process together with LAION, we created a dataset of 40M "weak" instructions from various sources. This dataset was later augmented with data provided by users of OpenChatKit through a feedback app and is available for use as OIG. There is also the moderation model: even today, to the best of our knowledge, OpenChatKit is one of the few (if not the only) chat models that recommends a layer of moderation through a specifically designed moderation model. Building such a model from scratch was a lot of work, but it is worthwhile, as LLMs can get unintentionally offensive.
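To make the moderation layer concrete, here is a minimal sketch of the pattern described above, assuming a Hugging Face setup: classify each user turn with a separate moderation model before the chat model ever sees it. The model ids, label names, and threshold are placeholders rather than OpenChatKit's actual components.

```python
# Sketch of the moderation pattern: run every user turn through a separate
# moderation classifier before the chat model responds.
# The model ids, label names, and threshold are illustrative placeholders.
from transformers import pipeline

moderation = pipeline("text-classification", model="org/moderation-model")  # placeholder id
chat = pipeline("text-generation", model="org/chat-model")                  # placeholder id

def respond(user_input: str, max_new_tokens: int = 128) -> str:
    verdict = moderation(user_input)[0]        # e.g. {"label": "needs_intervention", "score": 0.97}
    if verdict["label"] != "casual" and verdict["score"] > 0.9:
        return "I'm not able to help with that request."
    prompt = f"<human>: {user_input}\n<bot>:"  # OpenChatKit-style turn markers
    out = chat(prompt, max_new_tokens=max_new_tokens, do_sample=True)[0]["generated_text"]
    return out[len(prompt):].strip()
```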
For RedPajama we closely followed the recipe as outlined in the LLaMA paper. We took the 7 different slices of data: Common Crawl, C4, GitHub, Books, ArXiv, Wikipedia and StackExchange, and carefully recreated the filtering process. This involved using the CCNet pipeline and several quality filters, including a linear classifier that selects for Wikipedia-like pages. We tuned the hyperparameters to roughly get the same number of tokens from each slice as described in the LLaMA paper. To us, an "open model" implies not just open weights and a permissive license, but also open data and an open data recipe. This allows the community to inspect the data, improve it, or filter and preprocess it differently to create a model that better fits a downstream application. We think open data and data creation recipes are critical for monotonic progress in open source models.
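As an illustration of the quality-filtering step, here is a hedged sketch of a linear "Wikipedia-like page" classifier in the spirit of the LLaMA recipe. The features, toy training data, and threshold are assumptions, not RedPajama's actual implementation.

```python
# Hedged sketch of a "Wikipedia-like page" quality filter: a linear classifier
# scores Common Crawl pages and low-scoring ones are dropped.
# Features, training data, and threshold are illustrative assumptions.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: Wikipedia-like reference pages (label 1) vs. random crawl pages (label 0).
wiki_like = ["The mitochondrion is an organelle found in most eukaryotic cells..."]
random_cc = ["BUY CHEAP WATCHES!!! best deals click now free shipping..."]

vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vec.transform(wiki_like + random_cc)
y = [1] * len(wiki_like) + [0] * len(random_cc)

clf = LogisticRegression().fit(X, y)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    """Keep a page only if the classifier scores it as Wikipedia-like."""
    return clf.predict_proba(vec.transform([text]))[0, 1] >= threshold
```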
Distributed training is a key focus of Together's research work on reducing the costs of training and inference. GPT-JT was trained using a precursor to our CocktailSGD training method, which reduces the network requirements for training by 117x. CocktailSGD, as the name suggests, uses a combination of quantization, asynchrony, local training and top-K compression to be able to fine-tune large models over 1Gbps links. This allows us to use servers distributed across data centers, and connected over the open internet. It also allows for the best possible utilization of GPUs within a data center. The CocktailSGD paper has been accepted at ICML, so a detailed exploration will be published soon! We are quite optimistic that this set of techniques can be expanded and generalized to training large neural network based architectures.
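Two of the ingredients named above, top-K sparsification and coarse quantization of gradients, can be illustrated in a few lines. This is a simplified sketch to show why the communicated bytes shrink so dramatically, not the CocktailSGD implementation.

```python
# Illustration of top-K sparsification plus coarse quantization of a gradient
# tensor: send only the largest entries, at low precision, instead of the
# full float32 tensor. Parameters are arbitrary, for illustration only.
import torch

def compress_gradient(grad: torch.Tensor, k_fraction: float = 0.01, bits: int = 4):
    """Keep only the largest k% of entries, then quantize them to `bits` levels."""
    flat = grad.flatten()
    k = max(1, int(k_fraction * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    kept = flat[indices]
    scale = kept.abs().max() / (2 ** (bits - 1) - 1)   # uniform quantization step
    quantized = torch.round(kept / scale).to(torch.int8)
    return indices, quantized, scale

def decompress_gradient(indices, quantized, scale, shape):
    """Reconstruct a dense (mostly zero) gradient from the compressed message."""
    flat = torch.zeros(shape, dtype=torch.float32).flatten()
    flat[indices] = quantized.float() * scale
    return flat.reshape(shape)

grad = torch.randn(1024, 1024)
idx, q, s = compress_gradient(grad)
approx = decompress_gradient(idx, q, s, grad.shape)
# ~1% of entries at 4 bits (plus indices) instead of a million float32s per tensor.
```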
I am excited about state space models, which are sub-quadratic and support much larger contexts. There is research around applying imitation learning to address hallucinations, and work around data mixtures, like DoReMi, which will have a large impact. I think research on the data side is going to be the cornerstone of progress for the next few years.
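For readers unfamiliar with DoReMi, the core idea can be sketched roughly as follows: upweight domains where a small proxy model lags a reference model the most, then train the large model on the reweighted mixture. The update rule and numbers below are simplified assumptions, not the paper's exact recipe.

```python
# Rough sketch of DoReMi-style domain reweighting: increase the sampling weight
# of domains with the largest "excess loss" (proxy loss minus reference loss).
# Constants and losses here are made up for illustration.
import numpy as np

domains = ["common_crawl", "c4", "github", "books", "arxiv", "wikipedia", "stackexchange"]
weights = np.full(len(domains), 1.0 / len(domains))   # start from a uniform mixture

def update_weights(weights, proxy_loss, reference_loss, lr=1.0, smoothing=1e-3):
    """Exponentiated-gradient style step on per-domain excess loss."""
    excess = np.maximum(proxy_loss - reference_loss, 0.0)
    new = weights * np.exp(lr * excess)
    new /= new.sum()
    # Mix with uniform so no domain's weight collapses to zero.
    return (1 - smoothing) * new + smoothing / len(weights)

proxy_loss = np.array([2.9, 2.6, 1.8, 2.4, 2.2, 2.0, 2.1])
reference_loss = np.array([2.7, 2.5, 1.9, 2.3, 2.0, 2.0, 2.0])
weights = update_weights(weights, proxy_loss, reference_loss)
print(dict(zip(domains, weights.round(3))))
```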
💥 Miscellaneous – a set of rapid-fire questions

Open-source models are transparent. We know the data that composed these models, and have a better ability to reason about their behavior. This will be increasingly important as foundation models are used by regulated industries and in mission-critical applications.

Open-source models are privacy friendly. You can deploy them in infrastructure under your control and use them with sensitive data. The single-tenant SaaS model of closed-source foundation models is problematic in this way: you have to place a lot of trust in a company, especially if you are going to use closed models with sensitive customer data.

Open-source models can be customized. You can fine-tune them or pre-train them from the final (and in some cases intermediate) checkpoints on large amounts of data. We see 10-12 points of accuracy improvement on fine-tuned models.

Open-source models give you control. They won't disappear or change in unexpected ways, like a closed-source model might behind an API. Again, as these models mature and become critical to applications, developers will want more control over the weights.
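The customization point can be made concrete with a minimal fine-tuning sketch, assuming a Hugging Face stack; the model id, data file, and hyperparameters are placeholders, not a specific Together recipe.

```python
# Minimal sketch of customizing an open checkpoint on your own data; the
# model id, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "org/open-base-model"                 # placeholder for any open checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token                # many causal LMs ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# One JSON line per training document: {"text": "..."}
data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]
data = data.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
                batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM labels
)
trainer.train()   # the weights stay on infrastructure you control
```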
Centralized training on HPC-style clusters is likely the fastest way of building models today. While there's a lot of scope for optimization in the centralized setting, we generally have good software and knowledge here, and for companies building large foundation models, it often makes sense to follow the best practices and go with well understood infrastructure. Decentralized training wins significantly on cost. It's easier to get slightly lower-end hardware in multiple locations, and you can achieve lower upfront costs and elasticity. Centralization will likely run into scale-out limits, so the largest models are likely going to be trained in decentralized settings in the future. Architecturally, the progress feels similar to how things played out in the database world, where we had big monolithic databases, and then distributed and fault-tolerant options like DHTs appeared. There's a place for both, but I believe the techniques created in decentralized training will increasingly percolate to centralized settings.
I expect amazing progress in open source foundation models in 2023. We'll surpass GPT-3.5 quality in the open this year, and it will be a fairly big moment for open source. I also believe we'll continue to see optimization work that reduces the costs of working with AI, given the more resource-constrained landscape, and we'll see new architectures beyond transformers. I also expect SOTA open models in code, music, biology, and other niche areas.