The Sequence Chat: Vipul Ved Prakash, CEO, Together on Decentralized, Open Source Foundation Models
Together has been behind some of the most interesting releases in open source foundation models.

👤 Quick bio
My background is in large-scale distributed systems and information retrieval, and most of my professional career has involved solving problems of text understanding. I created an open source anti-spam filter called Vipul's Razor and founded a company based on it (Cloudmark) in the early 2000s. We used locality sensitive hashing and probabilistic classification, and got fantastic scale and results. This resulted in a long-lasting fascination with learning from unstructured data. I later founded a company (Topsy) that built a social media search and analytics system where we used machine learning and graph methods for ranking, deduplication and sentiment analysis. Topsy was acquired by Apple, and I directed various efforts there, including Spotlight search, federated learning systems that employed differential privacy, as well as Siri’s open-domain Q&A ability.

🛠 ML Work
I see foundation models as the terminal point of the first generation of human-computer interaction, in which we had to laboriously and precisely instruct computers to perform a task. Foundation models open up the possibility of simply describing our task and shifting the burden of devising a solution to the computer. In this framing, foundation models represent a very broad form of human-computer interface, perhaps occupying a position similar to compilers or microprocessors. A tremendous amount of the economic and societal value of computing has come from open systems like the Internet, open programming languages and commodity microprocessors, so it seems important to us that there should be a strong open-source foundation model ecosystem.
It was challenging in several ways. It feels like light years ago, as OpenChatKit was created pre-LLaMA and Alpaca. Back then, it was quite unclear what makes a great chat model. We were lucky to have bet on instruction data as the key ingredient. We also made a quite explicit decision not to use OpenAI data, in order to have something clean from the copyright side. This constrained us a bit, as many chat models now use distilled data. Instead, through a community process along with LAION, we created a dataset of 40M "weak" instructions from various sources. This dataset was later augmented with data provided by users of OpenChatKit through a feedback app, and it is available for use as OIG. There is also the moderation model: even today, to the best of our knowledge, OpenChatKit is one of the few (if not the only) chat models that recommends a layer of moderation through a specifically designed moderation model. Building such a model from scratch was a lot of work, but it is worthwhile, as LLMs can get unintentionally offensive.
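To illustrate the idea of a moderation layer, here is a minimal sketch in which a separate classifier screens user input before it reaches the chat model. The moderation model name, label scheme, threshold, and prompt format are placeholders for illustration rather than OpenChatKit's exact pipeline; the chat checkpoint shown is OpenChatKit's GPT-NeoXT-Chat-Base-20B.

```python
# Illustrative sketch: gating a chat model with a separate moderation classifier.
# The moderation model name, labels, threshold, and prompt format are placeholders,
# not OpenChatKit's exact setup.
from transformers import pipeline

MODERATION_MODEL = "your-org/moderation-classifier"       # hypothetical checkpoint
CHAT_MODEL = "togethercomputer/GPT-NeoXT-Chat-Base-20B"   # OpenChatKit's base chat model

moderator = pipeline("text-classification", model=MODERATION_MODEL)
chat = pipeline("text-generation", model=CHAT_MODEL)

def moderated_reply(user_message: str, threshold: float = 0.8) -> str:
    # Run the moderation model first; refuse if it flags the input with confidence.
    verdict = moderator(user_message)[0]
    if verdict["label"] != "safe" and verdict["score"] >= threshold:
        return "I can't help with that request."
    # Otherwise pass the message through to the chat model (prompt format is illustrative).
    out = chat(f"<human>: {user_message}\n<bot>:", max_new_tokens=128)
    return out[0]["generated_text"]
```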
For RedPajama we closely followed the recipe outlined in the LLaMA paper. We took the seven different slices of data: Common Crawl, C4, GitHub, Books, ArXiv, Wikipedia and StackExchange, and carefully recreated the filtering process. This involved using the CCNet pipeline and several quality filters, including a linear classifier that selects for Wikipedia-like pages (a rough sketch of this kind of filter follows below). We tuned the hyperparameters to get roughly the same number of tokens from each slice as described in the LLaMA paper. To us, an “open model” implies not just open weights and a permissive license, but also open data and an open data recipe. This allows the community to inspect the data, improve it, or filter and preprocess it differently to create a model that better fits a downstream application. We think open data and data-creation recipes are critical for monotonic progress in open source models.
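As a rough illustration of the quality filter mentioned above, the sketch below trains a linear classifier to score pages by how "Wikipedia-like" they are and keeps only those above a threshold. The training data, features, and threshold are stand-ins, not the exact CCNet/RedPajama configuration.

```python
# Illustrative sketch of a quality filter in the spirit of the LLaMA/RedPajama recipe:
# a linear classifier scores crawled pages by how "Wikipedia-like" they are.
# Training data and threshold here are stand-ins, not the exact RedPajama setup.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positive examples: pages referenced by Wikipedia; negatives: random crawl pages.
wiki_like_pages = ["...text of pages cited by Wikipedia..."]
random_crawl_pages = ["...text of random Common Crawl pages..."]

clf = make_pipeline(
    HashingVectorizer(n_features=2**20, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
clf.fit(
    wiki_like_pages + random_crawl_pages,
    [1] * len(wiki_like_pages) + [0] * len(random_crawl_pages),
)

def keep_page(text: str, threshold: float = 0.5) -> bool:
    # Retain a page only if the classifier scores it as Wikipedia-like enough.
    return clf.predict_proba([text])[0, 1] >= threshold
```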
Distributed training is a key focus of Together's research on reducing the costs of training and inference. GPT-JT was trained using a precursor to our CocktailSGD training method, which reduces the network requirements for training by 117x. CocktailSGD, as the name suggests, combines several methods, including quantization, asynchrony, local training and top-K compression, to fine-tune large models over 1Gbps links. This allows us to use servers distributed across data centers and connected over the open internet. It also allows for the best possible utilization of GPUs within a data center. The CocktailSGD paper has been accepted at ICML, so a detailed exploration will be published soon! We are quite optimistic that this set of techniques can be expanded and generalized to training large neural-network-based architectures.
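For intuition, here is a toy sketch of two of the ingredients named above, top-K sparsification and low-bit quantization of gradients, applied to a single tensor. It is illustrative only and not Together's CocktailSGD implementation, which also involves asynchrony and local training.

```python
# Toy sketch of two ingredients CocktailSGD combines (top-K sparsification and
# low-bit quantization of gradients); this is not Together's implementation.
import torch

def topk_sparsify(grad: torch.Tensor, k_ratio: float = 0.01):
    """Keep only the k largest-magnitude gradient entries; send values + indices."""
    flat = grad.flatten()
    k = max(1, int(k_ratio * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def quantize_4bit(values: torch.Tensor):
    """Uniformly quantize the kept values to 4 bits (16 levels) plus a scale."""
    scale = values.abs().max().clamp(min=1e-12)
    q = torch.round(values / scale * 7).clamp(-8, 7).to(torch.int8)
    return q, scale

def decompress(q, scale, indices, shape):
    """Reconstruct a dense (approximate) gradient on the receiving worker."""
    dense = torch.zeros(shape).flatten()
    dense[indices] = q.float() / 7 * scale
    return dense.view(shape)

# Example: compress a fake gradient, then reconstruct it.
g = torch.randn(1024, 1024)
vals, idx, shape = topk_sparsify(g)
q, scale = quantize_4bit(vals)
g_hat = decompress(q, scale, idx, shape)
```

The point of compression schemes like this is that only the small packet of quantized values and indices needs to travel between workers, which is what makes slow inter-data-center links workable.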
I am excited about state space models, which are sub-quadratic and support much larger contexts. There is research around applying imitation learning to address hallucinations, and work around data mixtures, like DoReMi, which will have a large impact. I think research on the data side is going to be the cornerstone of progress for the next few years.
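For context on why state space models are attractive for long contexts, the toy recurrence below processes a sequence with a fixed-size hidden state, so compute grows linearly with sequence length rather than quadratically as in full attention. It is a generic linear SSM sketch, not any specific published architecture.

```python
# Toy sketch of why state space models scale linearly with sequence length:
# a fixed-size hidden state is updated once per token, so cost is O(L) rather
# than the O(L^2) of full attention. Generic linear SSM, not a specific model.
import torch

def ssm_scan(x, A, B, C):
    """x: (L, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    h = torch.zeros(A.shape[0])
    outputs = []
    for x_t in x:                 # one fixed-size state update per token
        h = A @ h + B @ x_t       # h_t = A h_{t-1} + B x_t
        outputs.append(C @ h)     # y_t = C h_t
    return torch.stack(outputs)

L, d_in, d_state, d_out = 4096, 16, 64, 16
y = ssm_scan(torch.randn(L, d_in), 0.9 * torch.eye(d_state),
             0.1 * torch.randn(d_state, d_in), 0.1 * torch.randn(d_out, d_state))
```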
💥 Miscellaneous – a set of rapid-fire questions

Open-source models are transparent. We know the data that composed these models, and we have a better ability to reason about their behavior. This will be increasingly important as foundation models are used by regulated industries and in mission-critical applications.

Open-source models are privacy friendly. You can deploy them in infrastructure under your control and use them with sensitive data. The single-tenant SaaS model of closed-source foundation models is problematic in this way: you have to place a lot of trust in a company, especially if you are going to use closed models with sensitive customer data.

Open-source models can be customized. You can fine-tune them, or pre-train them from the final (and in some cases intermediate) checkpoints, on large amounts of data, as sketched below. We see 10-12 points of accuracy improvement on fine-tuned models.

Open-source models give you control. They won't disappear or change in unexpected ways, like a closed-source model might behind an API. Again, as these models mature and become critical to applications, developers will want more control over the weights.
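As a concrete example of the customization point, here is a minimal fine-tuning sketch using the Hugging Face Trainer. The dataset path is a placeholder, and the checkpoint shown (RedPajama-INCITE-Base-3B-v1) stands in for any open model you might adapt.

```python
# Minimal sketch of customizing an open checkpoint on your own data with the
# Hugging Face Trainer; the model name and dataset path are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "togethercomputer/RedPajama-INCITE-Base-3B-v1"  # any open checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token            # needed for batching
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Hypothetical JSONL file with a "text" field containing your domain data.
data = load_dataset("json", data_files="my_domain_corpus.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```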
Centralized training on HPC-style clusters is likely the fastest way to build models today. While there's a lot of scope for optimization in the centralized setting, we generally have good software and knowledge here, and for companies building large foundation models it often makes sense to follow best practices and go with well-understood infrastructure.

Decentralized training wins significantly on cost. It's easier to get slightly lower-end hardware in multiple locations, and you can achieve lower upfront costs and elasticity. Centralized training will likely run into scale-out limits, so the largest models are likely going to be trained in decentralized settings in the future.

Architecturally, the progress feels similar to how things played out in the database world, where we had big monolithic databases and then distributed, fault-tolerant options like DHTs appeared. There's a place for both, but I believe the techniques created for decentralized training will increasingly percolate to centralized settings.
I expect amazing progress in open source foundation models in 2023. We'll surpass GPT-3.5 quality in the open this year, and it will be a fairly big moment for open source. I also believe we'll continue to see optimization work that reduces the cost of working with AI, given the more resource-constrained landscape, and we'll see new architectures beyond transformers. I also expect SOTA open models in code, music, biology, and other niche areas.