Datasets Matter: The Battle Between Open and Closed Generative AI is Not Only About Models Anymore
Was this email forwarded to you? Sign up here Datasets Matter: The Battle Between Open and Closed Generative AI is Not Only About Models AnymoreTwo major open source datasets were released this week.Next Week in The Sequence:
You can subscribe to The Sequence below:📝 Editorial: Datasets Matter: The Battle Between Open and Closed Generative AI is Not Only About Models AnymoreThe battle between open and closed generative AI has been at the center of industry developments. From the very beginning, the focus has been on open vs. closed models, such as Mistral and Llama vs. GPT-4 and Claude. Less attention has been paid to other foundational aspects of the model lifecycle, such as the datasets used for training and fine-tuning. In fact, one of the limitations of the so-called open weight models is that they don’t disclose the training datasets and pipeline. What if we had high-quality open source datasets that rival those used to pretrain massive foundation models? Open source datasets are one of the key aspects to unlocking innovation in generative AI. The costs required to build multi-trillion token datasets are completely prohibitive to most organizations. Leading AI labs, such as the Allen AI Institute, have been at the forefront of this idea, regularly open sourcing high-quality datasets such as the ones used for the Olmo model. Now it seems that they are getting some help. This week, we saw two major efforts related to open source generative AI datasets. Hugging Face open-sourced FineWeb, a 44TB dataset of 15 trillion tokens derived from 96 CommonCrawl snapshots. Hugging Face also released FineWeb-Edu, a subset of FineWeb focused on educational value. But Hugging Face was not the only company actively releasing open source datasets. Complementing the FineWeb release, AI startup Zyphra released Zyda, a 1.3 trillion token dataset for language modeling. The construction of Zyda seems to have focused on a very meticulous filtering and deduplication process and shows remarkable performance compared to other datasets such as Dolma or RedefinedWeb. High-quality open source datasets are paramount to enabling innovation in open generative models. Researchers using these datasets can now focus on pretraining pipelines and optimizations, while teams using those models for fine-tuning or inference can have a clearer way to explain outputs based on the composition of the dataset. The battle between open and closed generative AI is not just about models anymore. 🔎 ML ResearchExtracting Concepts from GPT-4OpenAI published a paper proposing an interpretability technique to understanding neural activity within LLMs. Specifically, the method uses k-sparse autoencoders to control sparsity which leads to more interpretable models —> Read more. Transformer are SSMsResearchers from Princeton University and Carnegie Mellon University published a paper outlining theoretical connections between transformers and SSMs. The paper also proposes a framework called state space duality and a new architecture called Mamba-2 which improves the performance over its predecessors by 2-8x —> Read more. Believe or Not Believe LLMsGoogle DeepMind published a paper proposing a technique to quantify uncertainty in LLM responses. The paper explores different sources of uncertainty such as lack of knowledge and randomness in order to quantify the reliability of an LLM output —> Read more. CodecLMGoogle Research published a paper introducing CodecLM, a framework for using synthetic data for LLM alignment in downstream tasks. CodecLM leverages LLMs like Gemini to encode seed intrstructions into the metadata and then decodes it into synthetic intstructions —> Read more. TinyAgentResearchers from UC Berkeley published a detailed blog post about TinyAgent, a function calling tuning method for small language models. TinyAgent aims to enable function calling LLMs that can run on mobile or IoT devices —> Read more. ParrotResearchers from Shanghai Jiao Tong University and Microsoft Research published a paper introducing Parrot, a framework for correlating multiple LLM requests. Parrot uses the concept of a Semantic Variable to annotate input/output variables in LLMs to enable the creation of a data pipeline with LLMs —> Read more. 🤖 Cool AI Tech ReleasesFineWebHuggingFace open sourced FineWeb, a 15 trillion token dataset for LLM training —> Read more. Stable Audion OpenStability AI open source Stable Audio Open, its new generative audio model —> Read more. Mistral Fine-TuneMistral open sourced mistral-finetune SDK and services for fine-tuning models programmatically —> Read more. ZydaZyphra Technologies open sourced Zyda, a 1.3 trillion token dataset that powers the version of its Zamba models —> Read more. 🛠 Real World AISalesforce discusses their use of Amazon SageMaker in their Einstein platform —> Read more. 📡AI Radar
You’re on the free list for TheSequence Scope and TheSequence Chat. For the full experience, become a paying subscriber to TheSequence Edge. Trusted by thousands of subscribers from the leading AI labs and universities. |
Older messages
Edge 402: UC Berkeley's Large World Model Can Understand Really Long Videos
Thursday, June 6, 2024
One of the most impressive research in generative video of the last year. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Edge 401: Reflection and Refinement Planning Methods in Autonomous Agents
Tuesday, June 4, 2024
Can LLM agents handle planning errorts effectively? ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Generative AI Unicorn Capitulation
Monday, June 3, 2024
Adept and Humane are looking for buyers. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Edge 399: Understanding External-Aid Planning and Autonomous Agents
Monday, June 3, 2024
How do we supply an agents with external help to improve its planning capabilities? ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Edge 400: Inside AlphaFold 3: Google DeepMind's Amazing BioScience Model
Monday, June 3, 2024
The model expands from its predecessors and is able to predict the structure of many of the life's molecules. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
You Might Also Like
📳 Galaxy Z Flip 6 Review — How to Watch the 2024 Summer Olympics for Free
Friday, July 26, 2024
Also: Fixing Spotify's Repeating Ads, and More! How-To Geek Logo July 26, 2024 Did You Know The rectangular area of a flag found in the upper left corner (top hoist corner) of the flag, such as the
Your monthly update has arrived
Friday, July 26, 2024
What's new in Google Play and Android July 2024 The Collections surface engages users with content Introducing Collections, a new on-device surface for your content Collections present users with
iOS Dev Weekly - Issue 671
Friday, July 26, 2024
There are two types of apps on the visionOS App Store. Will you create an app that makes people reach for the headset? 🥽 View on the Web Archives ISSUE 671 July 26th 2024 Comment In the last two weeks
Ranked | The 10 Busiest Ports in the World, by Cargo Traffic 🚢
Friday, July 26, 2024
As critical nodes for trade and commercial activity, we show the top 10 busiest ports in the world by cargo volume. View Online | Subscribe Presented by: Is Your Portfolio Powering the Future? >>
Let the Games Begin
Friday, July 26, 2024
Week of July 22, 2024 Let the Games Begin Week of July 22, 2024 By MG Siegler • 26 Jul 2024 View in browser View in browser Mark Zuckerberg loves two things above all else right now: llamas and
Daily Coding Problem: Problem #1508 [Hard]
Friday, July 26, 2024
Daily Coding Problem Good morning! Here's your coding interview problem for today. This problem was asked by Uber. Given an array of integers, return a new array such that each element at index i
OpenAI announces SearchGPT - Weekly News Roundup - Issue #477
Friday, July 26, 2024
Plus: Will billionaires live forever; a police robot dog jamming wireless networks; Alphabet to invest $5B into Waymo; warnings about “model collapse”; a new partnership for AI security; and more! ͏ ͏
Using Data as a Product Manager
Friday, July 26, 2024
If you had your choice between a little data or a lot of data on which to guide decisions, which would you pick?
Last Mile of Blockchains: RPC and Node-as-a-Service
Friday, July 26, 2024
Top Tech Content sent at Noon! Find the hottest jobs from top tech companies Read this email in your browser How are you, @newsletterest1? 🪐 What's happening in tech today, July 26, 2024? The
⚙️ Generative AI is making workers less productive
Friday, July 26, 2024
Plus: Runway trained video generator on thousands of YouTube videos