Datasets Matter: The Battle Between Open and Closed Generative AI is Not Only About Models Anymore
Two major open source datasets were released this week.
Next Week in The Sequence:
📝 Editorial: Datasets Matter: The Battle Between Open and Closed Generative AI is Not Only About Models Anymore

The battle between open and closed generative AI has been at the center of industry developments. From the very beginning, the focus has been on open vs. closed models, such as Mistral and Llama vs. GPT-4 and Claude. Less attention has been paid to other foundational aspects of the model lifecycle, such as the datasets used for training and fine-tuning. In fact, one of the limitations of the so-called open weight models is that they don’t disclose their training datasets and pipelines. What if we had high-quality open source datasets that rival those used to pretrain massive foundation models?

Open source datasets are one of the keys to unlocking innovation in generative AI. The cost of building multi-trillion token datasets is prohibitive for most organizations. Leading AI labs, such as the Allen Institute for AI, have been at the forefront of this idea, regularly open sourcing high-quality datasets such as the ones used for the OLMo model. Now it seems that they are getting some help.

This week, we saw two major efforts related to open source generative AI datasets. Hugging Face open sourced FineWeb, a 44TB dataset of 15 trillion tokens derived from 96 CommonCrawl snapshots. Hugging Face also released FineWeb-Edu, a subset of FineWeb focused on educational value.

But Hugging Face was not the only company releasing open source datasets. Complementing the FineWeb release, AI startup Zyphra released Zyda, a 1.3 trillion token dataset for language modeling. The construction of Zyda seems to have focused on a meticulous filtering and deduplication process, and the dataset shows remarkable performance compared to other datasets such as Dolma or RefinedWeb.

High-quality open source datasets are paramount to enabling innovation in open generative models. Researchers using these datasets can now focus on pretraining pipelines and optimizations, while teams using the resulting models for fine-tuning or inference have a clearer way to explain outputs based on the composition of the dataset. The battle between open and closed generative AI is not just about models anymore.

🔎 ML Research

Extracting Concepts from GPT-4
OpenAI published a paper proposing an interpretability technique for understanding neural activity within LLMs. Specifically, the method uses k-sparse autoencoders to control sparsity, which leads to more interpretable models -> Read more.

Transformers are SSMs
Researchers from Princeton University and Carnegie Mellon University published a paper outlining theoretical connections between transformers and state space models (SSMs). The paper also proposes a framework called state space duality and a new architecture called Mamba-2, which improves performance over its predecessor by 2-8x -> Read more.

Believe or Not Believe LLMs
Google DeepMind published a paper proposing a technique to quantify uncertainty in LLM responses. The paper explores different sources of uncertainty, such as lack of knowledge and randomness, in order to quantify the reliability of an LLM output -> Read more.

CodecLM
Google Research published a paper introducing CodecLM, a framework for using synthetic data for LLM alignment in downstream tasks. CodecLM leverages LLMs like Gemini to encode seed instructions into metadata and then decodes that metadata into synthetic instructions -> Read more.

TinyAgent
Researchers from UC Berkeley published a detailed blog post about TinyAgent, a function calling tuning method for small language models. TinyAgent aims to enable function calling LLMs that can run on mobile or IoT devices -> Read more.

Parrot
Researchers from Shanghai Jiao Tong University and Microsoft Research published a paper introducing Parrot, a framework for correlating multiple LLM requests. Parrot uses the concept of a Semantic Variable to annotate the input/output variables of LLM requests, which enables the creation of data pipelines with LLMs -> Read more.

🤖 Cool AI Tech Releases

FineWeb
Hugging Face open sourced FineWeb, a 15 trillion token dataset for LLM training -> Read more.

Stable Audio Open
Stability AI open sourced Stable Audio Open, its new generative audio model -> Read more.

Mistral Fine-Tune
Mistral open sourced the mistral-finetune SDK and services for fine-tuning models programmatically -> Read more.

Zyda
Zyphra Technologies open sourced Zyda, a 1.3 trillion token dataset that powers a version of its Zamba models -> Read more.

🛠 Real World AI

Salesforce discusses its use of Amazon SageMaker in its Einstein platform -> Read more.

📡 AI Radar
Older messages
Edge 402: UC Berkeley's Large World Model Can Understand Really Long Videos
Thursday, June 6, 2024
One of the most impressive research efforts in generative video of the last year.
Edge 401: Reflection and Refinement Planning Methods in Autonomous Agents
Tuesday, June 4, 2024
Can LLM agents handle planning errors effectively?
Generative AI Unicorn Capitulation
Monday, June 3, 2024
Adept and Humane are looking for buyers.
Edge 399: Understanding External-Aid Planning and Autonomous Agents
Monday, June 3, 2024
How do we supply an agent with external help to improve its planning capabilities?
Edge 400: Inside AlphaFold 3: Google DeepMind's Amazing BioScience Model
Monday, June 3, 2024
The model expands on its predecessors and is able to predict the structure of many of life's molecules.