Next Week in Turing Post:
- Wednesday, AI 101: What is LongRAG?
- Friday, Interview with Innovators: we discuss the impact of AI on search engines with ML experts from Yandex Search
If you like Turing Post, consider becoming a paid subscriber. You’ll immediately get full access to all our articles, investigations, and tech series →
The last week was marked by two very interesting research papers on synthetic data in AI, offering thought-provoking insights into the future of this technology. The first, "LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives," by Cohere researchers, explores how synthetic data can be used to fine-tune AI models. The second, "Scaling Synthetic Data Creation with 1,000,000,000 Personas," by Tencent AI Lab, unveils a colossal persona-driven framework for generating diverse and realistic synthetic data.

What if we could combine these approaches? Active inheritance lets us guide AI models toward desirable attributes, such as reduced toxicity and increased lexical diversity. Imagine layering this with the vast, varied personas from Persona Hub. One billion personas is no joke! Could we then create a new generation of AI – a new AI Nation – trained on data that's diverse, ethically sound, and highly functional?

The potential here is immense. Together, these papers suggest a future where AI models are not just trained but finely sculpted through sophisticated data generation techniques.

There are, of course, a few questions to consider. As we steer AI behavior through targeted data, how do we ensure we’re not embedding unintended biases? Data from one billion personas is massive – how do we manage it ethically and effectively? And how can we make sure this AI Nation doesn't simply inherit the biases of us humans?

Synthetic data is on the rise, and we still don’t know all the answers – or even the right questions to ask. The conversation around synthetic data in AI is just beginning; the promise is truly fascinating, and it's one we must approach with both enthusiasm and caution.
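To make the "combine these approaches" idea concrete, here is a minimal sketch of what persona-conditioned generation plus active-inheritance-style selection could look like. Everything here is an assumption for illustration: the personas, the `generate` placeholder (standing in for any LLM call), and the use of type-token ratio as the non-differentiable objective. The real papers use far richer personas and metrics.

```python
import random

# Illustrative personas in the spirit of Persona Hub (invented examples).
PERSONAS = [
    "a marine biologist who explains things with ocean analogies",
    "a retired air-traffic controller who values precision",
    "a high-school poetry teacher",
]

def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique words / total words (a simple proxy metric)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def generate(prompt: str, seed: int) -> str:
    """Placeholder for an LLM completion call; returns random filler text."""
    rng = random.Random(seed)
    vocab = ["data", "model", "ocean", "signal", "pattern", "tide", "learning"]
    return " ".join(rng.choice(vocab) for _ in range(20))

def persona_sample(instruction: str, n_candidates: int = 4) -> str:
    """Generate candidates under random personas, keep the most diverse one.
    Selecting by a non-differentiable metric is the 'active inheritance' step."""
    best, best_score = "", -1.0
    for i in range(n_candidates):
        persona = random.choice(PERSONAS)
        prompt = f"You are {persona}. {instruction}"
        candidate = generate(prompt, seed=i)
        score = lexical_diversity(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

print(persona_sample("Explain overfitting in two sentences."))
```

Swapping the diversity metric for a toxicity classifier (keeping the *least* toxic candidate) would steer the synthetic dataset the other way, which is exactly the kind of targeted sculpting the Cohere paper describes.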
Click the link below so we can make some money on this ad 🙂 You might also like what they offer →

Ship AI projects faster, cleaner, and better with AE Studio

Don’t let limited talent resources or skill gaps be the reason your projects don't make it to the finish line. AE Studio's expert team of developers, data scientists, and designers can help accelerate your projects – without compromising clean code and infrastructure.

We've helped companies like Berkshire Hathaway, EVGo, and Ritual build and ship big ideas that changed the way they do business.

Have a project in the works or want to learn more about our work? Let's talk
10+ Research Papers to Learn More About Vision Language Models (VLMs)
A list of research papers for a better understanding of how VLMs work
www.turingpost.com/p/vlms-rp
News from The Usual Suspects ©

AI’s Financial Situation

Anthropic’s Safety Dance

Character.AI’s Love Triangle
Character.AI, the chatbot trendsetter, is flirting with Google and Meta as competition heats up. Once the darling of quirky AI interactions, it's now navigating partnerships and content controversies to stay in the game.
Apple's AI Adventure
Apple is joining forces with OpenAI, gaining an observer seat on its board. Phil Schiller will oversee this AI alliance, aiming to integrate ChatGPT into Apple devices and boost Siri’s smarts – all without spending a dime.
Stability AI’s Generous Diffusion

World Artificial Intelligence Conference (WAIC) in Shanghai
Despite U.S. restrictions, China’s AI firms continue to rival market leaders. As often happens, sanctions fuel innovation, and Chinese companies successfully develop workarounds to remain competitive. At WAIC, SenseTime unveiled SenseNova 5.5, claiming it outperforms GPT-4 in key metrics. Alibaba highlighted user growth for its Tongyi Qianwen models, which have over 20 million downloads. Both companies emphasize their commitment to open-source development amidst intense domestic competition in the AI sector.

Elon Musk is a frequent visitor to China. Tesla's Optimus humanoid robot made a splash at WAIC, though safely behind glass. Alongside it, 18 Chinese robotics firms showcased their bots, tackling high costs and U.S. tech restrictions with creative solutions. Discussions centered on how Chinese companies can innovate despite U.S. technology restrictions, focusing on areas like cloud computing and AI application development.
Kyutai’s Voice Revolution
Kyutai introduced Moshi, the first openly accessible voice-enabled AI, created by an 8-member team in just six months. Demonstrated in Paris, Moshi's code and model weights are free to all, pushing for open collaboration in AI. I liked the reaction of Hugging Face’s CTO Julien Chaumond the most:
In other newsletters/posts (a lot of thought-provoking pieces!):

The freshest research papers, categorized for your convenience

Optimization and Performance Enhancements

- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention – Utilizes dynamic sparse attention patterns to speed up the pre-filling stage of long-context LLMs, significantly reducing inference latency while maintaining accuracy. Read the paper
- AGENTLESS: Demystifying LLM-based Software Engineering Agents – Simplifies LLM-based software development using a two-step process of localization and repair without autonomous tool usage, achieving high performance at low cost. Read the paper
- RouteLLM: Learning to Route LLMs with Preference Data – Optimizes cost and performance by dynamically selecting between strong and weak LLMs, reducing costs while maintaining response quality through data augmentation and human preference data. Read the paper
- LiteSearch: Efficacious Tree Search for LLM – Develops a novel tree search algorithm to improve LLMs' performance on mathematical reasoning tasks, reducing computational costs while maintaining competitive performance. Read the paper
- Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models – Proposes Expert-Specialized Fine-Tuning (ESFT) for sparse Mixture-of-Experts (MoE) architectures, tuning only the most relevant experts for a task, improving tuning efficiency and performance. Read the paper
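The routing idea behind RouteLLM is easy to picture with a toy sketch: score each query for difficulty, then send easy queries to a cheap model and hard ones to a strong one. Everything below is an assumption for illustration – the keyword-based scorer and the model-tier names are invented; the actual paper trains a router on human preference data.

```python
def difficulty(query: str) -> float:
    """Toy difficulty proxy: longer, proof-like queries score higher.
    A real router would use a classifier trained on preference data."""
    score = min(len(query.split()) / 50.0, 1.0)
    if any(k in query.lower() for k in ("prove", "derive", "multi-step")):
        score += 0.5
    return min(score, 1.0)

def route(query: str, threshold: float = 0.4) -> str:
    """Return which model tier should answer the query."""
    return "strong-llm" if difficulty(query) >= threshold else "weak-llm"

assert route("What is 2+2?") == "weak-llm"
assert route("Prove the convergence of gradient descent on convex functions") == "strong-llm"
```

The threshold is the cost/quality dial: raising it sends more traffic to the cheap model, which is exactly the trade-off the paper quantifies.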
Benchmarks and Evaluation

- TabReD: A Benchmark of Tabular Machine Learning in-the-Wild – Presents a benchmark collection of industry-grade tabular datasets with temporal splits, highlighting the performance of different architectures and the impact of time-based splits. Read the paper
- Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems – Proposes the SummHay task to evaluate LLMs and RAG systems on long-context summarization, highlighting models' challenges in precise citation and comprehensive coverage. Read the paper
- MIRAI: Evaluating LLM Agents for Event Forecasting – Develops a benchmark for assessing LLM agents' capabilities in predicting international events using the GDELT event database, highlighting the need for advanced temporal reasoning. Read the paper
- WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? – Introduces a benchmark for evaluating visual mathematical reasoning in LMMs, revealing significant struggles with insufficient knowledge despite advancements in generalization. Read the paper
Content Regulation, Alignment, and Safety

- UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI – Highlights that unlearning fails to prevent reintroduction of removed knowledge through in-context learning, emphasizing the need for robust content filtering mechanisms. Read the paper
- ProgressGym: Alignment with a Millennium of Moral Progress – Introduces a framework to align LLMs with human moral progress using historical texts and LLMs, offering benchmarks to track evolving values and address value lock-in risks in AI. Read the paper
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks – Proposes a method to defend against jailbreak attacks by unlearning harmful knowledge, significantly reducing attack success rates and demonstrating remarkable generalizability. Read the paper
- A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses – Explores limitations of current AI safety measures, introducing "inferential adversaries" to exploit seemingly safe outputs, emphasizing the need for new defense mechanisms. Read the paper
- Self-Evaluation as a Defense Against Adversarial Attacks on LLMs – Develops a defense mechanism using self-evaluation to reduce attack success rates, outperforming existing defenses and remaining robust even under adaptive attacks. Read the paper
Multimodal Models and Applications

- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities – Trains a vision model on over twenty diverse modalities, enabling it to perform a wide range of tasks without performance loss, enhancing multimodal generation and retrieval. Read the paper
- Understanding alignment in multimodal LLMs: a comprehensive study – Explores alignment of responses in multimodal LLMs with image content, proposing Bias-Driven Hallucination Sampling (BDHS) and highlighting the benefits of combined offline and online methods. Read the paper
- ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning – Integrates LLMs with the Robot Operating System (ROS) to facilitate intuitive robot programming, incorporating feedback to refine tasks, demonstrating robustness and scalability. Read the paper
- STARK: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge – Introduces a large-scale multi-modal conversation dataset featuring diverse social personas and images, enabling the creation of advanced conversation models with superior visual imagination abilities. Read the paper
Advanced Techniques and New Models

- Chain-of-knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs – Enhances LLMs with knowledge reasoning abilities using knowledge graphs and a trial-and-error mechanism, improving general reasoning capabilities and addressing rule overfitting. Read the paper
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States – Proposes Test-Time Training (TTT) layers, which update hidden states even during test sequences, demonstrating superior performance to Transformer and modern RNN baselines in long context scenarios. Read the paper
- E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS – Introduces a non-autoregressive zero-shot text-to-speech system with a simple architecture, achieving human-level naturalness and state-of-the-art speaker similarity and intelligibility. Read the paper
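The Test-Time Training idea is the most unusual of these: the hidden state is itself the parameter of a tiny model, updated by a gradient step of a self-supervised loss at every token, even at test time. Here is a deliberately scalar toy sketch of that mechanism (all of it an illustrative assumption – the paper's TTT layers use much richer inner models and losses):

```python
def ttt_scan(tokens, w=0.0, lr=0.1):
    """Process a sequence; the state w learns to predict each token from
    the previous one via online gradient descent on (w*prev - x)^2."""
    outputs, prev = [], 0.0
    for x in tokens:
        pred = w * prev                 # "read" from the learned state
        outputs.append(pred)
        grad = 2 * (pred - x) * prev    # d/dw of (w*prev - x)^2
        w -= lr * grad                  # test-time update of the state
        prev = x
    return outputs, w

# On a constant sequence, predictions improve as w adapts toward 1.0:
outs, w = ttt_scan([1.0, 1.0, 1.0, 1.0, 1.0])
```

Because the state update is a learning step rather than a fixed recurrence, the layer keeps adapting however long the sequence runs, which is the intuition behind its long-context gains.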
Long-Context and Retrieval Capabilities

- Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP – Argues that defining long-context NLP tasks by input length is insufficient, proposing a taxonomy to better evaluate and develop LLM capabilities in genuinely difficult long-context scenarios. Read the paper
- Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER – Employs instruction-tuning with enriched prompts containing definitions and guidelines, significantly improving the model's ability to generalize to unseen entity types in NER tasks. Read the paper
Novel Architectures and Techniques

- Consistency Flow Matching: Defining Straight Flows with Velocity Consistency – Enhances flow matching in generative models by enforcing self-consistency in the velocity field, improving training efficiency and sample quality. Read the paper
- DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning – Improves LLM performance on complex math tasks by decomposing problems into logical subtasks and incorporating self-correction, demonstrating robust generalization capabilities. Read the paper
Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!
Leave a review!