📝 Guest Post: LLMs & humans: The perfect duo for data labeling
How to build a pipeline to achieve superhuman quality

In this guest post, Sergei Tilga, R&D Lead at Toloka AI, offers practical insights on combining Large Language Models (LLMs) and human annotators in data labeling projects to optimize for quality while also reducing costs. Rooted in real-world results and examples, this article is a valuable guide for anyone looking to get the most out of LLMs and human input in data labeling.

You may have heard that LLMs are faster, cheaper, and better than humans at text annotation. Does this mean we no longer need human data labeling? Not exactly. Based on our experience using LLMs on real-world text annotation projects, even the latest state-of-the-art models aren't meeting quality expectations. What's more, these models aren't always cheaper than data labeling with human annotators. But we've found that it is possible to elevate data quality by using an optimal mix of human and LLM labeling.

Comparing the quality of LLMs and human labeling

A growing number of general-use LLMs are publicly available, and they all belong to one of two camps: open source or closed API. Open source models are generally much cheaper to run. You can see comparisons of overall model performance on Meta's benchmarks table, in this paper, and on the LMSYS leaderboard. However, most evaluation projects are based on open-source datasets. To get a clear picture of LLM performance, we need to compare output on real-world projects as well.

We've been testing multiple LLMs on our own data labeling projects and comparing them to human labeling with a crowd of trained annotators. To assess the quality of both humans and models, we compare their labels to ground truth labels prepared by experts. For most of the projects we run, we see good results from two models in particular: Llama 2 and GPT-4, discussed below.
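To make that comparison concrete, here is a minimal Python sketch of how labels from an LLM or from crowd annotators could be scored against expert ground truth. The function name, example labels, and tiny dataset are illustrative assumptions, not Toloka's actual evaluation code.

```python
def accuracy_against_gold(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Share of items where the candidate label matches the expert label.

    `predicted` maps item id -> label from the LLM or the crowd;
    `gold` maps item id -> expert-verified ground truth label.
    """
    scored = [item_id for item_id in gold if item_id in predicted]
    if not scored:
        return 0.0
    correct = sum(predicted[item_id] == gold[item_id] for item_id in scored)
    return correct / len(scored)


# Hypothetical sentiment labels, both evaluated against the same expert ground truth.
gold = {"t1": "positive", "t2": "negative", "t3": "neutral"}
llm_labels = {"t1": "positive", "t2": "negative", "t3": "positive"}
crowd_labels = {"t1": "positive", "t2": "negative", "t3": "neutral"}

print(f"LLM accuracy:   {accuracy_against_gold(llm_labels, gold):.2f}")    # 0.67
print(f"Crowd accuracy: {accuracy_against_gold(crowd_labels, gold):.2f}")  # 1.00
```

The same kind of agreement score against expert labels is what the quality comparisons throughout this post refer to.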
The dilemma of optimizing costs vs quality

If we're talking about an open source model (Llama 2) where we only pay for GPU usage, data labeling might be hundreds of times cheaper than human crowd labeling. But what's the tradeoff? It all depends on the complexity of the task. Our experiments have achieved near-human accuracy with Llama 2 on sentiment analysis, spam detection, and a few other types of tasks. The inference price is generally low, but there are other expenses involved: you need the infrastructure to run fine-tuning and inference, and you need to collect a human-labeled dataset for fine-tuning at the outset.

GPT-4 provides acceptable results right out of the gate. You just need to write a detailed prompt with task instructions and examples in text format. However, inference is either more expensive than human labeling or only slightly cheaper, and the quality is lower.

The good news is you don't have to choose between using a model or using human annotation, and you don't have to sacrifice quality to gain efficiency. We get the best results when we maximize the capabilities of human labeling and LLM labeling at the same time. By carefully analyzing output, we often find subsets of data where a fine-tuned LLM performs better than humans. We can take advantage of this strength to achieve overall quality that is better than human-only or LLM-only labeling. Does this mean we have a silver bullet to enhance quality without spending more money? Absolutely.

The silver bullet: intelligent hybrid pipelines

Toloka uses hybrid pipelines, meaning we add an LLM to the data labeling pipeline alongside humans. When done correctly, this approach can optimize costs while achieving unprecedented quality. So how do we structure a hybrid pipeline?

When LLMs output a label, they can also output a confidence level. We use this information in a pipeline that combines LLM and human labeling. The approach is simple: the LLM labels the data, and labels with low confidence are sent to the crowd for relabeling (a minimal sketch of this routing follows the examples below). We can change the confidence threshold to adjust the amount of data that is relabeled by humans and control the quality of the final labels. For typical tasks, the more we use human labeling, the higher the cost. The trick is to find the right threshold for model confidence to optimize for quality or cost, depending on our goal. We can choose any point on this curve and achieve the cost-quality trade-off that is needed for a specific task.

In some cases, quality optimization with LLMs just doesn't work. This usually happens on complex tasks where the model's quality lags far behind human quality. We can still use the model for cost optimization and get results that are very close to human accuracy, as shown below.

Examples of cost-optimized pipelines

Here are some real-life results of using cost-optimized hybrid pipelines, compared to human-only and LLM-only workflows. The potential for optimization strongly depends on the complexity of the task. Naturally, the further the model's output lags behind human quality, the less we can automate.

Examples of quality-optimized pipelines

This table shows results on quality-optimized pipelines, where we leveraged the LLM to achieve better-than-human quality. On these tasks, the optimized hybrid pipeline performed impressively well compared to humans alone, while also reducing costs.
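For readers who want the routing logic spelled out, here is a minimal Python sketch of the confidence-threshold split described in the hybrid pipeline section above. The `llm_label` placeholder, the default threshold, and the field names are illustrative assumptions, not Toloka's production code.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    text: str
    label: str = ""
    confidence: float = 0.0
    source: str = "unlabeled"  # becomes "llm" after model labeling, "crowd" after relabeling


def llm_label(text: str) -> tuple[str, float]:
    """Placeholder for an LLM call that returns (label, confidence).

    A real pipeline would wrap a fine-tuned model or an API that exposes
    per-label probabilities.
    """
    raise NotImplementedError


def route(items: list[Item], threshold: float = 0.85) -> tuple[list[Item], list[Item]]:
    """Keep high-confidence LLM labels; queue everything else for human relabeling.

    Raising `threshold` sends more items to the crowd (higher quality, higher cost);
    lowering it keeps more LLM labels (lower cost, quality closer to LLM-only).
    """
    accepted, for_crowd = [], []
    for item in items:
        item.label, item.confidence = llm_label(item.text)
        item.source = "llm"
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            for_crowd.append(item)  # relabeled by crowd annotators downstream
    return accepted, for_crowd
```

Sweeping the threshold from 0 to 1 traces the cost-quality curve described above: at 0 every label stays with the LLM, and at 1 every item goes to the crowd.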
Human + LLM pipelines in production

Reliable quality is our number one priority, and we build quality control steps into every data labeling pipeline that goes into production. We check up to 5% of the final labels, both LLM output and human output, using validation by experts. We then use the results to build metrics that help us continually monitor the overall quality of the pipeline.

So how do we mitigate quality issues? If quality drops, the standard solution is to add more data and fine-tune the model again. Sometimes the data keeps getting more complex and we don't see improvement in the model's output. In this case, we can adjust the confidence threshold for the model so that a smaller percentage of data is labeled by the model, and more data is labeled by humans. With the right expertise for fine-tuning and the perfect threshold to balance LLM and human labeling, hybrid pipelines can produce exceptional quality in many cases.

Preparing your own pipeline: paving the way to success

To run your own hybrid data labeling pipeline that leverages LLMs and human insight, you'll need to lay the groundwork by covering four essential steps:
Toloka can help you in every stage of the AI development process. From finding the right LLM for your task to fine-tuning and designing a hybrid pipeline tailored to your needs, our team is here to support you every step of the way. The decision to optimize for quality or cost is always up to you!

*This post was written by Sergei Tilga, R&D Lead at Toloka AI. We thank Toloka for their insights and ongoing support of TheSequence.