📝 Guest post: Getting ML data labeling right*
What’s the best way to gather and label your data? There’s no trivial answer, as every project is unique. But you can get ideas from a case similar to yours. In this article, Toloka’s team shares its data labeling experience through three different case studies.

Introduction

The number of AI products is skyrocketing with each passing year. Just a few years ago, AI and ML were accessible mostly to the industry’s bigger players: smaller companies didn’t have the resources to build and release quality AI products. But that landscape has changed, and even small enterprises have entered the game. Building production-ready AI solutions is a long process, and it starts with gathering and labeling the data used to train ML models. What’s the best way to make that happen? There’s no easy answer, since every project is unique. But this article will look at three different case studies we’ve worked on.

1. Audio transcription

For this project, we needed to transcribe large amounts of voice recordings to train an Automatic Speech Recognition (ASR) model. We divided the recordings into small audio clips and asked the Toloka crowd to transcribe what they heard, using an interface similar to the one shown below. The recordings were easy to pause as needed. To ensure quality, we also distributed each audio clip to several annotators and only accepted their work when a certain number of transcriptions matched. The pipeline below visualizes that process.

The annotators who took part in the project also went through training. We recommend including this step in almost every project. The training can include tasks similar to the ones required for the project, with a careful explanation whenever annotators make mistakes. That teaches them exactly what they need to do.
2. Evaluating translations

Another interesting project we worked on was evaluating translation systems for a major machine translation conference. Each annotator was given a source text and several candidate translations, then asked how well each translation conveyed the semantics of the original text; the process is shown in the interface below.

For this project, we were very careful about crowd selection, starting with a mandatory language test and exam. During the evaluation process, we constantly monitored participant performance using control tasks whose answers we already knew. These tasks were mixed in with regular tasks so they couldn’t be detected. If an annotator’s quality fell below 80%, they were required to pass the initial test again.

As a result of this project, we annotated more than one hundred language pairs, including some common ones but also more exotic cases such as Xhosa–Zulu or Bengali–Hindi. Those are especially challenging due to the limited number of available annotators.

3. Search relevance

One of the most popular use cases for Toloka is offline evaluation of search relevance. Offline evaluation is needed because online feedback signals, such as dwell time and clicks, are implicit, long-term, and hard to interpret. Offline metrics evaluation lets us focus on individual search characteristics and receive explicit signals about relevance. To do that, we show annotators a search query and a picture of a potentially matching item and ask them to rate the relevance. The interface we use for this type of task is shown below.

We add buttons to search on Google or Amazon. That helps when the query is obscure or difficult to understand, because the annotator can follow the links and see what products match the query in those search engines. Annotators often pick surprising answers. Intuitively, many clients go with a scale system like the one in the red box below when designing tasks.
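As an aside, the control-task monitoring we use in projects like these can be sketched in a few lines of Python. The function name and data shapes are illustrative assumptions, not our production logic; only the 80% threshold comes from the process described above:

```python
def needs_retest(control_answers: dict[str, str],
                 annotator_answers: dict[str, str],
                 threshold: float = 0.8) -> bool:
    """Check an annotator's accuracy on hidden control tasks.

    control_answers maps control-task IDs to their known correct labels;
    annotator_answers maps task IDs to what this annotator submitted.
    Returns True when accuracy falls below the threshold, meaning the
    annotator must retake the qualification test.
    """
    graded = [task for task in control_answers if task in annotator_answers]
    if not graded:
        return False  # no control tasks seen yet, nothing to grade
    correct = sum(annotator_answers[t] == control_answers[t] for t in graded)
    return correct / len(graded) < threshold
```

An annotator who gets three of five control tasks right scores 60% and is flagged for retesting, while one matching all known answers passes.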
With that said, our experiments have shown it is better to choose the categories seen in the green box: exact match, possible replace, accessory, and irrelevant. That makes it incumbent on us to clearly define the categories in the instructions and during training; in this project, we had to provide detailed descriptions of the “accessory” and “possible replace” classes. Additionally, each task was sent to multiple annotators, whose quality was constantly checked using hidden control tasks. Our experience has shown that quality control checks are a crucial component of every annotation project, and we always include them in the projects we design.

Summary

This article has covered three case studies from our data labeling experience. We have shown examples of data annotation projects for audio transcription, MT evaluation, and search engine evaluation. We hope they’ve given you a taste of how we approach this problem and an idea of how you can prepare data for your own ML project. Since every project is different, you’re welcome to join our Slack channel if you have a challenge you’d like to discuss.

*This post was written by Toloka’s team. We thank Toloka for their ongoing support of TheSequence.