📝 Guest post: Using One Methodology to Solve The Three Failure Modes
In this guest post, Eric Landau, CEO of Encord, discusses the three major failure modes that prevent models from reaching production and shows how to solve all three problems with a single methodology.

As many ML engineers can attest, the performance of all models – even the best ones – depends on the quality of their training data. As the world moves further into the age of data-centric AI, improving the quality of training datasets becomes increasingly important, especially if we hope to see more models transition from proof of concept to production. Even now, most models fail to make this transition. The technology is largely ready; bridging the proof-of-concept-to-production gap depends on fixing the training data quality problem. By understanding the problems that hold models back from production quality, machine learning teams can intervene early and overcome the three major failure modes – suboptimal data, labeling errors, and model edge cases – with a single methodology.

Problem One: Suboptimal Curation and Selection

Every machine learning team knows that the data a model trains on has a significant impact on the model's performance. Teams need to select and curate a sample of training data from a distribution that reflects what the model will encounter in the real world. Models need balanced datasets that cover as many edge cases as possible to avoid bias and blind spots. However, because of cost and time, machine learning teams also want to train the model on as little data as possible. Many ML teams have access to millions of images or videos and can't possibly train a model on all the data they have. Computation and cost constraints force them to decide which subset of data to send to the model for training.

In making this choice, machine learning teams need to select training data optimized for long-term model performance. If they act proactively, curating and selecting data to make the model more robust in real-world applications, they help ensure more predictable and consistent model performance. Having the right cocktail of data to match the out-of-sample distribution is the first step toward high-quality model performance. Better data selection and curation can help machine learning teams avoid the suboptimal-data failure mode, yet combing through large datasets to find the best selection of training data is challenging, inefficient, and not always practical.
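To make the idea concrete, here is a minimal sketch of distribution-matched subset selection. It is not Encord's method; it assumes you already have per-item metadata (a hypothetical `condition` tag) and a target deployment distribution, and it simply samples a fixed budget of items to match that distribution.

```python
import random
from collections import defaultdict

def select_subset(items, key, target_dist, budget, seed=0):
    """Sample roughly `budget` items so the chosen subset matches
    `target_dist`, a mapping from attribute value to desired fraction."""
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for item in items:
        by_value[key(item)].append(item)

    selected = []
    for value, fraction in target_dist.items():
        pool = by_value.get(value, [])
        n = min(len(pool), round(fraction * budget))  # can't take more than we have
        selected.extend(rng.sample(pool, n))
    return selected

# Hypothetical usage: a 10,000-image budget matched to expected deployment conditions.
# images = [{"path": "img_001.jpg", "condition": "night"}, ...]
# subset = select_subset(
#     images, key=lambda x: x["condition"],
#     target_dist={"day": 0.6, "night": 0.3, "rain": 0.1}, budget=10_000,
# )
```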
Problem Two: Poor Quality and Inconsistent Annotations

As ML teams have gained access to more and more data, they have found ways to label it more efficiently. Unfortunately, as the amount of labeled data grew, the label quality problem began to reveal itself. A recent study showed that 10 of the most cited AI datasets have serious labeling errors: the famous ImageNet test set has an estimated label error rate of 5.8 percent. A model's performance is not only a function of the amount of training data but also of the quality of that data's annotations. Poorly labeled data results in a variety of model errors, such as miscategorization and geometric errors.

In use cases with a high sensitivity to error, where the consequences of a model's mistake are severe, such as autonomous vehicles and medical diagnosis, labels must be specific and accurate; there is no room for mistakes. Labels also need to be consistent across datasets. Inconsistency in how the data is labeled can confuse the model and harm performance, and inconsistencies often arise when many different annotators work on a dataset. The distribution of the labels matters as well: just like the data layer, the label layer needs to reflect the balance and representation of the distribution the model encounters in the real world.

Unfortunately, finding labeling errors and inconsistencies is difficult; they are often too subtle to spot in large datasets. Because labeling errors are as varied as the labelers themselves, label quality tends to be checked by human reviewers, so annotation review becomes a time-consuming process that is prone to human error of its own.
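One common way to surface likely label errors at scale, in the spirit of the confident-learning approach behind the dataset-error study cited above, is to rank examples by how little probability a model assigns to their given label. The sketch below is a simplified heuristic rather than that study's full method or Encord's approach, and it assumes you already have integer labels and out-of-sample predicted probabilities (for example, from cross-validation).

```python
import numpy as np

def flag_possible_label_errors(labels, pred_probs, top_k=100):
    """Rank examples by the probability the model assigns to the annotated
    label; the lowest-scoring examples are candidates for human review.

    labels:     (n,) integer class labels from annotators
    pred_probs: (n, n_classes) out-of-sample predicted probabilities
    """
    given_label_prob = pred_probs[np.arange(len(labels)), labels]
    ranked = np.argsort(given_label_prob)  # most suspicious first
    return ranked[:top_k], given_label_prob[ranked[:top_k]]

# Hypothetical usage: send the 100 most suspicious annotations to reviewers first.
# suspect_idx, confidence_in_label = flag_possible_label_errors(labels, pred_probs)
```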
Problem Three: Determining and Correcting For Model Edge Cases

After curating the best quality data and fixing its labels, ML teams need to evaluate their models with respect to the data. Taking a data-centric approach, they should seek out the failure modes of the model within the data distributions. To improve the model iteratively, they need to find the areas in which it is struggling. A model runs on many scenarios, so the ML team needs to find the subset of scenarios on which the model isn't doing a good job. Pinpointing the specific area in which the model struggles gives the team actionable insight for intervening and improving performance. For instance, a computer vision model might perform well on bright images but not on dark ones. To improve performance, the team first has to diagnose that the model struggles with dark images and then increase the number of dark images the model trains on.

Unfortunately, ML engineers tend to see only global metrics that summarize model performance across a wide swath of data. To improve model performance efficiently, they need to decompose performance granularly by specific data features so they can make targeted interventions that improve the composition of the dataset and, by extension, model performance.
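As an illustration of this kind of decomposition, here is a minimal pandas sketch that buckets evaluation results by a precomputed brightness score and reports accuracy per bucket instead of a single global number. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical per-image evaluation table: a precomputed brightness score
# and whether the model's prediction was correct for that image.
results = pd.DataFrame({
    "brightness": [0.90, 0.80, 0.20, 0.15, 0.50, 0.10, 0.70, 0.30],
    "correct":    [1,    1,    0,    0,    1,    0,    1,    1],
})

# Bucket the feature, then compare accuracy per bucket rather than globally.
results["brightness_bucket"] = pd.cut(
    results["brightness"], bins=[0.0, 0.33, 0.66, 1.0],
    labels=["dark", "medium", "bright"],
)
per_bucket = results.groupby("brightness_bucket", observed=True)["correct"].agg(["mean", "count"])
print(per_bucket)  # a low 'mean' in the 'dark' bucket flags an edge case to fix with more dark images
```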
Encord Active: Using One Methodology to Diagnose and Fix Different Failure Modes

Overcoming these three failure modes may seem daunting, but the key is a better understanding of the training data: where it is succeeding and where it is failing. Encord Active, Encord's open-source active learning tool, uses one methodology to provide interventions for improving data and label quality across each failure mode. By running Encord Active on their data, users can better understand the makeup of their training data and how that makeup influences model performance.

Encord Active uses different metric functions to parametrize the data, constructing metrics on different data features. It then compiles these metrics into an index that provides information for each individual feature. With this metric methodology, users can run different diagnoses to gain insights about data quality, data labels, and model performance with respect to the data, receiving indexes for the features most relevant to each intervention area.

For data quality, Encord Active provides information about data-level features such as brightness, color, aspect ratio, blurriness, and more. Users can dive into the index of an individual feature to explore the quality of the data along that feature in more depth. For instance, they can look at the brightness distribution across their dataset and visualize the data with respect to that parameter. They can examine the feature distribution and outliers within their dataset and set appropriate thresholds to filter and slice the data based on the feature.

When it comes to labels, the tool works similarly across a different set of parametrized features. For instance, users can see the distribution of classes, label size, annotation placement within a frame, and more. They can also examine label quality: Encord Active provides scores based on unsupervised methods that reflect whether a label, such as the placement of a bounding box, is likely to be considered high quality or low quality by a reviewer.

Finally, for those interested in seeing how their model performs with respect to the data, Encord Active breaks down model performance as a function of these different data and label features. For instance, users can evaluate a model's performance based on brightness, "redness," object size, or any metric in the system. Encord Active may show that a model's true positive rate on images with small objects is low, suggesting that the model struggles to find small objects. With this information, a user can make informed, targeted improvements by increasing the number of images containing small objects in the training dataset. Because Encord Active automatically breaks down which features matter most for model performance, users can focus on improving the model by adjusting their datasets with respect to those features.

Encord Active already contains indexes for many different features. However, because the product is open source, users can build out indexes for the features most relevant to their particular use cases, parametrizing the data as specifically as necessary to accomplish their objectives. By writing their own metric functions, they can ensure that the data is broken down in the manner most suited to their needs before using Encord Active's visualization interface to audit and improve it.
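To give a feel for the metric-function idea, here is a hypothetical, framework-agnostic sketch (it does not use Encord Active's actual API) that scores each image with a simple blurriness metric, the variance of the Laplacian, and builds a sorted index so the most suspect images surface first.

```python
import cv2  # OpenCV

def blurriness_metric(image_path: str) -> float:
    """Variance of the Laplacian is a common sharpness proxy;
    lower values suggest blurrier images."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def build_index(image_paths, metric):
    """Score every image with a metric function and sort the results so the
    lowest-scoring (most suspect) items can be reviewed or filtered first."""
    scores = {path: metric(path) for path in image_paths}
    return sorted(scores.items(), key=lambda kv: kv[1])

# Hypothetical usage:
# import glob
# worst_first = build_index(glob.glob("training_data/*.jpg"), blurriness_metric)
```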
To close the proof-of-concept-to-production gap, models need better training datasets. Encord provides the tools that help companies build, analyze, and manage better computer vision training datasets. Sign up for early access to Encord Active.

*This post was written by Eric Landau, CEO of Encord. We thank Encord for their support of TheSequence.