📝 Guest post: Using One Methodology to Solve The Three Failure Modes
In this guest post, Eric Landau, CEO of Encord, discusses the three major model failure modes that prevent models from reaching production and shows how to solve all three problems with a single methodology.

As many ML engineers can attest, the performance of all models, even the best ones, depends on the quality of their training data. As the world moves further into the age of data-centric AI, improving the quality of training datasets becomes increasingly important, especially if we hope to see more models make the transition from proof of concept to production. Even now, most models fail to make this transition. The technology is mostly ready, but bridging the proof-of-concept-to-production gap depends on fixing the training data quality problem.

With the appropriate interventions, however, machine learning teams can improve the quality of their training datasets and overcome the three major model failure modes that keep models out of production: suboptimal data, labeling errors, and model edge cases. By understanding the different problems that hold models back from production quality, ML engineers can intervene early on and solve all three with a single methodology.

Problem One: Suboptimal Curation and Selection

Every machine learning team knows that the data a model trains on has a significant impact on the model's performance. Teams need to select and curate a sample of training data from a distribution that reflects what the model will encounter in the real world. Models need balanced datasets that cover as many edge cases as possible to avoid bias and blind spots. Because of cost and time constraints, however, machine learning teams also want to train the model on as little data as possible.

Many ML teams have access to millions of images or videos, and they can't possibly train a model on all of that data. Computation and cost constraints force them to decide which subset of data to send to the model for training. In making this choice, machine learning teams need to select training data optimized for long-term model performance. If they act proactively, curating and selecting data with the goal of making the model more robust for real-world application, they'll help ensure more predictable and consistent model performance. Having the right cocktail of data to match the out-of-sample distribution is the first step in ensuring high-quality model performance.

Better data selection and curation can help machine learning teams avoid the suboptimal data failure mode, yet combing through large datasets to find the best selection of data for training sets is challenging, inefficient, and not always practical.

Problem Two: Poor Quality and Inconsistent Annotations

As ML teams gained access to more and more data, they found ways to label it more efficiently. Unfortunately, as the amount of labeled data grew, the label quality problem began to reveal itself. A recent study showed that 10 of the most cited AI datasets have serious labeling errors: the famous ImageNet test set has an estimated label error rate of 5.8 percent. A model's performance is a function not only of the amount of training data but also of the quality of that data's annotations. Poorly labeled data results in a variety of model errors, such as miscategorization and geometric errors.
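Label errors at this scale are rarely found by eyeballing a dataset; they are usually surfaced programmatically and then routed to human review. As a rough illustration, here is a minimal sketch of a confident-learning-style heuristic (not Encord's method) that flags samples where a trained model confidently disagrees with the assigned label; the single global confidence threshold and the toy data are assumptions made for the example.

```python
import numpy as np

def flag_suspect_labels(pred_probs: np.ndarray, labels: np.ndarray,
                        confidence_threshold: float = 0.9) -> np.ndarray:
    """Return indices of samples whose assigned label disagrees with a
    confident model prediction. A crude proxy for an annotation review
    queue; real confident-learning tools use per-class thresholds rather
    than one global cut-off."""
    predicted = pred_probs.argmax(axis=1)   # model's best guess per sample
    confidence = pred_probs.max(axis=1)     # how sure the model is
    disagrees = predicted != labels         # prediction and label differ
    return np.where(disagrees & (confidence >= confidence_threshold))[0]

# Toy usage: 4 samples, 3 classes. Sample 2 is labeled class 0 but the
# model predicts class 2 with 92% confidence, so it gets flagged for review.
probs = np.array([[0.80, 0.10, 0.10],
                  [0.20, 0.70, 0.10],
                  [0.05, 0.03, 0.92],
                  [0.40, 0.35, 0.25]])
labels = np.array([0, 1, 0, 2])
print(flag_suspect_labels(probs, labels))   # -> [2]
```

The flagged indices give reviewers a much smaller queue to check than the full dataset, which is the practical point of any label-quality tooling.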
In use cases with a high sensitivity to error, where the consequences of a model's mistake are severe, such as autonomous vehicles and medical diagnosis, labels must be specific and accurate; there's no room for mistakes. Labels also need to be consistent across datasets. Inconsistency in the way data is labeled can confuse the model and harm performance, and inconsistencies often arise when many different annotators work on a dataset. The distribution of the labels matters as well: just like the data layer, the label layer needs balance and representation that match the distribution the model encounters in the real world.

Unfortunately, searching for and finding labeling errors and inconsistencies is difficult; they are often too subtle to spot in large datasets. Because labeling errors are as varied as the labelers themselves, human reviewers tend to check label quality, and annotation review becomes a time-consuming process that is prone to human error of its own.

Problem Three: Determining and Correcting for Model Edge Cases

After curating the best quality data and fixing its labels, ML teams need to evaluate their models with respect to that data. Taking a data-centric approach, they should seek out the failure modes of the model within the data distributions. To improve the model iteratively, they need to find the areas in which it's struggling. A model runs on many scenarios, so the ML team needs to find the subset of scenarios on which the model isn't doing a good job. Pinpointing the specific area in which the model is struggling gives the team actionable insight for intervening and improving performance.

For instance, a computer vision model might perform well on bright images but not on dark ones. To improve model performance, the team first has to diagnose that the model struggles with dark images and then increase the number of dark images the model trains on. Unfortunately, ML engineers tend to work from global metrics that summarize performance across a wide swath of data. To improve model performance efficiently, they need to decompose performance granularly by specific data features so they can make targeted interventions that improve the composition of the dataset and, by extension, model performance.

Encord Active: Using One Methodology to Diagnose and Fix Different Failure Modes

Overcoming these three failure modes may seem daunting, but the key is a better understanding of the training data: where it's succeeding and where it's failing. Encord Active, Encord's open-source active learning tool, uses one methodology to provide interventions for improving data and label quality across each failure mode.

By running Encord Active on their data, users can better understand the makeup of their training data and how that makeup influences model performance. Encord Active uses different metric functions to parametrize the data, constructing metrics on different data features. It then compiles these metrics into an index that provides information for each individual feature. With this metric methodology, users can run different diagnoses to gain insights about data quality, data labels, and model performance with respect to the data, receiving indexes for the features most relevant to each intervention area.

For data quality, Encord Active provides information about data-level features such as brightness, color, aspect ratio, blurriness, and more.
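To make the idea of a metric function concrete, here is a minimal sketch of what a data-quality metric like brightness could look like under the hood. It is a generic illustration rather than Encord Active's actual implementation; the folder name, the JPEG-only file pattern, and the two-standard-deviation outlier rule are assumptions made for the example.

```python
from pathlib import Path

import numpy as np
from PIL import Image

def brightness(image_path: Path) -> float:
    """Mean pixel intensity in [0, 1] after converting to grayscale."""
    gray = Image.open(image_path).convert("L")   # "L" = 8-bit grayscale
    return float(np.asarray(gray).mean() / 255.0)

def brightness_index(image_dir: str) -> dict[str, float]:
    """Score every image in a folder so the dataset can be sorted,
    sliced, or filtered by brightness."""
    return {p.name: brightness(p) for p in sorted(Path(image_dir).glob("*.jpg"))}

if __name__ == "__main__":
    index = brightness_index("train_images/")   # hypothetical folder
    scores = np.array(list(index.values()))
    # Flag images more than two standard deviations from the mean brightness
    # as candidates for review (likely too dark or overexposed).
    lo, hi = scores.mean() - 2 * scores.std(), scores.mean() + 2 * scores.std()
    outliers = [name for name, s in index.items() if not lo <= s <= hi]
    print(f"{len(outliers)} brightness outliers out of {len(index)} images")
```

Each image gets a score, the scores form an index that can be sorted or sliced, and the outliers become candidates for review or exclusion. The same pattern generalizes to color, aspect ratio, blur, and any other per-sample feature.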
Users can dive into the index of individual features to explore the quality of the data relevant to a particular feature in more depth. For instance, they can look at the brightness distribution across their dataset and visualize the data with respect to that parameter. They can examine the feature distribution and outliers within their dataset and set appropriate thresholds to filter and slice the data based on the feature.

When it comes to labels, the tool works similarly across a different set of parametrized features. For instance, users can see the distribution of classes, label size, annotation placement within a frame, and more. They can also examine label quality: Encord Active provides scores based on unsupervised methods that reflect whether a label, such as the placement of a bounding box, is likely to be considered high or low quality by a reviewer.

Finally, for those interested in seeing how their model performs with respect to the data, Encord Active breaks down model performance as a function of these different data and label features. For instance, users can evaluate a model's performance based on brightness, "redness," object size, or any metric in the system. Encord Active may show that a model's true positive rate on images with small objects is low, suggesting that the model struggles to find small objects. With this information, a user can make informed, targeted improvements to model performance by increasing the number of images containing small objects in the training dataset. Because Encord Active automatically breaks down which features matter most for model performance, users can focus on adjusting their datasets with respect to those features.

Encord Active already contains indexes for many different features. Because the product is open source, however, users can also build out indexes for the features most relevant to their particular use cases, parametrizing the data as specifically as necessary to accomplish their objectives. By writing their own metric functions, they can ensure that the data is broken down in a manner suited to their needs before using Encord Active's visualization interface to audit and improve it.
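To make that kind of performance breakdown concrete, here is a minimal sketch, in the spirit of a user-written metric, that buckets ground-truth objects by size and reports the fraction the model actually detected in each bucket. It is a generic illustration rather than Encord Active's code; the COCO-style size cut-offs and the toy detection results are assumptions made for the example.

```python
import numpy as np

def tp_rate_by_object_size(areas: np.ndarray, matched: np.ndarray,
                           edges=(0, 32**2, 96**2, float("inf"))) -> dict:
    """Fraction of ground-truth boxes the model found, bucketed by box area.
    `matched[i]` is True if ground-truth object i was matched by a detection.
    The small/medium/large cut-offs follow the common COCO convention."""
    names = ["small", "medium", "large"]
    rates = {}
    for name, lo, hi in zip(names, edges[:-1], edges[1:]):
        in_bucket = (areas >= lo) & (areas < hi)
        rates[name] = float(matched[in_bucket].mean()) if in_bucket.any() else float("nan")
    return rates

# Toy results: the model recovers the medium and large objects but misses
# most of the small ones, pointing at a concrete data intervention.
areas = np.array([20**2, 25**2, 30**2, 80**2, 150**2, 200**2])
matched = np.array([False, False, True, True, True, True])
print(tp_rate_by_object_size(areas, matched))
# -> {'small': 0.333..., 'medium': 1.0, 'large': 1.0}
```

A low rate in the small bucket points directly at an intervention: add or upsample images containing small objects, retrain, and rerun the breakdown to confirm the gap closes.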
To close the proof-of-concept-to-production gap, models need better training datasets. Encord provides the tools that help all companies build, analyze, and manage better computer vision training datasets. Sign up for early access to Encord Active.

*This post was written by Eric Landau, CEO of Encord. We thank Encord for their support of TheSequence.