📝 Guest post: How to Prioritize Data Quality for Computer Vision: An Expert Primer*
With the rise of the data-centric AI movement (of which computer vision is a subset), the spotlight has been shifting from algorithm design to dataset development. For many modern neural network architectures, data is the largest contributor to model performance. Adding layers to the network, introducing skip connections, or tuning certain hyperparameters has only a limited effect on model performance. Many practitioners spend countless hours creating and curating labeled data to train state-of-the-art architectures, at the expense of algorithm development. Additionally, dataset creation is one of the most costly and demanding components of the entire computation pipeline. Therefore, good practices for data quality are critical to ensuring successful outcomes.
Why Have Data Quality Solutions Become Essential for Computer Vision?
In short, the growing importance of analytics and ML applications demands modern data quality solutions:
Labeled datasets are among the most desired assets computer vision practitioners seek.
The 6 Dimensions of Data Quality
As a concept, data is of high quality if it fits the intended purpose of use. In the context of ML, data is of high quality if it correctly represents the real-world construct that the data describes, meaning that it is representative of the underlying population and scenarios. While good quality differs from case to case, there are common dimensions of data quality that can be measured.
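To make "dimensions that can be measured" concrete, here is a minimal sketch in Python. It assumes completeness and uniqueness are two of the dimensions of interest (both are commonly cited in data quality literature, but they are an assumption here, not a list taken from this article), and it adds a simple label-distribution check as a rough proxy for representativeness. The `annotations` records and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical annotation records: each maps an image file to a class label.
annotations = [
    {"image": "img_001.jpg", "label": "cat"},
    {"image": "img_002.jpg", "label": "dog"},
    {"image": "img_003.jpg", "label": None},   # missing label
    {"image": "img_002.jpg", "label": "dog"},  # duplicate record
    {"image": "img_004.jpg", "label": "cat"},
]


def completeness(records):
    """Fraction of records that carry a non-empty label."""
    if not records:
        return 0.0
    labeled = sum(1 for r in records if r.get("label"))
    return labeled / len(records)


def uniqueness(records):
    """Fraction of records whose image path is distinct."""
    if not records:
        return 0.0
    return len({r["image"] for r in records}) / len(records)


def label_distribution(records):
    """Share of each class among the labeled records."""
    counts = Counter(r["label"] for r in records if r.get("label"))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


print(completeness(annotations))
print(uniqueness(annotations))
print(label_distribution(annotations))
```

Scores like these are only a starting point: which dimensions matter, and what thresholds are acceptable, depends on the intended use of the dataset.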
The State of Data Quality
Data teams identify data quality as their primary KPI, yet they lack the tools and processes to manage it. It is therefore not surprising that they are weighed down by manual work: without proper automation, routine tasks such as testing changes to ETL code or tracing data dependencies can take days. Before using data in their work, team members have to write ad-hoc data quality checks or ask others. Only a few teams use automated tests and data catalogs as a source of truth for data quality.
From my research, few data quality tools deal with unstructured visual data. All of the tools mentioned above only deal with structured tabular data. Therefore, there is an emerging opportunity to design such a tool, given the untapped potential of visual data, which has a larger footprint than structured data and is powering more novel computer vision applications.
Designing A Data Quality Tool For Computer Vision
Should we care about the quality of our visual datasets? If the goal is to build algorithms that can understand the visual world, having high-quality datasets will be crucial. We outline below three recommendations for designing a data quality tool for computer vision.
1 - Detect and Avoid Bias
Torralba and Efros, 2011
To minimize the effects of bias during dataset construction, a data quality tool for computer vision should be able to:
2 - Tackle Quality Aspects
To solve the issues associated with the aspects mentioned above, a data quality tool for computer vision should be capable of:
3 - Offer Visual Analyses
Alsallakh et al., 2022
To improve understanding of computer vision datasets, a data quality tool for computer vision should offer the visual analysis techniques mentioned above:
Conclusion
Understanding the quality of the data used to train a model, the clarity of the labeling process, and the strengths and weaknesses of the ground-truth data used to evaluate models will lead to increased traceability, verification, and transparency in computer vision systems. In this article, we have given a tour of the data quality tooling landscape and proposed ideas for designing a robust data quality tool for computer vision applications.