🎙SuperAnnotate's CTO Vahan Petrosyan on the present and future of ML data labeling
It’s so inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
👤 Quick bio / Vahan Petrosyan
Vahan Petrosyan (VP): I am a co-founder and the CTO of SuperAnnotate, the basis of which was part of my Ph.D. research at KTH Royal Institute of Technology in 2018. During the early stages of my research, I was thinking of applying my segmentation algorithm to image editing, but after attending the Conference on Computer Vision and Pattern Recognition (CVPR) in 2018, I realized that there was a much bigger opportunity to apply my research to data labeling. Once my brother, who was a Ph.D. student in Biomedical Imaging in Switzerland, and I saw the opportunity, we both dropped out to start the company. Before my Ph.D. studies in ML, I studied various mathematics and statistics fields as an undergraduate and graduate student. In particular, I was interested in Financial and Actuarial Mathematics, Quantitative Economics, Statistics, and Data Visualization. My path to ML started ten years ago when I took an ML course with Prof. Adele Cutler, one of the co-creators of the legendary Random Forests algorithm.
🛠 ML Work
VP: Creating annotations is an important part of our business. While those annotations are done manually at the beginning, automation and the right data selection (namely, active learning) are the areas that our customers are most excited about when using our platform. We are building the most complete platform, one that can not only efficiently create the ground truth for your unstructured data but also version and manage the created data and annotations. The latter becomes a lot more important for mature AI companies, since AI engineers can treat ground truth data much as software developers treat code. Therefore, just as GitHub became the bread and butter of software development, we are becoming the GitHub of ML engineers, where versioning and managing ground truth will be an integral part of any AI development.
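To make the "ground truth as code" analogy concrete, here is a minimal sketch of content-based versioning for annotations. It is an illustration only, not SuperAnnotate's implementation: the function name `annotation_version`, the annotation layout, and the 12-character identifier length are all assumptions made for the example.

```python
import hashlib
import json

def annotation_version(annotations):
    """Return a short content hash identifying this exact set of annotations.

    Hypothetical helper: serializing with sorted keys makes the hash
    independent of dictionary key order, so only real content changes
    produce a new version id.
    """
    canonical = json.dumps(annotations, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Two snapshots of the same image's labels (made-up data).
v1 = {"img_001.jpg": [{"label": "car", "bbox": [10, 20, 110, 80]}]}
v2 = {"img_001.jpg": [{"label": "truck", "bbox": [10, 20, 110, 80]}]}

print(annotation_version(v1))
print(annotation_version(v2))  # differs from v1 because one label changed
```

Much like a commit hash, an identifier of this kind lets a training run record exactly which ground truth it was trained on.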
VP: Active learning is extremely common no matter which data type you are annotating. Unfortunately, it is not yet easy to use active learning in complex AI tasks, such as multi-class instance segmentation or lidar segmentation. Labeling techniques can differ substantially from one data type to another. In some complex video annotation cases, the annotation itself might be less time-consuming, and experts could spend more time finding errors and fixing the annotation (i.e., quality assurance). This is generally not the case with image annotation, where you have the entire annotated image in front of you. For text annotation, people more often have different annotators perform the same task multiple times rather than doing a quality check on already annotated documents.
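As a rough illustration of the data selection idea behind active learning, the sketch below ranks unlabeled items by the entropy of a model's predicted class probabilities and sends the most uncertain ones to annotators first. Uncertainty sampling is only one of several active learning strategies, and the function names, item ids, and probabilities here are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` most uncertain unlabeled items.

    `predictions` maps an item id to the model's predicted class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]

# Made-up model outputs for four unlabeled images and three classes.
predictions = {
    "img_001": [0.98, 0.01, 0.01],  # confident prediction, low priority
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
}

to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

For tasks such as instance segmentation, a single per-image score like this is much harder to define, which is part of why active learning is harder to apply there.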
VP: Self-supervised and semi-supervised techniques are a great way to increase labeling quality. I am sure the research community will push the algorithms forward so that models learn faster from good data rather than from big data. Such techniques should be integrated into, or be part of, the next generation of labeling platforms. In my opinion, supporting such techniques and tightly integrating them with the right management and versioning systems will become one of the key components of any successful AI project.
VP: What we generally see is that synthetic data can improve model accuracy on certain computer vision tasks. Mixing with a simple 80-20 rule can be a really good place to start. However, generating complex scenes is often a lot harder and requires extremely detailed, pixel-perfect annotations, which can be extremely time-consuming. Note that even when you use simulated data, you still need the right tools to subset, manage, and version your data. Therefore, no matter how you get your ground truth, efficient data and annotation management and versioning are critical for any successful AI project.
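A minimal sketch of such a mix follows. The interview does not say which side of the 80-20 split is synthetic, so this example assumes 80% real and 20% synthetic samples; the `mix_datasets` helper, its parameters, and the sample names are hypothetical.

```python
import random

def mix_datasets(real, synthetic, total, real_fraction=0.8, seed=0):
    """Draw a training set of `total` items with a fixed real/synthetic ratio.

    Assumption: `real_fraction=0.8` means 80% real and 20% synthetic data.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    n_real = round(total * real_fraction)
    n_synth = total - n_real
    if n_real > len(real) or n_synth > len(synthetic):
        raise ValueError("not enough samples to build the requested mix")
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
    rng.shuffle(mixed)
    return mixed

# Made-up sample identifiers standing in for real and simulated images.
real = [f"real_{i}" for i in range(1000)]
synthetic = [f"synth_{i}" for i in range(500)]

train = mix_datasets(real, synthetic, total=100)
print(len(train), sum(name.startswith("real_") for name in train))
```

Treating the ratio as a tunable parameter makes it easy to compare model accuracy across different mixes instead of committing to a single split.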
VP: Simple annotation editors can be found, even as open source, for any type of data. While many companies provide simple editors, only a few help rapidly scaling startups and enterprise-grade clients build sophisticated ML pipelines. Therefore, I think that most companies will not survive once the golden venture times are over. However, there will be a few GitHub-scale platform solutions that will help ML engineers take care of their precious ground truth, the backbone of AI. As for the broader ML platforms you mentioned, the reality is that they currently don’t do a great job of providing high-quality annotations or the right software to manage the resulting datasets. We frequently see clients who are woefully unsatisfied with incumbent solutions turn to us for much higher quality annotations, 5-10x faster time to model, and advanced Data and ML Ops. So I’m sure there is strong market demand for companies like SuperAnnotate to be successful.
💥 Miscellaneous – a set of rapid-fire questions
As a statistician at heart, I would pick Simpson's paradox as my favorite.
Probably an old favorite: The Elements of Statistical Learning.
GPT-3 comes to mind when thinking about the alternatives, but we are still far away from passing the test. So yeah, the short answer is: Yes!
Hopefully yes, in our lifetime.