🎙SuperAnnotate's CTO Vahan Petrosyan on the present and future of ML data labeling
It's so inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration.

👤 Quick bio / Vahan Petrosyan
Vahan Petrosyan (VP): I am a co-founder and the CTO of SuperAnnotate, which grew out of part of my Ph.D. research at KTH Royal Institute of Technology in 2018. During the early stages of my research, I was thinking of applying my segmentation algorithm to image editing, but after attending the Conference on Computer Vision and Pattern Recognition (CVPR) in 2018, I realized that there was a much bigger opportunity to apply my research to data labeling. Once my brother, who was a Ph.D. student in Biomedical Imaging in Switzerland, and I saw the opportunity, we both dropped out to start the company. Before my Ph.D. studies in ML, I studied various fields of mathematics and statistics as an undergraduate and graduate student. In particular, I was interested in Financial and Actuarial Mathematics, Quantitative Economics, Statistics, and Data Visualization. My path to ML started ten years ago when I took an ML course with Prof. Adele Cutler, one of the co-creators of the legendary Random Forests algorithm.

🛠 ML Work
VP: Creating annotations is an important part of our business. While those annotations are created manually at first, automation and the right data selection (namely, active learning) are the areas our customers are most excited about when using our platform. We are building the most complete platform, one that can not only efficiently create the ground truth for your unstructured data but also version and manage the resulting data and annotations. The latter becomes much more important for mature AI companies, since ground truth data is to AI engineers what code is to software developers. Therefore, just as GitHub became the bread and butter of software development, we are becoming the GitHub for ML engineers, where versioning and managing ground truth will be an integral part of any AI development.
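To make the "ground truth as code" analogy concrete, here is a minimal sketch of content-based versioning for annotations. It is not SuperAnnotate's implementation; the function name `annotation_version` and the annotation fields are hypothetical, chosen only for illustration. The idea is that any change to a label yields a new version identifier, much as a commit hash changes when code changes.

```python
import hashlib
import json


def annotation_version(annotations):
    """Return a short content hash for a list of annotation dicts.

    Hypothetical helper: serializes the annotations canonically
    (sorted keys, fixed separators) so that the same content always
    produces the same identifier, regardless of key order.
    """
    canonical = json.dumps(annotations, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Illustrative annotation records (made-up fields and values).
v1 = [{"image": "img_001.png", "label": "car", "bbox": [10, 20, 110, 80]}]
v1_reordered = [{"bbox": [10, 20, 110, 80], "label": "car", "image": "img_001.png"}]
v2 = [{"image": "img_001.png", "label": "truck", "bbox": [10, 20, 110, 80]}]

print(annotation_version(v1) == annotation_version(v1_reordered))  # same content
print(annotation_version(v1) == annotation_version(v2))            # label changed
```

A real system would also need to store the history and diffs between versions; the hash only answers "did the ground truth change?".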
VP: Active learning is extremely common no matter which data type you are annotating. Unfortunately, it is not yet easy to use active learning in complex AI tasks, such as multi-class instance segmentation or lidar segmentation. Labeling techniques can differ substantially across data types. In some complex video annotation cases, the annotation itself might be less time-consuming, and experts could spend more of their time finding errors and fixing the annotations (i.e., quality assurance). This is generally not the case with image annotation, where you have the entire annotated image in front of you. For text annotation, people more often have several annotators perform the same task independently rather than running a quality check on already annotated documents.
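As a rough illustration of what active learning means in practice, the sketch below uses uncertainty sampling, one common strategy among several: the items whose predicted class probabilities have the highest entropy are sent to annotators first. The function names and the probabilities are assumptions made for this example, not anything described in the interview.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` item ids the model is least certain about.

    `predictions` maps an item id to the model's predicted class
    probabilities for that item (hypothetical data layout).
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Made-up model outputs for three unlabeled images.
predictions = {
    "img_a": [0.98, 0.01, 0.01],  # confident prediction
    "img_b": [0.40, 0.35, 0.25],  # very uncertain
    "img_c": [0.70, 0.20, 0.10],  # somewhat uncertain
}

chosen = select_for_labeling(predictions, budget=2)
print(chosen)  # ['img_b', 'img_c']
```

For classification this scoring is straightforward; for tasks such as instance segmentation there is no single probability vector per image, which is one reason the approach is harder to apply there.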
VP: Self-supervised and semi-supervised techniques are a great way to increase labeling quality. I am sure the research community will push the algorithms forward so that models learn faster from good data rather than big data. Such techniques should be integrated into, or be part of, next-generation labeling platforms. In my opinion, supporting such techniques and tightly integrating them with the right management and versioning systems will become one of the key components of any successful AI project.
VP: What we generally see is that synthetic data can improve model accuracy on certain computer vision tasks. Mixing it with real data using a simple 80-20 rule can be a really good place to start. However, generating complex scenes is often a lot harder and requires extremely detailed, pixel-perfect annotations, which can be extremely time-consuming. Note that even when you use simulated data, you still need the right tools to subset, manage, and version it. Therefore, no matter how you get your ground truth, efficient data and annotation management and versioning are critical for any successful AI project.
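A minimal sketch of such an 80-20 mix follows. The interview does not say which side of the split is synthetic, so this example assumes 80% real and 20% synthetic samples; the function name `mix_datasets` and its parameters are hypothetical.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.2, seed=0):
    """Combine all real samples with enough synthetic ones to reach
    the requested synthetic fraction of the final training set.

    Assumption: the 80-20 rule is read as 80% real, 20% synthetic.
    """
    rng = random.Random(seed)
    # Solve n_synth / (len(real) + n_synth) = synthetic_fraction.
    n_synth = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic))  # cannot draw more than we have
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


# Placeholder sample ids standing in for real and simulated images.
real = [f"real_{i}" for i in range(80)]
synthetic = [f"synth_{i}" for i in range(100)]

mixed = mix_datasets(real, synthetic)
print(len(mixed))                                   # 100
print(sum(x.startswith("synth_") for x in mixed))   # 20
```

The ratio is only a starting point; the right mix depends on the task and is usually tuned against a validation set of real data.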
VP: Simple annotation editors are available, even as open source, for any type of data. While many companies provide simple editors, only a few help rapidly scaling startups and enterprise-grade clients build sophisticated ML pipelines. Therefore, I think that most companies will not survive once the golden venture times are over. However, there will be a few GitHub-scale platform solutions that help ML engineers take care of their precious ground truth, the backbone of AI. As to the broader ML platforms you mentioned, the reality is that they currently don't do a great job of providing high-quality annotations or the right software to manage the resulting datasets. We frequently see clients who are woefully unsatisfied with incumbent solutions turn to us for much higher-quality annotations, 5-10x faster time to model, and advanced Data and ML Ops. So I'm sure there is strong market demand for companies like SuperAnnotate to be successful.

💥 Miscellaneous – a set of rapid-fire questions
Being a statistician at heart, Simpson's paradox comes first as my favorite.
Probably an old favorite: The Elements of Statistical Learning.
GPT-3 comes to mind when thinking about the alternatives, but we are still far away from passing the test. So yeah, the short answer is: Yes!
Hopefully yes, in our lifetime.