🎙SuperAnnotate's CTO Vahan Petrosyan on the present and future of ML data labeling
It’s so inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
👤 Quick bio / Vahan Petrosyan
Vahan Petrosyan (VP): I am a co-founder and the CTO of SuperAnnotate, the basis of which was part of my Ph.D. research at KTH Royal Institute of Technology in 2018. During the early stages of my research, I was thinking of applying my segmentation algorithm to image editing, but after attending the Conference on Computer Vision and Pattern Recognition (CVPR) in 2018, I realized that there was a much bigger opportunity to apply my research to data labeling. Once my brother, who was a Ph.D. student in Biomedical Imaging in Switzerland, and I saw the opportunity, we both dropped out to start the company. Before my Ph.D. studies in ML, I studied various mathematics and statistics fields as an undergraduate and graduate student. In particular, I was interested in Financial and Actuarial Mathematics, Quantitative Economics, Statistics, and Data Visualization. My path to ML started ten years ago when I took an ML course with Prof. Adele Cutler, one of the co-creators of the legendary Random Forests algorithm.
🛠 ML Work
VP: Creating annotations is an important part of our business. While those annotations are done manually at the beginning, automation and the right data selection (namely, active learning) are areas that our customers are really excited about when using our platform. We are building the most complete platform, one that can not only efficiently create the ground truth for your unstructured data but also version and manage the created data and annotations. The latter becomes a lot more important for mature AI companies, since AI engineers can treat ground truth data much as software developers treat code. Therefore, just as GitHub became the bread and butter of software development, we are becoming the GitHub for ML engineers, where versioning and managing ground truth will be an integral part of any AI development.
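To make the "ground truth as code" analogy concrete, here is a minimal sketch of content-based versioning for annotations. It is not SuperAnnotate's implementation; the function name `annotation_version` and the annotation layout are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def annotation_version(annotations):
    """Return a short content hash identifying this exact set of annotations.

    Hypothetical helper: the annotation layout below is an assumption,
    not the format of any particular labeling platform.
    """
    # Sort keys and fix separators so identical content always serializes
    # to the same string, regardless of dictionary insertion order.
    canonical = json.dumps(annotations, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Two snapshots of the same image: the label was corrected from "car" to "truck".
v1 = {"img_001.jpg": [{"label": "car", "bbox": [10, 20, 110, 80]}]}
v2 = {"img_001.jpg": [{"label": "truck", "bbox": [10, 20, 110, 80]}]}

print(annotation_version(v1) == annotation_version(v2))  # the versions differ
```

Much like a commit hash, the identifier changes whenever any label changes, so a trained model can be tied to the exact ground truth it was trained on.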
VP: Active learning is extremely common no matter the data type you are annotating. Unfortunately, it is not yet easy to use active learning in complex AI tasks, such as multi-class instance segmentation or lidar segmentation. Labeling techniques can differ substantially across data types. In some complex video annotation cases, the annotation itself might be less time-consuming, and experts could spend more time finding the errors and fixing the annotation (i.e., quality assurance). This, generally, is not the case with image annotation, where you have the entire annotated image in front of you. For text annotation, people more often have different annotators perform the same task multiple times rather than doing a quality check on already annotated documents.
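One common form of active learning is uncertainty sampling: send annotators the items the current model is least sure about. The sketch below uses prediction entropy as the uncertainty score. The interview does not say which strategy SuperAnnotate uses, so this is an illustrative assumption, and the function names and sample probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` items whose predicted class distribution is most uncertain.

    predictions: dict mapping item id -> list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled images.
predictions = {
    "img_a": [0.98, 0.01, 0.01],  # confident prediction
    "img_b": [0.40, 0.35, 0.25],  # very uncertain
    "img_c": [0.70, 0.20, 0.10],  # somewhat uncertain
}

picked = select_for_labeling(predictions, budget=2)
print(picked)  # ['img_b', 'img_c']
```

For a simple classifier this is only a few lines; for tasks such as instance segmentation, defining a per-image uncertainty score is much less obvious, which is one reason active learning is harder to apply there.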
VP: Self-supervised and semi-supervised techniques are a great way to increase labeling quality. I am sure the research community will push the algorithms forward so that models learn faster from good data rather than big data. Such techniques should be integrated into, or be part of, next-generation labeling platforms. In my opinion, supporting such techniques and tightly integrating them with the right management and versioning systems will become one of the key components of any successful AI project.
VP: What we generally see is that synthetic data can improve model accuracy on certain computer vision tasks. Mixing with a simple 80-20 rule can be a really good place to start. However, generating complex scenes is often a lot harder and requires extremely detailed pixel-perfect annotations, which can be extremely time-consuming. Note that even when you use simulated data, you still need the right tools to subset, manage, and version your data. Therefore, no matter how you get your ground truth, efficient data and annotation management and versioning are critical for any successful AI project.
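As a rough sketch of the 80-20 mixing idea: the interview does not say which side of the split is synthetic, so the code below assumes 80% real and 20% synthetic data. The function name `mix_datasets` and its parameters are hypothetical.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction=0.2, seed=0):
    """Build a training list in which about `synthetic_fraction` of items are synthetic.

    Assumption: all real items are kept, and synthetic items are sampled
    to fill the remaining share of the mix.
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of synthetic items needed so they form the requested share of the total.
    n_synth = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    # Never request more synthetic items than are available.
    n_synth = min(n_synth, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


real = [f"real_{i}" for i in range(80)]
synthetic = [f"synth_{i}" for i in range(100)]

train = mix_datasets(real, synthetic)
print(len(train))  # 100 items: 80 real + 20 synthetic
```

The ratio is a starting point rather than a rule; the right mix is best found by evaluating the model on held-out real data.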
VP: Simple annotation editors, even open-source ones, can be found for any type of data. While providing simple editors is something many companies do, only a few are helping rapidly scaling startups and enterprise-grade clients build sophisticated ML pipelines. Therefore, I think that most companies will not survive once the golden venture times are over. However, there will be a few GitHub-scale platform solutions that will help ML engineers take care of their precious ground truth, the backbone of AI. As to the broader ML platforms you mentioned, the reality is they currently don’t do a great job of providing high-quality annotations or the right software to manage the resulting datasets. We frequently see clients who are woefully unsatisfied with incumbent solutions turn to us for much higher quality annotations, 5-10x faster time to model, and advanced Data and ML Ops. So I’m sure there’s a strong market need and demand for companies like SuperAnnotate to be successful.
💥 Miscellaneous – a set of rapid-fire questions
As a statistician at heart, I would pick Simpson's paradox as my favorite paradox.
Probably an old favorite: The Elements of Statistical Learning.
GPT-3 comes to mind when thinking about the alternatives, but we are still far away from passing the test. So yeah, the short answer is: Yes!
Hopefully yes, in our lifetime.