🎙Ronen Dar, Run:AI's CTO, on managing computation resources in ML pipelines
It’s so inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching.

👤 Quick bio / Ronen Dar
Ronen Dar (RD): I did my bachelor’s, master’s, and Ph.D. at Tel Aviv University. I met Omri Geller (Run:AI’s CEO) there; he was working toward his master’s while I was on my Ph.D. Alongside my studies, I was also working in industry for a startup called Anobit Technologies, a maker of flash storage technology that Apple acquired during my time there. I stayed on at Apple for several years, and it was really fun having one foot in academia and the other in industry. But when I finished my Ph.D., I had to choose. At first, I chose academia and did my postdoc as a research scientist at Bell Labs in the US. My Ph.D. and postdoc showed me the importance of having easy access to computing power and what you can achieve as a researcher when you have access to large amounts of computing resources. Omri and I knew that we could put unlimited computing power into the hands of every researcher, so when we decided to start Run:AI, I made the switch from academia to becoming a founder and came back to Israel to work alongside Omri as the CTO.

🛠 ML Work
RD: There are two key challenges for AI development right now, and they’re only going to increase in importance as more and more companies start doing AI. The first is that AI adoption across enterprises in nearly every industry is driving demand for more powerful computing resources, GPUs, to provide the levels of computing power needed for AI at scale. The second is that it’s incredibly difficult to access the full computing power of these new GPUs, so many organizations are struggling with GPU allocation and orchestration. Omri and I saw a gap between the amount of computing power that GPUs can offer and the amount that current orchestration tools can access and provision. There’s a new software stack being assembled to deal with this issue, and we wanted to be in that stack. The vision of Run:AI is to accelerate AI-driven innovation in every industry by making it easy for researchers and IT to access and manage all their available computing power.
RD: For early-stage AI initiatives, there is a need to optimize the algorithms themselves. When you have just one algorithm, one workload running and consuming compute power, it's really difficult to optimize how that algorithm is using that computing power. There is also a challenge of optimization when you have a lot of workloads running across several GPUs. How will computing resources be shared when there are multiple workloads? How will you ensure that each researcher and team gets their fair share? How will you size each workload, and in the end, how will all of them fit together on one shared infrastructure? That isn't easy, and that's a different kind of optimization.
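The fair-share question above can be made concrete with a small sketch. This is not Run:AI's actual algorithm or API — just a minimal illustration of weighted max-min fairness, where each team's GPU demand is capped and leftover capacity flows to the most underserved team relative to its weight. Team names, demands, and weights are invented for the example.

```python
def fair_share(total_gpus, demands, weights):
    """Allocate whole GPUs by weighted max-min fairness.

    demands: dict of team -> GPUs requested.
    weights: dict of team -> relative priority (> 0).
    No team receives more than it asked for; each remaining GPU
    goes to the team with the lowest allocation-to-weight ratio.
    """
    alloc = {team: 0 for team in demands}
    remaining = total_gpus
    while remaining > 0:
        # Teams that still want more GPUs than they have been given.
        active = [t for t in demands if alloc[t] < demands[t]]
        if not active:
            break
        # Hand one GPU to the most underserved team relative to its weight.
        team = min(active, key=lambda t: alloc[t] / weights[t])
        alloc[team] += 1
        remaining -= 1
    return alloc

# Two teams oversubscribe a 6-GPU cluster; team "a" has double priority.
print(fair_share(6, {"a": 10, "b": 10}, {"a": 2, "b": 1}))  # {'a': 4, 'b': 2}
```

A real scheduler layers preemption on top of this, so a team temporarily borrowing idle GPUs above its fair share can be scaled back when the owning team returns.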
RD: Kubernetes is at the heart of what's going on today in the ML space. Right now, it's like the perfect storm is happening: there are new AI applications, so new kinds of workloads with new compute requirements. Then you also have those new computing resources, those GPUs, those deep learning accelerators. And on top of that, the world is shifting to cloud-native infrastructure. Companies are moving their infrastructure from a virtualized environment to containers, Kubernetes, and other cloud-native technologies. The problem when you put all these things together is that Kubernetes wasn't built to run compute-intensive AI workloads on this new hardware. It was built to run microservices on commodity CPUs. There are major gaps in what Kubernetes provides today. It lacks advanced preemption mechanisms that ensure fairness, and it doesn’t use multiple queues to efficiently orchestrate long-running jobs. In addition, K8s is missing gang scheduling for scaling up parallel AI workloads across multiple distributed nodes, and topology awareness for optimizing performance. Kubernetes clusters often end up with resources left idle for too long, while users find themselves limited in the compute power they can consume. The Run:AI scheduler sits on top of Kubernetes and specifically targets these shortcomings to provide a made-for-AI scheduling solution.
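Gang scheduling, one of the gaps mentioned above, means a distributed training job gets all of its workers placed at once or none of them — partial placement wastes GPUs and can deadlock when two half-placed jobs each wait for the other's resources. A minimal sketch of the all-or-nothing check (node names and capacities are illustrative; this is not the Run:AI scheduler or a Kubernetes API):

```python
def gang_schedule(job_workers, free_gpus_per_node):
    """Place all single-GPU workers of a job, or reject the whole job.

    job_workers: number of workers the job needs, one GPU each.
    free_gpus_per_node: dict of node name -> free GPU count.
    Returns a node -> worker-count placement, or None if the gang
    cannot fit anywhere in the cluster right now.
    """
    placement = {}
    needed = job_workers
    # Fill the emptiest nodes first to keep workers co-located
    # and reduce fragmentation.
    for node, free in sorted(free_gpus_per_node.items(),
                             key=lambda kv: -kv[1]):
        take = min(free, needed)
        if take:
            placement[node] = take
            needed -= take
        if needed == 0:
            return placement
    return None  # a partial placement could deadlock: schedule nothing

# A 4-worker job fits across two 2-GPU nodes; a 5-worker job is rejected
# outright rather than left half-running.
print(gang_schedule(4, {"n1": 2, "n2": 2}))
print(gang_schedule(5, {"n1": 2, "n2": 2}))
```

Stock Kubernetes, by contrast, schedules pods one at a time, which is exactly why a multi-pod training job can stall half-placed.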
RD: One big challenge with GPU management is that, unlike CPUs and the traditional applications running on CPU cores, GPUs are allocated to applications statically and exclusively. When applications start to run, they get static allocations of GPUs, and sharing those GPUs between multiple workloads is typically really inefficient. In the CPU world, there is virtualization, but with GPUs, you don't have that software layer with the ability to orchestrate workloads in a dynamic way. Manual allocation of GPUs leads to poor GPU utilization. Many organizations share with us that their typical GPU utilization is at 10-20%, with highly limited data science productivity. You need a software layer to allocate workloads to GPUs dynamically and really let the workloads share the GPUs dynamically and efficiently.
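The contrast between static and dynamic allocation can be sketched in a few lines. Here, jobs declare the fraction of a GPU they actually need, and a first-fit packer shares physical GPUs between them — a toy illustration of fractional sharing, not Run:AI's implementation; the job names and fractions are made up.

```python
def pack_jobs(jobs, num_gpus):
    """First-fit packing of fractional-GPU jobs onto physical GPUs.

    jobs: dict of job name -> fraction of one GPU it needs (0 < f <= 1).
    Returns job -> GPU index; raises if no GPU has room for a job.
    """
    free = [1.0] * num_gpus          # remaining capacity per GPU
    placement = {}
    for name, frac in jobs.items():
        for gpu, capacity in enumerate(free):
            if frac <= capacity + 1e-9:   # tolerance for float rounding
                free[gpu] -= frac
                placement[name] = gpu
                break
        else:
            raise RuntimeError(f"no GPU can fit job {name!r}")
    return placement

# Four light jobs that would statically claim four whole GPUs
# fit on two GPUs when packed by actual demand.
print(pack_jobs({"a": 0.5, "b": 0.25, "c": 0.5, "d": 0.25}, 2))
```

Static exclusive allocation would pin each of these four jobs to its own GPU, leaving 62% of that capacity idle — which is roughly the 10-20% utilization story from above.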
RD: Yeah, we do see a mismatch between ML hardware and software architectures, in that the software just doesn't answer all of the workloads’ needs. The existing software layers are tied to specific hardware, making it really difficult for new hardware to come in and integrate with the existing ML architecture. Hardware companies are investing heavily in their software stacks so that they can integrate with existing software architectures, but it's really, really difficult. That’s where Run:AI is trying to help. We're building our software architecture to fit any AI hardware. It’s important for us to be neutral and able to support any hardware. We think that is key to enabling innovation and beneficial competition in the AI hardware space.

💥 Recommended book

If you’re an ML engineer who aspires to become a founder, check out The Hard Thing About Hard Things by Ben Horowitz. He co-founded a company and sold it to HP at a value of more than $1 billion. Then he co-founded the venture capital firm Andreessen Horowitz. He’s one of the most famous VCs in the AI world, and in the book he shares his experiences building and running a company.