🟢⚪️ Edge#210: Hopsworks 3.0, Connecting Python to the Modern Data Stack
On Thursdays, we deep dive into one of the freshest research papers or technology frameworks that is worth your attention. Our goal is to keep you up to date with new developments in AI and introduce you to the platforms that deal with ML challenges.

💥 Deep Dive: Hopsworks 3.0, Connecting Python to the Modern Data Stack

The rise of big data, cloud computing, and artificial intelligence (AI) is transforming the way businesses operate. The ability to collect, store, and process large amounts of data is opening up new opportunities for insights and decision-making. The so-called “Modern Data Stack” (MDS) is a suite of frameworks and tools that has emerged in recent years to help businesses take advantage of these opportunities. The Modern Data Stack includes data lakes and warehouses, ETL and reverse ETL tools, orchestration, monitoring, and much more.

One vast unfilled need in the MDS is enterprise AI. Machine learning is dominated by Python tools and libraries, which are insufficient for the increasingly demanding needs of data scientists. Attempts have been made to transpile Python code to SQL for the MDS, but these efforts fall short, as data scientists are unlikely to want to perform dimensionality reduction, variable encoding, and model training and evaluation in user-defined functions and SQL.

In this deep dive, we will look at the difference between the MDS and ML worlds, and at how Hopsworks tackles the problem by building a Python-to-SQL bridge.

MDS: a SQL-centric paradigm

The Modern Data Stack is heavily focused on SQL. This declarative language is relatively straightforward to learn and use, and it is the lingua franca for interacting with data warehouses, lakes, and BI tools. SQL is also well suited for distributed computing, which is necessary for processing large amounts of data.
Further, the declarative nature of SQL makes it easier to scale out compute to process large volumes of data than it is with a general-purpose programming language like Python, which lacks native distributed computing support. As the Modern Data Stack continues to grow and evolve, it is clear that there is a need for a Python-to-SQL bridge that will allow data scientists to take advantage of the benefits of both worlds.

Machine Learning: a Python-centric world

While the world of analytics is dominated by SQL, the world of machine learning is dominated by Python. This programming language has become the de facto standard for data science due to its flexibility, ease of use, and rich set of libraries and frameworks. Python's grip on machine learning is so pervasive that Stack Overflow's survey results from June 2022 show that Pandas, NumPy, TensorFlow, Scikit-Learn, and PyTorch are all in the top 11 of the most popular frameworks and libraries across all languages.

Python has shown itself to be flexible enough for use within notebooks for prototyping and reporting, for production workflows (such as in Airflow), for parallel processing (PySpark, Ray, Dask), and now even for data-driven user interfaces (Streamlit). In fact, even entire serverless ML systems with feature pipelines, batch prediction pipelines, and a user interface can be written in Python, as seen in this Surf Prediction System from PyData London.

Modern Data Stack vs Modern AI Stack: Closing The Gap

Countless machine learning models are never deployed: only about 10-15% make it into production. These sky-high failure rates are often attributed to a lack of talent or resources. In light of an ongoing labor shortage and tightening IT budgets, these explanations seem plausible, but they don't get to the root of the problem. The core issue lies in the lack of production-ready tools and infrastructure for machine learning within the MDS.
The majority of machine learning models are written in Python, but the production MDS is not designed to make it easy to productionize Python-based machine learning models. Data scientists and ML engineers are often left with prototypes that work on data dumps, which cannot be easily connected to the rest of the MDS. Without access to features within the MDS, these prototypes are severely limited and cannot take advantage of historical or contextual data. This lack of connectivity is a major reason why so many machine learning models never make it into production.

Meaningfully empowering data scientists means providing them with the tools and infrastructure they need to be successful. In particular, data scientists need to be able to access data within the MDS from Python without having to master the complexities of SQL and data access control.

The Feature Store is one part of the solution to this problem. It is a new layer that bridges some of the infrastructural gap. While the likes of Snowflake's Snowpark have tried to address this problem, they fall short because they don't provide a complete solution. Without its own Feature Store, Snowpark by itself is not enough.

The first Feature Stores were introduced by the Big Data community and were primarily designed for Spark and Flink. However, there has been a noticeable lack of a Python-centric Feature Store that bridges the gap between the SQL world and the Python world. Hopsworks addresses exactly that problem in order to empower data scientists to take full advantage of the Modern Data Stack.

Meet Hopsworks 3.0, the Python-centric feature store

Hopsworks was the first open-source feature store, released at the end of 2018. Now, with the version 3.0 release, it takes a big step toward bridging the Modern Data Stack and the machine learning stack in Python.
With improved Read and Write APIs for Python, Hopsworks 3.0 allows data scientists to work, share, and interact with production and prototype environments in a Python-centric manner. Hopsworks uses transpilation, or source-to-source compilation, to bring the power of SQL to its Python SDK. This enables the seamless transfer of data from warehouses to Python for feature engineering and model training. It also provides a Pandas DataFrame API for writing features and ensures the consistent replication of features between Online and Offline Stores.

Hopsworks 3.0 also comes with support for Great Expectations, a shared, open standard for data quality that allows for data validation in feature pipelines. Custom transformation functions can be written as Python user-defined functions and applied consistently between training and inference.

To use Hopsworks 3.0 without any infrastructure requirements, try their newly released serverless app.hopsworks.ai.
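To make the transpilation idea above more concrete, here is a minimal sketch of a Python-to-SQL bridge. It is not the Hopsworks SDK or its implementation. The `FeatureQuery` class, the `customer_features` table, and its columns are hypothetical names invented for illustration, and an in-memory SQLite database stands in for a data warehouse. The data scientist describes the features they want in Python, and the object emits parameterized SQL that the warehouse executes.

```python
import sqlite3
from dataclasses import dataclass, field

# Comparison operators this toy transpiler is willing to emit.
ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


@dataclass
class FeatureQuery:
    """Hypothetical Python-side description of a feature query."""
    table: str
    columns: list
    conditions: list = field(default_factory=list)  # (column, op, value) triples

    def where(self, column, op, value):
        """Return a new query with one more filter condition."""
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        return FeatureQuery(
            self.table, self.columns, self.conditions + [(column, op, value)]
        )

    def to_sql(self):
        """Transpile the Python description into SQL text plus bound parameters."""
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            clauses = [f"{col} {op} ?" for col, op, _ in self.conditions]
            sql += " WHERE " + " AND ".join(clauses)
        params = [value for _, _, value in self.conditions]
        return sql, params


def run_demo():
    # In-memory SQLite stands in for the warehouse (an assumption of this sketch).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer_features "
        "(customer_id INTEGER, avg_spend REAL, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO customer_features VALUES (?, ?, ?)",
        [(1, 120.0, "SE"), (2, 35.5, "SE"), (3, 80.0, "DE")],
    )
    query = (
        FeatureQuery("customer_features", ["customer_id", "avg_spend"])
        .where("country", "=", "SE")
        .where("avg_spend", ">", 50)
    )
    sql, params = query.to_sql()
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sql, params, rows


if __name__ == "__main__":
    sql, params, rows = run_demo()
    print(sql)
    print(rows)
```

The design choice worth noting is that filter values travel as bound parameters rather than being spliced into the SQL string, so the Python side never has to deal with quoting. A production feature store would additionally have to handle joins across feature groups, point-in-time correctness, and access control, all of which this sketch leaves out.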