🟢⚪️ Edge#210: Hopsworks 3.0, Connecting Python to the Modern Data Stack
Was this email forwarded to you? Sign up here On Thursdays, we deep dive into one of the freshest research papers or technology frameworks that is worth your attention. Our goal is to keep you up to date with new developments in AI and introduce to you the platforms that deal with the ML challenges. 💥 Deep Dive: Hopsworks 3.0, Connecting Python to the Modern Data StackThe rise of big data, cloud computing, and artificial intelligence (AI) is transforming the way businesses operate. The ability to collect, store, and process large amounts of data is opening up new opportunities for insights and decision-making. The so-called “Modern Data Stack” (MDS) is a suite of frameworks and tools that has emerged in recent years to help businesses take advantage of these opportunities. The Modern Data Stack includes data lakes and warehouses, ETL and reverse tools, orchestration, monitoring, and much more. One vast unfilled need in the MDS is enterprise AI. Machine learning is dominated by Python tools and libraries, which are insufficient for the increasingly demanding needs of data scientists. Attempts to transpile Python code to SQL for the MDS have been made, but these efforts fall short, as data scientists are unlikely to want to perform dimensionality reduction, variable encodings, and model training/evaluation in user-defined functions and SQL. In this deep dive, we will take a look at the difference between the MDS and ML worlds, and how Hopsworks tackles the problem by building a Python-to-SQL bridge. MDS; a SQL-centric paradigmThe Modern Data Stack is heavily focused on SQL. This declarative language is relatively straightforward to learn and use, and it is the lingua franca for interacting with data warehouses, lakes, and BI tools. SQL is also well suited for distributed computing, which is necessary for processing large amounts of data. Further, the declarative nature of SQL makes it easier to scale out compute to process large volumes of data, compared to a general-purpose programming language like Python – which lacks native distributed computing support. As the Modern Data Stack continues to grow and evolve, it is clear that there is a need for a Python-to-SQL bridge that will allow data scientists to take advantage of the benefits of both worlds. Machine Learning; a Python-centric worldWhile the world of analytics is dominated by SQL, the world of machine learning is dominated by Python. This programming language has become the de facto standard for data science due to its flexibility, ease of use, and rich set of libraries and frameworks. Python's grip on machine learning is so pervasive that Stack Overflow's survey results from June 2022 show that Pandas, NumPy, TensorFlow, Scikit-Learn, and PyTorch are all in the top 11 of the most popular frameworks and libraries across all languages. Python has shown itself to be flexible enough for use within notebooks for prototyping and reporting, for production workflows (such as in Airflow), for parallel processing (PySpark, Ray, Dask), and now even for data-driven user interfaces (Streamlit). In fact, even entire serverless ML systems with feature pipelines, batch prediction pipelines, and a user interface can be written in Python, such as seen in this Surf Prediction System from PyData London. Modern Data Stack vs Modern AI Stack: Closing The GapCountless machine learning models are never deployed into production. Only about 10-15% make it into production. These sky-high failure rates are often attributed to a lack of talent or resources. In light of an ongoing labor shortage and tightening IT budgets, these explanations seem plausible, but they don't get to the root of the problem. The core issue lies in the lack of production-ready tools and infrastructure for machine learning within the MDS. The majority of machine learning models are written in Python, but the production MDS stack is not designed to make it easy to productionize Python-based machine learning models. Data scientists and ML engineers are often left with prototypes that work on data dumps, which cannot be easily connected to the rest of the MDS. Without access to features within the MDS, these prototypes are severely limited and cannot take advantage of historical or contextual data. This lack of connectivity is a major reason why so many machine learning models never make it into production. Meaningfully empowering data scientists means providing them with the tools and infrastructure they need to be successful. In particular, data scientists need to be able to access data within the MDS from Python without having to master the complexities of SQL and data access control. The Feature Store is one part of the solution to this problem. It is a new layer that bridges some of the infrastructural gap. While the likes of Snowflake's Snowpark have tried to address this problem, they fall short because they don't provide a complete solution. Without its own Feature Store, Snowpark by itself is not enough. The first Feature Stores were introduced by the Big Data community and were primarily designed for Spark and Flink. However, there has been a noticeable lack of a Python-centric Feature Store that bridges the gap between the SQL world and the Python world. Hopsworks addresses exactly that problem in order to empower data scientists to take full advantage of the Modern Data Stack. Meet Hopsworks 3.0, the Python-centric feature storeHopsworks was the first open-source feature store, released at the end of 2018, and now with the version 3.0 release, it takes a big step to bridge the Modern Data Stack with the machine learning stack in Python. With improved Read and Write APIs for Python, Hopsworks 3.0 allows data scientists to work, share, and interact with production and prototype environments in a Python-centric manner. Hopsworks uses transpilation, or source-to-source compilation, to bring the power of SQL to their Python SDK. This enables the seamless transfer of data from warehouses to Python for feature engineering and model training. It also provides a Pandas DataFrame API for writing features and ensures the consistent replication of features between Online and Offline Stores. Hopsworks 3.0 also comes with support for Great Expectations, a shared, open standard for data quality that allows for data validation in feature pipelines. Custom transformation functions can be written as Python user-defined functions and applied consistently between training and inference. To use Hopsworks 3.0 without any infrastructure requirements, try their newly released serverless app.hopsworks.ai. You’re on the free list for TheSequence Scope and TheSequence Chat. For the full experience, become a paying subscriber to TheSequence Edge. Trusted by thousands of subscribers from the leading AI labs and universities. |
Older messages
📌 Event: Join us at The Future of Data-Centric AI 2022 — a free virtual event by Snorkel AI
Wednesday, July 20, 2022
We're excited to partner with Snorkel AI on The Future of Data-Centric AI, a free two-day virtual event on August 3-4 that will cover the latest data-centric approaches to AI application
🔂 Edge#209: A New Series About ML Testing
Tuesday, July 19, 2022
Welcome to our premium newsletter that helps you learn ML concepts and focuses on the projects that move the AI industry forward. The content is trusted by the main AI labs, universities, enterprises,
📌 Event: A dive into continuous training automation – webinar by Superwise
Monday, July 18, 2022
Join us on August 9th for a live coding session as we build out a continuous MLOps pipeline. We'll start with the ML pipeline and see how we can detect performance degradation and data drift in
🤖👩🏼🎨 Meta Steps Into Generative Art with Make-A-Scene
Sunday, July 17, 2022
Subscribe today if you haven't yet!
⚡️ Last day 50% OFF
Saturday, July 16, 2022
Don't miss out
You Might Also Like
Import AI 399: 1,000 samples to make a reasoning model; DeepSeek proliferation; Apple's self-driving car simulator
Friday, February 14, 2025
What came before the golem? ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Defining Your Paranoia Level: Navigating Change Without the Overkill
Friday, February 14, 2025
We've all been there: trying to learn something new, only to find our old habits holding us back. We discussed today how our gut feelings about solving problems can sometimes be our own worst enemy
5 ways AI can help with taxes 🪄
Friday, February 14, 2025
Remotely control an iPhone; 💸 50+ early Presidents' Day deals -- ZDNET ZDNET Tech Today - US February 10, 2025 5 ways AI can help you with your taxes (and what not to use it for) 5 ways AI can help
Recurring Automations + Secret Updates
Friday, February 14, 2025
Smarter automations, better templates, and hidden updates to explore 👀 ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
The First Provable AI-Proof Game: Introducing Butterfly Wings 4
Friday, February 14, 2025
Top Tech Content sent at Noon! Boost Your Article on HackerNoon for $159.99! Read this email in your browser How are you, @newsletterest1? undefined The Market Today #01 Instagram (Meta) 714.52 -0.32%
GCP Newsletter #437
Friday, February 14, 2025
Welcome to issue #437 February 10th, 2025 News BigQuery Cloud Marketplace Official Blog Partners BigQuery datasets now available on Google Cloud Marketplace - Google Cloud Marketplace now offers
Charted | The 1%'s Share of U.S. Wealth Over Time (1989-2024) 💰
Friday, February 14, 2025
Discover how the share of US wealth held by the top 1% has evolved from 1989 to 2024 in this infographic. View Online | Subscribe | Download Our App Download our app to see thousands of new charts from
The Great Social Media Diaspora & Tapestry is here
Friday, February 14, 2025
Apple introduces new app called 'Apple Invites', The Iconfactory launches Tapestry, beyond the traditional portfolio, and more in this week's issue of Creativerly. Creativerly The Great
Daily Coding Problem: Problem #1689 [Medium]
Friday, February 14, 2025
Daily Coding Problem Good morning! Here's your coding interview problem for today. This problem was asked by Google. Given a linked list, sort it in O(n log n) time and constant space. For example,
📧 Stop Conflating CQRS and MediatR
Friday, February 14, 2025
Stop Conflating CQRS and MediatR Read on: my website / Read time: 4 minutes The .NET Weekly is brought to you by: Step right up to the Generative AI Use Cases Repository! See how MongoDB powers your