🟢⚪️ Edge#210: Hopsworks 3.0, Connecting Python to the Modern Data Stack
On Thursdays, we take a deep dive into one of the freshest research papers or technology frameworks worth your attention. Our goal is to keep you up to date with new developments in AI and introduce you to the platforms that deal with ML challenges.

💥 Deep Dive: Hopsworks 3.0, Connecting Python to the Modern Data Stack

The rise of big data, cloud computing, and artificial intelligence (AI) is transforming the way businesses operate. The ability to collect, store, and process large amounts of data is opening up new opportunities for insights and decision-making. The so-called "Modern Data Stack" (MDS) is a suite of frameworks and tools that has emerged in recent years to help businesses take advantage of these opportunities. The Modern Data Stack includes data lakes and warehouses, ETL and reverse ETL tools, orchestration, monitoring, and much more.

One vast unfilled need in the MDS is enterprise AI. Machine learning is dominated by Python tools and libraries, and the SQL-centric tooling of the MDS is insufficient for the increasingly demanding needs of data scientists. Attempts have been made to transpile Python code to SQL for the MDS, but these efforts fall short: data scientists are unlikely to want to perform dimensionality reduction, variable encoding, and model training and evaluation in user-defined functions and SQL.

In this deep dive, we look at the differences between the MDS and ML worlds, and at how Hopsworks tackles the problem by building a Python-to-SQL bridge.

MDS: a SQL-centric paradigm

The Modern Data Stack is heavily focused on SQL. This declarative language is relatively straightforward to learn and use, and it is the lingua franca for interacting with data warehouses, lakes, and BI tools. SQL is also well suited for distributed computing, which is necessary for processing large amounts of data.
Further, the declarative nature of SQL makes it easier to scale out compute over large volumes of data than a general-purpose programming language like Python, which lacks native distributed computing support. As the Modern Data Stack continues to grow and evolve, it is clear that there is a need for a Python-to-SQL bridge that allows data scientists to take advantage of the benefits of both worlds.

Machine Learning: a Python-centric world

While the world of analytics is dominated by SQL, the world of machine learning is dominated by Python. The language has become the de facto standard for data science due to its flexibility, ease of use, and rich set of libraries and frameworks. Python's grip on machine learning is so pervasive that Stack Overflow's survey results from June 2022 show Pandas, NumPy, TensorFlow, Scikit-Learn, and PyTorch all in the top 11 most popular frameworks and libraries across all languages.

Python has shown itself to be flexible enough for prototyping and reporting in notebooks, for production workflows (such as in Airflow), for parallel processing (PySpark, Ray, Dask), and now even for data-driven user interfaces (Streamlit). In fact, entire serverless ML systems with feature pipelines, batch prediction pipelines, and a user interface can be written in Python, as seen in this Surf Prediction System from PyData London.

Modern Data Stack vs Modern AI Stack: Closing The Gap

Countless machine learning models are never deployed: only about 10-15% make it into production. These sky-high failure rates are often attributed to a lack of talent or resources. In light of an ongoing labor shortage and tightening IT budgets, these explanations seem plausible, but they don't get to the root of the problem. The core issue lies in the lack of production-ready tools and infrastructure for machine learning within the MDS.
The majority of machine learning models are written in Python, but the production MDS is not designed to make it easy to productionize Python-based machine learning models. Data scientists and ML engineers are often left with prototypes that work on data dumps and cannot be easily connected to the rest of the MDS. Without access to features within the MDS, these prototypes are severely limited and cannot take advantage of historical or contextual data. This lack of connectivity is a major reason why so many machine learning models never make it into production.

Meaningfully empowering data scientists means providing them with the tools and infrastructure they need to be successful. In particular, data scientists need to be able to access data within the MDS from Python without having to master the complexities of SQL and data access control.

The Feature Store is one part of the solution to this problem. It is a new layer that bridges some of the infrastructural gap. While the likes of Snowflake's Snowpark have tried to address this problem, they fall short because they don't provide a complete solution. Without its own Feature Store, Snowpark by itself is not enough.

The first Feature Stores were introduced by the Big Data community and were primarily designed for Spark and Flink. However, there has been a noticeable lack of a Python-centric Feature Store that bridges the gap between the SQL world and the Python world. Hopsworks addresses exactly that problem in order to empower data scientists to take full advantage of the Modern Data Stack.

Meet Hopsworks 3.0, the Python-centric feature store

Hopsworks was the first open-source feature store, released at the end of 2018. Now, with the version 3.0 release, it takes a big step toward bridging the Modern Data Stack with the machine learning stack in Python.
With improved Read and Write APIs for Python, Hopsworks 3.0 allows data scientists to work, share, and interact with production and prototype environments in a Python-centric manner. Hopsworks uses transpilation, or source-to-source compilation, to bring the power of SQL to its Python SDK. This enables the seamless transfer of data from warehouses to Python for feature engineering and model training. It also provides a Pandas DataFrame API for writing features and ensures the consistent replication of features between the Online and Offline Stores.

Hopsworks 3.0 also comes with support for Great Expectations, a shared, open standard for data quality that allows for data validation in feature pipelines. Custom transformation functions can be written as Python user-defined functions and applied consistently between training and inference.

To use Hopsworks 3.0 without any infrastructure requirements, try its newly released serverless offering at app.hopsworks.ai.
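
To make the transpilation idea above more concrete, here is a minimal sketch of a Python-to-SQL bridge. It is not the Hopsworks SDK: the `Query` class, its `filter`, `to_sql`, and `read` methods, and the `customer_features` table are all hypothetical names invented for illustration, and an in-memory SQLite database stands in for a data warehouse. The point is only that a data scientist can express a selection in Python while the library generates and runs the SQL.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical toy query object; NOT the Hopsworks API."""
    table: str
    columns: list[str]
    filters: list[tuple[str, str, object]] = field(default_factory=list)

    def filter(self, column: str, op: str, value: object) -> "Query":
        # Only a small whitelist of comparison operators is supported.
        if op not in {"=", "<", ">", "<=", ">="}:
            raise ValueError(f"unsupported operator: {op}")
        return Query(self.table, self.columns, self.filters + [(column, op, value)])

    def to_sql(self) -> tuple[str, list[object]]:
        # Transpile the Python description into a parameterized SQL string.
        # Table and column names are interpolated unchecked, so this toy
        # must only be used with trusted identifiers.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{c} {op} ?" for c, op, _ in self.filters)
        return sql, [value for _, _, value in self.filters]

    def read(self, conn: sqlite3.Connection) -> list[tuple]:
        # Run the generated SQL where the data lives and return the rows.
        sql, params = self.to_sql()
        return conn.execute(sql, params).fetchall()


# SQLite stands in for the warehouse (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_features (customer_id INTEGER, avg_spend REAL, n_orders INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_features VALUES (?, ?, ?)",
    [(1, 20.5, 3), (2, 99.0, 12), (3, 45.0, 7)],
)

# The data scientist writes Python; the SQL is generated for them.
query = Query("customer_features", ["customer_id", "avg_spend"]).filter("n_orders", ">", 5)
sql, params = query.to_sql()
rows = query.read(conn)
print(sql)
print(rows)
```

A real feature store client does far more than this (joins across feature groups, point-in-time correctness, access control), but the division of labor is the same: the Python object describes what to read, and the generated SQL does the work inside the data platform.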