📝 Guest post: Fast Access to Feature Data for AI Applications with Hopsworks*
In this article, the Hopsworks team dives into the requirements of AI-powered online applications and how the Hopsworks Feature Store abstracts away the complexity of a dual storage system. Enterprise machine learning models are most valuable when they power part of a product by guiding user interaction. Often these ML models are applied to an entire database of entities, for example users identified by a unique primary key. An example of such an offline application is predictive Customer Lifetime Value, where predictions are precomputed in batches at regular intervals (nightly, weekly) and then used to select target audiences for marketing campaigns. More advanced AI-powered applications, however, guide user interaction in real time, such as recommender systems. For these online applications, some part of the model input (the feature vector) is available in the application itself, such as the last button clicked, while other parts of the feature vector rely on historical or contextual data and have to be retrieved from backend storage, such as the number of times the user clicked that button in the last hour or whether the button is popular.

Machine Learning Models in Production

While batch applications with (analytical) models are largely similar to the training of the model itself, requiring efficient access to the large volumes of data that will be scored, online applications require low-latency access to the latest feature values for a given (potentially multi-part) primary key, which are then sent as a feature vector to the model-serving instance for inference. To the best of our knowledge, there is no single database that accommodates both of these requirements at high performance.
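The split described above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the Hopsworks API: the online store is modeled as a dict keyed by primary key, and the request-time signal (the last button clicked) is merged with historical features looked up from that store before the vector is sent to the model.

```python
# Hypothetical sketch: assembling a feature vector for online inference.
# The dict stands in for the low-latency online feature store lookup.
online_store = {
    "user_42": {"clicks_last_hour": 7, "button_popularity": 0.93},
}

def build_feature_vector(user_id, request_features):
    """Merge request-time signals with historical features from the online store."""
    historical = online_store.get(
        user_id, {"clicks_last_hour": 0, "button_popularity": 0.0}
    )
    return {**request_features, **historical}

# Request-time context ("last button clicked") plus historical features:
vector = build_feature_vector("user_42", {"last_button": "buy_now"})
# The resulting vector is what gets sent to the model-serving instance.
```

The key point is that neither side alone is a complete model input: the application only knows the current interaction, and the online store only knows the accumulated history.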
Therefore, data teams tended to keep the data for training and batch inference in data lakes, while ML engineers built microservices to replicate the feature engineering logic for online applications. This, however, introduces unnecessary obstacles for both data scientists and ML engineers, slowing iteration and significantly increasing the time to production for ML models.
Hopsworks Feature Store: A Transparent Dual Storage System

The Hopsworks Feature Store is a dual storage system, consisting of high-bandwidth (low-cost) offline storage and a low-latency online store. The offline storage is a mix of Apache Hudi tables on our HopsFS file system (backed by S3 or Azure Blob Storage) and external tables (such as Snowflake and Redshift), together providing access to large volumes of feature data for training or batch scoring. In contrast, the online store is a low-latency key-value database that stores only the latest value of each feature, keyed by primary key. The online feature store thereby acts as a low-latency cache for these feature values. For this system to be valuable to data scientists, to improve time to production, and to provide a good experience for the end user, it needs to meet several requirements.
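The dual-storage idea can be illustrated with a minimal sketch (hypothetical stand-ins, not the real storage engines): every feature write is appended to an offline history, as Hudi tables retain commits for training and batch scoring, while the online store is upserted so it only ever holds the latest value per primary key.

```python
# Hypothetical sketch of dual storage: appends for the offline history,
# upserts for the online "latest value" cache.
offline_log = []   # full history; stand-in for Hudi tables on HopsFS
online_store = {}  # latest values only; stand-in for the online store

def write_feature(primary_key, features):
    offline_log.append((primary_key, dict(features)))  # append, never overwrite
    online_store[primary_key] = dict(features)         # upsert the latest value

write_feature("user_42", {"clicks_last_hour": 3})
write_feature("user_42", {"clicks_last_hour": 7})
# offline_log now holds both versions (useful for point-in-time training data);
# online_store holds only {"clicks_last_hour": 7} for "user_42".
```

The two access patterns explain why a single database struggles to serve both: the offline side is optimized for scanning large histories, the online side for single-key reads of the freshest value.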
The Hopsworks Online Feature Store is built around four pillars that satisfy these requirements while scaling to manage large amounts of data.
RonDB: The Online Feature Store, Foundation of the File System and Metadata

Hopsworks is built from the ground up around distributed, scale-out metadata. This helps ensure the consistency and scalability of the services within Hopsworks, as well as the annotation and discoverability of data and ML artifacts. Since the first release, Hopsworks has used NDB Cluster (a precursor to RonDB) as the online feature store. In 2020, we created RonDB as a managed version of NDB Cluster, optimized for use as an online feature store. In Hopsworks, however, we use RonDB for more than just the online feature store. RonDB also stores the metadata for the whole Feature Store, including schemas, statistics, and commits, as well as the metadata of the file system, HopsFS, in which the offline Hudi tables are stored. Using RonDB as a single metadata database, we use transactions and foreign keys to keep the Feature Store and Hudi metadata consistent with the target files and directories (inodes). Hopsworks is accessible through a REST API, through an intuitive UI (that includes a Feature Catalog), or programmatically through the Hopsworks Feature Store API (HSFS).

OnlineFS: The Engine for Scalable Online Feature Materialization

With the underlying RonDB and the needed metadata in place, we were able to build a scale-out, high-throughput materialization service to perform the updates, deletes, and writes on the online feature store; we simply named it OnlineFS. OnlineFS is a stateless service that uses ClusterJ for direct access to the RonDB data nodes. ClusterJ is implemented as a high-performance JNI layer on top of the native C++ NDB API, providing low latency and high throughput.
We were able to make OnlineFS stateless thanks to the availability of the metadata in RonDB, such as Avro schemas and feature types. Making the service stateless allows us to scale writes to the online feature store up and down by simply adding or removing instances of the service, thereby increasing or decreasing throughput linearly with the number of instances. Several steps are needed to write data to the online feature store.
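The materialization path can be sketched as follows. This is a simplified, hypothetical model: a queue stands in for the event log that carries writes, a dict stands in for RonDB, and `materialize_one` plays the role of one step of a stateless OnlineFS-style worker. Because each worker holds no state of its own, adding more workers consuming from the log is what scales write throughput.

```python
# Hypothetical sketch of the materialization path: writes land on a log,
# and stateless workers consume events and upsert latest values into the
# online store. The real system uses ClusterJ to upsert into RonDB.
from queue import Queue

event_log = Queue()  # stand-in for the write-ahead event log
online_store = {}    # stand-in for the RonDB online store

def produce(primary_key, features):
    """Application side: publish a feature update to the log."""
    event_log.put((primary_key, features))

def materialize_one():
    """One stateless worker step: consume an event, upsert into the store."""
    primary_key, features = event_log.get()
    online_store[primary_key] = features

produce("user_42", {"clicks_last_hour": 7})
materialize_one()
# online_store now serves the latest value for "user_42" at low latency.
```

Statelessness is the design choice that matters here: since any worker can process any event using only the event itself plus shared metadata, instances can be added or removed freely without coordination or rebalancing of local state.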
To learn more about each step and about transparency in distributed systems, please continue reading here. In our blog, we also provide some benchmarks for quantitative comparison.

*This post was written by the Hopsworks team and originally posted here. We thank Hopsworks for their ongoing support of TheSequence.