📝 Guest Post: Using LLMs from Hugging Face? Fix your model failure points 10x faster with Galileo Data Intelligen…
Large Language Models (LLMs) are powerful assets for data scientists to leverage within their applications, and Hugging Face is a leading repository for LLMs today. In practice, however, the quality of the training data governs model performance, and data scientists often spend 80% of their time in Excel sheets and Python scripts hunting for the data that pulls model performance down, whether while training a model or for models in production. In this guest post, Galileo co-founder and CEO Vikram Chatterji explains how to find and fix these data errors in just a few lines of code while training a Hugging Face model.
Want to try Galileo for yourself? Feel free to reach out here to get a personalized demo from a member of the Galileo data science team.

🚀 Few Lines of Code: Using Galileo While Training a Hugging Face Model

We'll be using the popular CoNLLpp dataset. Using Galileo, we will quickly be able to find a host of data errors.
STEP 1: Install `dataquality` and initialize Galileo

For this tutorial, you need at least Python 3.7. As a development environment, you can use Google Colaboratory. The first step is to install `dataquality` (Galileo's Python client) along with `datasets`, `evaluate`, and `transformers` (Hugging Face).
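A minimal setup sketch, assuming a notebook environment such as Colab; the `dq.init` task type and the project/run names are illustrative placeholders rather than values from the original post:

```python
# Install Galileo's Python client plus the Hugging Face libraries used in this demo
# (the leading "!" assumes a notebook environment such as Google Colab)
!pip install dataquality datasets evaluate transformers

import dataquality as dq

# Initialize a Galileo run for an NER task; project and run names are placeholders
dq.init(task_type="text_ner", project_name="conllpp_demo", run_name="ner_demo_run")
```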
STEP 2: Load, Tokenize and Log the Hugging Face 🤗 Dataset

The next step is to load your dataset. For this demo, we will use the popular `conllpp` dataset, which follows the same NER data format as any other Hugging Face dataset. Galileo provides Hugging Face integrations to handle tokenization and label alignment, and behind the scenes it logs your input data automatically.

STEP 3: Training the NER Model

Now we're ready to train our Hugging Face model for a Named Entity Recognition task. Ordinarily you would simply call trainer.train() and be set, but we're here to drill down into this dataset and find data errors and samples the model struggles with. To achieve that, we wrap the trainer in Galileo's "watch" function and call dq.finish() at the end to publish the results to Galileo, as sketched below. It's THAT simple!
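A condensed sketch of Steps 2 and 3, under some assumptions: the `tokenize_and_log_dataset` helper and the `watch` import path reflect the `dataquality` Hugging Face integrations as I understand them (check the Galileo docs for exact names), and the model choice and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)
import dataquality as dq
# Assumed integration helpers; verify exact import paths in the dataquality docs
from dataquality.integrations.hf import tokenize_and_log_dataset
from dataquality.integrations.transformers_trainer import watch

# STEP 2: load conllpp, tokenize it, align NER labels, and log inputs to Galileo
ds = load_dataset("conllpp")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_ds = tokenize_and_log_dataset(ds, tokenizer)

labels = ds["train"].features["ner_tags"].feature.names
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels)
)

# STEP 3: a standard Hugging Face Trainer; hyperparameters are placeholders
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner_model", num_train_epochs=3,
                           evaluation_strategy="epoch"),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)

watch(trainer)   # wrap the trainer so Galileo logs model outputs during training
trainer.train()
dq.finish()      # publish the run; a link to the Galileo Console is printed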
When the model finishes training, you'll see a link to the Galileo Console.

⚠️⚠️ Find and fix data errors instantly: Data-centric model inspection with Galileo

At a glance, Galileo points out the data that is pulling your model performance down. The Galileo console is designed to let you explore your data in depth while providing out-of-the-box alerts that act as jumping-off points to find problematic pockets of data. On the right, you can view your dataset in table form or in the embedding space.

DATA ERROR 1: Regions of high Data Error Potential (DEP), a high-precision ML data quality metric

The dataset is sorted by each sample's Data Error Potential score, a metric built by Galileo to provide a holistic data quality score for every sample and to identify the samples in the dataset contributing to low or high model performance.
DATA ERROR 2: Missed Annotations

CoNLLpp, despite being a heavily peer-reviewed dataset, still has many missing annotations. Galileo surfaces these via the "Missed Annotations" alert. Clicking on it lets you inspect further and, in one click, add the annotations in-tool or send them to your labeling tool.

DATA ERROR 3: Errors in Labels

Human labelers often assign the incorrect ground truth. Again, despite CoNLLpp being a corrected dataset with only 4 classes (Location, Person, Organization, Misc), there are still a number of mislabels. Using the "Likely Mislabeled" alert card, Galileo exposes the mislabeled data with high precision. Again, with one click, we can fix these samples by re-labeling within Galileo or by exporting them to a labeling tool through Galileo's integrations.

Conclusion

We covered how to fine-tune a model for NER tasks using the powerful Hugging Face library and then use Galileo to inspect the quality of the model and dataset. This is only a fraction of what you can achieve using Galileo (more documentation here). Feel free to reach out here to get a personalized demo from a member of the Galileo data science team. Hope this proved useful, and happy building!

*This post was written by Vikram Chatterji, the co-founder and CEO of Galileo. We thank Galileo for their support of TheSequence.