📝 Guest Post: RAG Evaluation Using Ragas*
Was this email forwarded to you? Sign up here In this guest post, the teams from Zilliz and Ragas discuss key RAG evaluation metrics, their calculation, and implementation using the Milvus vector database and the Ragas package. Let’s dive in! Retrieval, a cornerstone of Generative AI systems, is still challenging. Retrieval Augmented Generation, or RAG for short, is an approach to building AI-powered chatbots that answer questions based on data the AI model, an LLM, has been trained on. Evaluation data from sources like WikiEval show very low natural language retrieval accuracy. This means you will probably need to conduct experiments to tune RAG parameters for your GenAI system before deploying it. However, before you can do RAG experimentation, you need a way to evaluate which experiments had the best results! RAG EvaluationUsing Large Language Models (LLMs) as judges has gained prominence in modern RAG evaluation. This approach involves using powerful language models, like OpenAI’s GPT-4, to assess the quality of components in RAG systems. LLMs serve as judges by evaluating the relevance, precision, adherence to instructions, and overall quality of the responses produced by the RAG system. It might seem strange to ask an LLM to evaluate another LLM. According to research, GPT-4 agrees 80% of the time with human labelers. Apparently, humans (in AI terminology called the “Bayesian limit”) do not agree more than 80% with each other! Using the “LLM-as-judge” approach automates and speeds up evaluation and offers scalability while saving cost and time spent on manual human labeling. There are two primary flavors of LLM-as-judge for RAG evaluation:
The rest of this blog will showcase Ragas, which emphasizes automation and scalability for RAG evaluations. Evaluation Data Needed for RagasAccording to the Ragas documentation, your RAG pipeline evaluation will need four key data points.
Ragas Evaluation MetricsYou can find explanations for each metric, including their underlying formulas, in the documentation. For example, faithfulness. Some metrics are:
Details about how these metrics are calculated can be found in their paper. RAG Evaluation Code ExampleThis evaluation code assumes you already have a RAG demo. For my demo, I created a RAG chatbot using Milvus Technical documentation and Milvus vector database for retrieval. Full code for my demo RAG notebook and Eval notebooks are on GitHub. Using that RAG demo, I asked it questions, got the RAG contexts from Milvus, and generated bot responses from an LLM (see the last 2 columns below). Additionally, I provide “ground truth” answers to the same questions (“contexts” column below). You must install OpenAI, (HuggingFace) dataset, ragas, langchain, and pandas.
Convert the pandas dataframe to a HuggingFace Dataset.
The default LLM model Ragas uses is OpenAI’s `gpt-3.5-turbo-16k` and the default embedding model is `text-embedding-ada-002`. You can change both models to whatever you like. I’ll change the LLM-as-judge model to the pinned `gpt-3.5-turbo` since OpenAI’s latest blog announced this is the cheapest. I also changed the embedding model to `text-embedding-3-small` since the blog noted these new embeddings support compression-mode. In the code below, I’m only using the RAG context evaluation metrics to focus on measuring Retrieval quality.
You can see the full code for my demo RAG notebook and Eval notebooks on Git Hub. ConclusionThis blog explored the ongoing retrieval challenge in Generative AI, focusing on Retrieval Augmented Generation (RAG) for natural language AI. Experimentation is needed to optimize RAG parameters with your data using evaluations. Currently, evaluations can be automated using Large Language Models (LLMs) as judges. I discussed some key RAG evaluation metrics and their calculation, along with an implementation using the Milvus vector database and the Ragas package. *This post was originally published on Zilliz.com here. We thank Zilliz for their insights and ongoing support of TheSequence.You’re on the free list for TheSequence Scope and TheSequence Chat. For the full experience, become a paying subscriber to TheSequence Edge. Trusted by thousands of subscribers from the leading AI labs and universities. |
Older messages
Edge 420: Inside FlashAttention-3, The Algorithm Pushing the New Wave of Transformers
Thursday, August 8, 2024
The new algorithm takes full advantage of the capabilities of H100 GPUs. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Edge 419: Everything You Need to Know About Autonomous Agents in 19 Posts
Tuesday, August 6, 2024
A summary of our long series about automous agents. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Gemma 2: A Release That Matters
Sunday, August 4, 2024
A new model, a guardrails framework and an interpretability tool. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
Gemma 2: A Release That Matters
Sunday, August 4, 2024
A new model, a guardrails framework and an interpretability tool. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
📽 [Webinar] Beat GPT-4 with a Small Model and 10 Rows of Data*
Friday, August 2, 2024
Small language models (SLMs) are increasingly rivaling the performance of large foundation models like GPT-4. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
You Might Also Like
Post from Syncfusion Blogs on 12/26/2024
Thursday, December 26, 2024
New blogs from Syncfusion Create a Flutter 3D Column Chart to Showcase the Top 6 Renewable Energy-Consuming Countries By Praveen Balu Let's visualize the top 6 renewable energy-consuming countries
Ruijie Networks' Cloud Platform Flaws Could Expose 50,000 Devices to Remote Attacks
Thursday, December 26, 2024
THN Daily Updates Newsletter cover Improve IT Efficiency with a Standardized OS: Nine considerations for building a standardized operating environment Optimize your IT with a standardized operating
Edge 460: Anthropic's New Protocol to Link AI Assistants to Data Sources
Thursday, December 26, 2024
Model Context Protocols is one of the recent AI contributions of the AI lab. ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏
December 26th 2024
Thursday, December 26, 2024
Curated news all about PHP. Here's the latest edition Is this email not displaying correctly? View it in your browser. PHP Weekly 26th December 2024 Hi everyone, It's boxing day in some parts
Re: This took me 10 minutes and protects my privacy
Thursday, December 26, 2024
Christmas may be over, but you still have one more chance to take advantage of Incogni's amazing holiday promotion! Protect your personal data from hackers and scammers today with Incogni's 58%
Daily Coding Problem: Problem #1648 [Medium]
Wednesday, December 25, 2024
Daily Coding Problem Good morning! Here's your coding interview problem for today. This problem was asked by Quora. Given an absolute pathname that may have . or .. as part of it, return the
🎮 The Best Games to Go With Your New Console — Streaming Services Could Learn From YouTube
Wednesday, December 25, 2024
Also: Don't Throw Christmas Gift Boxes on the Curb, and More! How-To Geek Logo December 25, 2024 Did You Know Years before The Nightmare Before Christmas, Tim Burton was sprinkling references to
Charted | Global Economic Confidence in 2025, by Country 🌎
Wednesday, December 25, 2024
While emerging markets in Asia have the strongest confidence in the global economy looking ahead, European countries are most pessimistic. View Online | Subscribe | Download Our App FEATURED STORY
Top Tech Deals 🎅 Sony Headphones, iPhone Cases, 4K Projector, and More!
Wednesday, December 25, 2024
The season of giving is upon us. How-To Geek Logo December 25, 2024 Top Tech Deals: Sony Headphones, iPhone Cases, 4K Projector, and More! The season of giving is upon us. Happy Holidays! If you're
Why the Race to AGI is Humanitys Defining Moment
Wednesday, December 25, 2024
Top Tech Content sent at Noon! Boost Your Article on HackerNoon for $159.99! Read this email in your browser How are you, @newsletterest1? 🪐 What's happening in tech today, December 25, 2024? The