📝 Guest Post: RAG Evaluation Using Ragas*
In this guest post, the teams from Zilliz and Ragas discuss key RAG evaluation metrics, their calculation, and implementation using the Milvus vector database and the Ragas package. Let’s dive in!

Retrieval, a cornerstone of Generative AI systems, is still challenging. Retrieval Augmented Generation, or RAG for short, is an approach to building AI-powered chatbots that answer questions based on data retrieved at query time, not only on data the AI model, an LLM, has been trained on. Evaluation data from sources like WikiEval show very low natural language retrieval accuracy. This means you will probably need to run experiments to tune RAG parameters for your GenAI system before deploying it. However, before you can experiment with RAG, you need a way to evaluate which experiments had the best results!

RAG Evaluation

Using Large Language Models (LLMs) as judges has gained prominence in modern RAG evaluation. This approach uses powerful language models, like OpenAI’s GPT-4, to assess the quality of components in RAG systems. LLMs serve as judges by evaluating the relevance, precision, adherence to instructions, and overall quality of the responses produced by the RAG system.

It might seem strange to ask an LLM to evaluate another LLM. According to research, GPT-4 agrees 80% of the time with human labelers. Apparently, humans (whose agreement rate is called the “Bayesian limit” in AI terminology) do not agree with each other more than 80% of the time! The “LLM-as-judge” approach automates and speeds up evaluation and scales well, while saving the cost and time of manual human labeling.

There are two primary flavors of LLM-as-judge for RAG evaluation:
The rest of this blog will showcase Ragas, which emphasizes automation and scalability for RAG evaluations.

Evaluation Data Needed for Ragas

According to the Ragas documentation, your RAG pipeline evaluation will need four key data points.
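As a rough sketch of those four data points, one evaluation record might look like the dictionary below. The field names `question`, `contexts`, `answer`, and `ground_truth` are assumptions based on common Ragas usage and may differ between Ragas versions; the sample text and the `validate_record` helper are hypothetical and only for illustration.

```python
# Assumed field names; check the Ragas documentation for the version you use.
REQUIRED_FIELDS = ("question", "contexts", "answer", "ground_truth")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one evaluation record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
    contexts = record.get("contexts")
    # Retrieved contexts are a list of text chunks, one per retrieved passage.
    if contexts is not None and not (
        isinstance(contexts, list) and all(isinstance(c, str) for c in contexts)
    ):
        problems.append("contexts must be a list of strings")
    return problems


# Illustrative sample only, not taken from the demo notebook.
sample = {
    "question": "A question a user asked the chatbot.",
    "contexts": ["First chunk returned by the retriever.",
                 "Second chunk returned by the retriever."],
    "answer": "The answer generated by the RAG chatbot.",
    "ground_truth": "The reference answer written by a human.",
}

print(validate_record(sample))
```

Checking records like this before running an evaluation can surface a missing column or a context stored as a single string early, before any LLM calls are made.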
Ragas Evaluation Metrics

You can find explanations for each metric, including their underlying formulas, in the documentation (for example, the page on faithfulness). Some metrics are:
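To make the idea of a metric formula concrete, here is a minimal sketch of how three commonly cited Ragas metrics (faithfulness, context recall, and context precision) reduce to simple ratios once per-item verdicts are available. In Ragas those verdicts come from an LLM judge; here they are passed in as plain booleans. The function names and inputs are hypothetical, and the formulas follow a reading of the Ragas documentation, so the library's actual implementation may differ in detail.

```python
def faithfulness(claim_supported):
    """Fraction of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall(gt_sentence_attributed):
    """Fraction of ground-truth sentences attributable to the retrieved context."""
    if not gt_sentence_attributed:
        return 0.0
    return sum(gt_sentence_attributed) / len(gt_sentence_attributed)


def context_precision(chunk_relevant):
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    precisions = []
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)  # precision@k at a relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical verdicts for a single question.
print(faithfulness([True, True, False]))       # 2 of 3 claims supported
print(context_recall([True, False]))           # 1 of 2 ground-truth sentences covered
print(context_precision([True, False, True]))  # relevant chunks at ranks 1 and 3
```

Context precision rewards rankings that place relevant chunks first: the same two relevant chunks at ranks 2 and 3 would score lower than at ranks 1 and 3.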
Details about how these metrics are calculated can be found in their paper.

RAG Evaluation Code Example

This evaluation code assumes you already have a RAG demo. For my demo, I created a RAG chatbot using the Milvus technical documentation and the Milvus vector database for retrieval. The full code for my demo RAG notebook and Eval notebooks is on GitHub.

Using that RAG demo, I asked it questions, got the RAG contexts from Milvus, and generated bot responses from an LLM (see the last 2 columns below). Additionally, I provided “ground truth” answers to the same questions (the ground-truth column below).

You must install openai, (HuggingFace) datasets, ragas, langchain, and pandas.
Convert the pandas dataframe to a HuggingFace Dataset.
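A minimal sketch of that conversion follows. The column names `question`, `contexts`, `answer`, and `ground_truth`, and the helper names `prepare_eval_columns` and `to_hf_dataset`, are assumptions for illustration; the demo notebook may name things differently. The `datasets` import is deferred so the column-preparation helper can be tried without the library installed.

```python
def prepare_eval_columns(rows):
    """Turn a list of per-question dicts into column lists for a Dataset."""
    columns = {"question": [], "contexts": [], "answer": [], "ground_truth": []}
    for row in rows:
        ctx = row["contexts"]
        # Wrap a single context string in a list so every row holds a list of strings.
        columns["contexts"].append([ctx] if isinstance(ctx, str) else list(ctx))
        for name in ("question", "answer", "ground_truth"):
            columns[name].append(row[name])
    return columns


def to_hf_dataset(eval_df):
    """Convert a pandas DataFrame of evaluation rows into a HuggingFace Dataset."""
    # Imported lazily; requires the `datasets` package.
    from datasets import Dataset

    rows = eval_df.to_dict(orient="records")
    return Dataset.from_dict(prepare_eval_columns(rows))
```

When the dataframe columns already have the expected names and types, `Dataset.from_pandas(eval_df)` is a shorter alternative.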
The default LLM model Ragas uses is OpenAI’s `gpt-3.5-turbo-16k` and the default embedding model is `text-embedding-ada-002`. You can change both models to whatever you like. I’ll change the LLM-as-judge model to the pinned `gpt-3.5-turbo` since OpenAI’s latest blog announced this is the cheapest. I also changed the embedding model to `text-embedding-3-small` since the blog noted these new embeddings support compression-mode. In the code below, I’m only using the RAG context evaluation metrics to focus on measuring Retrieval quality.
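The sketch below shows one way such an evaluation could be wired up. It assumes a Ragas version whose `evaluate` function accepts `llm` and `embeddings` arguments and LangChain model objects, that `langchain-openai` is installed, and that `OPENAI_API_KEY` is set in the environment. The function name `run_context_eval` is hypothetical, and this is not necessarily the code used in the demo notebook.

```python
JUDGE_LLM = "gpt-3.5-turbo"                 # LLM-as-judge model named in the text
EMBEDDING_MODEL = "text-embedding-3-small"  # embedding model named in the text


def run_context_eval(eval_dataset):
    """Score retrieval quality using only the two context metrics."""
    # Lazy imports: these need ragas and langchain-openai installed,
    # and the call below sends requests to the OpenAI API.
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from ragas import evaluate
    from ragas.metrics import context_precision, context_recall

    return evaluate(
        eval_dataset,
        metrics=[context_precision, context_recall],
        llm=ChatOpenAI(model=JUDGE_LLM, temperature=0),
        embeddings=OpenAIEmbeddings(model=EMBEDDING_MODEL),
    )
```

Calling `run_context_eval` on the HuggingFace Dataset from the previous step should return a result object with one score per metric; depending on the Ragas version, it can typically be converted to a dataframe for per-question inspection.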
You can see the full code for my demo RAG notebook and Eval notebooks on GitHub.

Conclusion

This blog explored the ongoing retrieval challenge in Generative AI, focusing on Retrieval Augmented Generation (RAG) for natural language AI. Tuning RAG parameters for your data takes experimentation, and experimentation requires evaluation. Currently, evaluations can be automated using Large Language Models (LLMs) as judges. I discussed some key RAG evaluation metrics and their calculation, along with an implementation using the Milvus vector database and the Ragas package.

*This post was originally published on Zilliz.com here. We thank Zilliz for their insights and ongoing support of TheSequence.