RAG, or Retrieval Augmented Generation, is a prominent AI framework in the era of large language models (LLMs) like ChatGPT. It enhances the capabilities of these models by integrating external knowledge, ensuring more accurate and current responses. A standard RAG system includes an LLM, a vector database like Milvus, and some prompts as code.
As more and more developers and businesses adopt RAG for building GenAI applications, evaluating the effectiveness of these applications is becoming increasingly important. In another post, we evaluated the performance of two different RAG systems built with the OpenAI Assistants and the Milvus vector database, which shed some light on assessing RAG systems. This post dives deeper into the methodologies used to evaluate RAG applications. We'll also introduce some powerful evaluation tools and highlight standard metrics.
RAG evaluation metrics
Evaluating RAG applications is more than simply comparing a few examples. The key lies in using convincing, quantitative, and reproducible metrics to assess these applications. In this post, we'll introduce three categories of metrics:
- Metrics based on the ground truth
- Metrics without the ground truth
- Metrics based on LLM responses
|
Metrics based on the ground truth
Ground truth refers to well-established answers or knowledge document chunks in a dataset that correspond to user queries. When the ground truth consists of answers, we can directly compare it with the RAG responses, enabling an end-to-end measurement using metrics like answer semantic similarity and answer correctness.
Below is an example of evaluating answers based on their correctness.
- Ground truth: Einstein was born in 1879 in Germany.
- High answer correctness: In 1879, in Germany, Einstein was born.
- Low answer correctness: In Spain, Einstein was born in 1879.
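Answer semantic similarity can be scored automatically by embedding both texts and comparing them. Here is a minimal sketch using the sentence-transformers library; the model choice is an illustrative assumption rather than a recommendation from any particular evaluation framework.
```python
# A minimal sketch of answer semantic similarity using sentence embeddings;
# the model name is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

ground_truth = "Einstein was born in 1879 in Germany."
answer = "In 1879, in Germany, Einstein was born."

embeddings = model.encode([ground_truth, answer])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.3f}")  # values closer to 1.0 indicate closer meaning
```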
|
When the ground truth consists of chunks from the knowledge documents, we can evaluate the correlation between those chunks and the retrieved contexts using traditional metrics such as Exact Match (EM), ROUGE-L, and F1. In essence, we are evaluating the retrieval effectiveness of the RAG application.
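For a rough sense of how these traditional retrieval metrics work, here is a minimal, deliberately naive sketch of token-level Exact Match and F1 in plain Python (production evaluations typically rely on the reference implementations that ship with benchmarks or evaluation libraries):
```python
# Naive token-level Exact Match and F1 between a retrieved context and a
# ground-truth chunk; a sketch only, not a benchmark-grade implementation.
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

def exact_match(prediction: str, reference: str) -> int:
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

retrieved = "Einstein was born in Germany in 1879."
ground_truth_chunk = "Einstein was born in 1879 in Germany."
print(exact_match(retrieved, ground_truth_chunk))  # 0: not an exact string match
print(token_f1(retrieved, ground_truth_chunk))     # 1.0: same bag of tokens
```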
How to generate the ground truth for your own dataset
We have now established the importance of using datasets with the ground truth for evaluating RAG applications. However, what if you want to assess a RAG application using your private datasets without annotated ground truth? How do you generate the required ground truth for your datasets?
The simplest method is to ask an LLM like ChatGPT to generate sample questions and answers based on your proprietary dataset. Tools like Ragas and LlamaIndex also provide methods for generating test data tailored to your knowledge documents.
Figure: sample questions and answers generated by the Ragas evaluation tool. Image Credit: Ragas
|
These generated test datasets, comprising questions, context, and corresponding answers, facilitate quantitative evaluation without reliance on unrelated external baseline datasets. This approach empowers users to assess RAG systems using their unique data, ensuring a more customized and meaningful evaluation process.
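As an illustration, the sketch below generates evaluation questions from local documents with LlamaIndex's DatasetGenerator. It assumes a pre-0.10 llama_index install (matching the import style used later in this post); the directory path and the number of questions are placeholders.
```python
# A sketch of test-question generation with LlamaIndex; assumes a pre-0.10
# llama_index install, and "./my_docs" is a placeholder path.
from llama_index import SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

documents = SimpleDirectoryReader("./my_docs").load_data()

generator = DatasetGenerator.from_documents(documents)
eval_questions = generator.generate_questions_from_nodes(num=20)

print(eval_questions[:5])  # inspect a few generated questions before relying on them
```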
Metrics without the ground truth
We can still evaluate RAG applications without a ground truth for each query. TruLens-Eval, an open-source evaluation tool, introduced the concept of the RAG Triad, which focuses on evaluating the relevance among the elements of the query, context, and response triplet. The three corresponding metrics are:
- Context Relevance: measures how well the retrieved context supports the query.
- Groundedness: assesses the extent to which the LLM's response aligns with the retrieved context.
- Answer Relevance: gauges the relevance of the final response to the query.
|
Below is an example of evaluating answers based on their relevance to the question.
- Question: Where is France and what is its capital?
- Low relevance answer: France is in western Europe.
- High relevance answer: France is in western Europe and Paris is its capital.
|
|
Additionally, these triad metrics can be subdivided further, increasing the granularity of evaluation. For example, Ragas (an open-source framework dedicated to evaluating the performance of RAG systems) splits context-level evaluation into three more detailed metrics: context precision, context relevance, and context recall.
Metrics based on LLM responses
This category of metrics evaluates LLM responses, considering factors such as friendliness, harmfulness, and conciseness. For example, LangChain proposes metrics such as conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, criminality, and insensitivity.
Below is an example of evaluating answers based on their conciseness.
- Question: What's 2+2?
- Low conciseness answer: What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.
- High conciseness answer: 4
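In LangChain these are exposed as criteria evaluators. Below is a minimal sketch scoring the conciseness example above; it assumes an older-style langchain install with an OpenAI API key configured in the environment.
```python
# A minimal sketch of LangChain's criteria evaluator for conciseness;
# assumes an OpenAI API key is available in the environment.
from langchain.chat_models import ChatOpenAI
from langchain.evaluation import load_evaluator

evaluator = load_evaluator(
    "criteria",
    criteria="conciseness",
    llm=ChatOpenAI(model="gpt-4", temperature=0),
)

result = evaluator.evaluate_strings(
    input="What's 2+2?",
    prediction="What's 2+2? That's an elementary question. "
               "The answer you're looking for is that two and two is four.",
)
print(result)  # typically contains a reasoning string, a value ("Y"/"N"), and a score
```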
|
Using LLMs to score the metrics
Most of the metrics mentioned above require feeding text to a scorer to obtain a score, which has traditionally taken considerable manual effort. The good news is that this process becomes much more manageable with the advent of LLMs like GPT-4: all you need to do is design a suitable prompt.
The paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" proposes a prompt design for GPT-4 to judge the quality of an AI assistant's response to a user question. Below is a quick example:
```
[System]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, please rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]
```
This prompt asks GPT-4 to evaluate the quality of the response and rate it on a scale of 1 to 10.
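To turn such a judgment into an automatic score, the prompt can be sent to the model and the rating parsed from the reply. The sketch below uses the OpenAI Python client (v1.x); the model name, the condensed prompt, and the parsing regex are illustrative assumptions rather than details from the paper.
```python
# A rough sketch of LLM-as-a-judge scoring with the OpenAI Python client (v1.x);
# the judge prompt is a condensed version of the MT-Bench-style prompt above.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question displayed below. "
    "Begin your evaluation by providing a short explanation, then rate the response "
    'on a scale of 1 to 10 by strictly following this format: "[[rating]]", '
    'for example: "Rating: [[5]]".'
)

USER_TEMPLATE = """[Question]
{question}

[The Start of Assistant's Answer]
{answer}
[The End of Assistant's Answer]"""

def judge(question, answer):
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(question=question, answer=answer)},
        ],
    )
    reply = completion.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else None

print(judge("Where is France and what is its capital?",
            "France is in western Europe and Paris is its capital."))
```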
|
|
It is important to note that GPT-4, like any judge, is not infallible and may have biases and make errors. Prompt design is therefore crucial, and advanced prompt engineering techniques such as few-shot prompting or Chain-of-Thought (CoT) may be necessary. Fortunately, we don't have to worry too much about this, because many RAG evaluation tools have already integrated well-designed prompts.
RAG evaluation tools
Now that we have covered how to evaluate a RAG application, let's explore some tools for doing so, with insights into how each works and which use cases it fits best.
Ragas: streamlined RAG evaluation
Ragas is an open-source evaluation tool for assessing RAG applications. With a simple interface, Ragas streamlines the evaluation process. By creating a dataset instance in the required format, users can quickly initiate evaluations and obtain metrics such as 'ragas_score,' 'context_precision,' 'faithfulness,' and 'answer_relevancy.'
```python
from ragas import evaluate
from datasets import Dataset

dataset: Dataset

results = evaluate(dataset)
# {'ragas_score': 0.860, 'context_precision': 0.817,
#  'faithfulness': 0.892, 'answer_relevancy': 0.874}
```
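For reference, the dataset instance above is a Hugging Face Dataset holding questions, generated answers, retrieved contexts, and ground truths. A minimal hand-built example might look like the sketch below; the column names follow older Ragas releases and have changed in newer versions, so treat them as assumptions.
```python
# A sketch of a hand-built Ragas evaluation dataset; column names
# ("question", "answer", "contexts", "ground_truths") follow older Ragas releases.
from datasets import Dataset

dataset = Dataset.from_dict({
    "question": ["When and where was Einstein born?"],
    "answer": ["Einstein was born in 1879 in Germany."],
    "contexts": [["Albert Einstein was born on 14 March 1879 in Ulm, Germany."]],
    "ground_truths": [["Einstein was born in 1879 in Germany."]],
})
```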
|
|
Ragas supports a variety of metrics and imposes no specific framework requirements, giving you flexibility in evaluating different RAG applications. It also enables real-time monitoring of evaluations via LangSmith, providing insight into the reasoning behind each assessment and into API key consumption.
LlamaIndex: building and evaluating with ease
LlamaIndex is a robust AI framework for building RAG applications, and it ships with its own RAG evaluation tooling. It is especially handy for assessing applications built within its framework.
```python
from llama_index.evaluation import BatchEvalRunner
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

service_context_gpt4 = ...
vector_index = ...
question_list = ...

faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
)

eval_results = runner.evaluate_queries(
    vector_index.as_query_engine(), queries=question_list
)
```
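The runner returns one list of results per metric. Assuming each result exposes a boolean passing attribute, as in recent llama_index releases, a simple pass rate can be derived as sketched below:
```python
# A sketch of summarizing BatchEvalRunner output; assumes each result exposes
# a boolean `passing` attribute, as in recent llama_index releases.
for metric, results in eval_results.items():
    passed = sum(1 for r in results if r.passing)
    print(f"{metric}: {passed}/{len(results)} queries passed")
```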
|
|
TruLens-Eval: integrated evaluation for diverse frameworks
TruLens-Eval provides an easy way to evaluate RAG applications built with LangChain and LlamaIndex. The following code snippet shows how to set up the evaluation for a LangChain-based RAG application.
```python
from trulens_eval import TruChain, Feedback, Tru, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI
import numpy as np

tru = Tru()
rag_chain = ...

# Initialization and feedback setup...

tru_recorder = TruChain(rag_chain,
                        app_id='Chain1_ChatApplication',
                        feedbacks=[f_qa_relevance, f_groundedness])

tru.run_dashboard()
```
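The feedback setup elided in the snippet above might look roughly like the sketch below. The API calls follow trulens_eval 0.x (matching the imports shown), and the context selector is left as a placeholder because its exact path depends on how your chain is structured; treat the whole block as an assumption rather than the canonical setup.
```python
# A rough sketch of the RAG Triad feedback setup; follows trulens_eval 0.x,
# and the context selector is a placeholder that depends on your chain.
provider = OpenAI()
grounded = Groundedness(groundedness_provider=provider)

context_selector = ...  # e.g. built with Select, pointing at the retriever's output

f_qa_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context_selector)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
```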
|
|
TruLens-Eval can assess RAG apps built with other frameworks, but implementing it in code can be complex. Refer to the official documentation for more details.
In addition, TruLens-Eval also offers visual monitoring in the browser for analyzing evaluation reasons and observing API key usage.
Phoenix: evaluating LLMs with flexibility
Phoenix provides a complete set of metrics for evaluating LLMs, including the quality of generated embeddings and of the LLM's responses. It can also assess RAG applications, although it offers fewer RAG-specific metrics than the other evaluation tools mentioned above. The following code snippet shows how to use Phoenix to evaluate a RAG application built with LlamaIndex.
```python
import phoenix as px
from llama_index import set_global_handler
from phoenix.experimental.evals import llm_classify, OpenAIModel, RAG_RELEVANCY_PROMPT_TEMPLATE, \
    RAG_RELEVANCY_PROMPT_RAILS_MAP
from phoenix.session.evaluation import get_retrieved_documents

px.launch_app()
set_global_handler("arize_phoenix")
print("phoenix URL", px.active_session().url)

query_engine = ...
question_list = ...

for question in question_list:
    response_vector = query_engine.query(question)

retrieved_documents = get_retrieved_documents(px.active_session())

retrieved_documents_relevance = llm_classify(
    dataframe=retrieved_documents,
    model=OpenAIModel(model_name="gpt-4-1106-preview"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
```
|
|
Other tools
Apart from the above-mentioned tools, other platforms like DeepEval, LangSmith, and OpenAI Evals also offer capabilities for evaluating RAG applications. Their methodologies are similar, but prompt design and implementation specifics vary, so be sure to pick the tool that works best for you.
Summary
In conclusion, we reviewed methodologies, metrics, and tools for evaluating RAG applications. In particular, we explored three categories of metrics: those based on a ground truth, those without a ground truth, and those based on the responses of large language models (LLMs).
|
Ground-truth metrics involve comparing RAG responses with established answers. In contrast, metrics without a ground truth, such as the RAG Triad, focus on evaluating the relevance between queries, context, and responses. Metrics based on LLM responses consider factors such as friendliness, harmfulness, and conciseness.
We also explored using LLMs to score metrics through well-designed prompts and introduced a set of RAG evaluation tools, including Ragas, LlamaIndex, TruLens-Eval, and Phoenix, to help with this task.
In the fast-changing world of AI, regularly evaluating and enhancing RAG applications is crucial for their reliability. Using the methodologies, metrics, and tools discussed here, developers and businesses can make informed decisions about the performance and capabilities of their RAG systems, driving the progress of AI applications.
|
*This post was written by Cheney Zhang and originally published on Zilliz.com here. We thank Zilliz for their insights and ongoing support of Turing Post. |
|
|