Evaluating Retrieval-Augmented Generation Systems: A Quick Guide 
5 Nov, 2024

Many companies, particularly in sectors like legal, finance, and real estate, manage vast document databases. Their daily tasks often involve quickly retrieving information from these databases to make informed decisions. Effective document retrieval is a key component of professional knowledge management. Increasingly, companies are leveraging AI techniques to help employees access their private knowledge and resources more efficiently. One such technique, Retrieval-Augmented Generation (RAG), combines semantic search with Large Language Models (LLMs) to meet this crucial need.

As with any AI system, building a robust RAG system requires numerous iterations over two major phases: (1) Development/Improvement and (2) Evaluation. The goal of this article is to focus on the evaluation process, discussing key concepts and metrics to consider when assessing the performance of a RAG system. Proper evaluation helps identify which components require improvement to enhance the system’s overall effectiveness. 

In essence, a RAG system takes both questions and documents (whether private or public) as inputs and generates answers. These answers may or may not have a predefined “ground truth.” The RAG system is composed of two primary components: retrieval and augmented generation. 

The retrieval process consists of two sub-processes: embedding and semantic search. Embedding converts the question into a vector, while the private document database is chunked and embedded into a vector database, or vector store. Semantic search then identifies the document chunks that are semantically relevant to the question. The output of this process is a set of retrieved documents, which serve as augmented context for the subsequent generation stage.
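To make the retrieval step concrete, here is a minimal sketch assuming the sentence-transformers library and a simple in-memory vector store; the model name and document chunks are illustrative choices, not the article's actual setup.

```python
# Minimal retrieval sketch: embed chunks, then rank them against the
# question by cosine similarity. A production system would use a real
# vector store instead of an in-memory matrix.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# In a real system these would be chunks of the private document database.
chunks = [
    "Inception-ResNet-v2 combines the Inception architecture with residual connections.",
    "Residual connections ease the training of very deep networks.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Embed the question and return the k most similar chunks."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q  # dot product of unit vectors = cosine similarity
    return [chunks[i] for i in np.argsort(-scores)[:k]]

print(retrieve("Which architecture uses residual connections?"))
```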

The augmented generation process merges the original question with the retrieved documents to form a prompt, using prompt-engineering techniques. This enriched prompt guides the LLM (such as OpenAI's GPT-3.5, GPT-4, or Llama 3.1, among others) in generating a relevant response in natural language.
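A minimal sketch of this step, assuming the OpenAI Python client; the model name and prompt wording are assumptions for illustration, not the article's exact configuration.

```python
# Augmented-generation sketch: stuff the retrieved chunks into the
# prompt and ask the LLM to answer strictly from that context.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_answer(question: str, retrieved_chunks: list[str]) -> str:
    """Build an augmented prompt from the retrieved chunks and query the LLM."""
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Chaining the two sketches, `generate_answer(question, retrieve(question))` would run the full RAG loop end to end.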

Evaluating a RAG system involves assessing the performance of both its sub-components: retrieval and generation. Specifically, this means measuring the relevance between the question, the retrieved documents, the generated answer, and, if available, the ground truth. These elements represent the key inputs and outputs of the two components in the RAG system. 

A commonly used framework for evaluating a RAG system is the RAG Triad, developed by TruEra. It assesses RAG applications for potential hallucinations across three key dimensions: context relevance, groundedness (faithfulness), and answer relevance.

Context Relevance, Groundedness, and Answer Relevance 

  • Context Relevance: Evaluates whether the retrieved context is relevant to the input query or question. This component answers the question: “Is the context relevant to the query?” 
  • Groundedness (Faithfulness): Determines whether the LLM’s response is supported by and grounded in the retrieved context. This component answers the question: “Is the response backed by the retrieved context?” 
  • Answer Relevance: Verifies whether the final response is directly relevant to and effectively answers the original user query. This component answers the question: “Is the answer relevant to the query?” 

It is evident that context relevance allows us to evaluate the retrieval process, while groundedness and answer relevance are used to evaluate the generation process. 

The evaluation concepts outlined above are clear and effective, especially for qualitative, human-led evaluation. However, when automating the RAG evaluation process, the key questions become: how can we measure these aspects quantitatively, and which metrics correspond to the dimensions of the RAG Triad?

In the following section, we introduce four common metrics designed for this purpose: Context Precision and Context Recall for evaluating context relevance in the retrieval process, and Faithfulness and Answer Relevance for evaluating the generation process. 

It’s important to note that various libraries and frameworks, such as Ragas, RAGChecker, and DeepEval, provide detailed formulas and implementations of these metrics. I highly recommend exploring their documentation for more technical insights. In this post, however, our focus is on offering an intuitive, easy-to-understand guide to these metrics. 

Retrieval Evaluation Metrics 

  • Context Precision: This metric measures the proportion of retrieved documents that are relevant to the query. High context precision indicates that the system is retrieving accurate and pertinent information, which is crucial for generating reliable responses.  
  • Context Recall: Context recall evaluates the proportion of relevant documents that were successfully retrieved out of all possible relevant documents. A high context recall ensures that the system comprehensively captures all necessary information for generating a complete response. (A toy sketch of both retrieval metrics follows this list.)
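As a toy illustration, assuming binary relevance labels are already available (in practice, frameworks such as Ragas derive these judgments with an LLM judge), the two retrieval metrics reduce to simple ratios:

```python
# Toy sketch of the retrieval metrics from binary relevance labels.
# Real frameworks (e.g., Ragas) produce these labels with an LLM judge.

def context_precision(retrieved_is_relevant: list[bool]) -> float:
    """Fraction of retrieved chunks that are relevant to the query."""
    return sum(retrieved_is_relevant) / len(retrieved_is_relevant)

def context_recall(n_relevant_retrieved: int, n_relevant_total: int) -> float:
    """Fraction of all relevant chunks that were actually retrieved."""
    return n_relevant_retrieved / n_relevant_total

# 3 of 4 retrieved chunks are relevant; 3 of the 4 relevant chunks were found.
print(context_precision([True, True, False, True]))  # 0.75
print(context_recall(3, 4))                          # 0.75
```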

Generation Evaluation Metrics 

  • Faithfulness: Faithfulness assesses how well the generated response aligns with the retrieved context. It measures the proportion of claims in the response that are grounded in the retrieved information. A high faithfulness score ensures that the content is factually accurate based on the provided context. 
  • Answer Relevance: This metric evaluates how well the generated response addresses the user’s query and provides useful information. (A simple sketch of both generation metrics follows this list.)
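A rough sketch of both generation metrics, under two simplifying assumptions: the answer has already been split into claims with known support labels, and answer relevance is approximated by embedding similarity between query and answer. (Ragas, for instance, instead compares the query to questions regenerated from the answer.)

```python
# Toy sketch of the generation metrics. Claim extraction and support
# judgments are assumed to happen upstream (typically via an LLM judge).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def faithfulness(claim_is_supported: list[bool]) -> float:
    """Fraction of answer claims grounded in the retrieved context."""
    return sum(claim_is_supported) / len(claim_is_supported)

def answer_relevance(question: str, answer: str) -> float:
    """Approximate relevance as cosine similarity of question and answer."""
    q, a = model.encode([question, answer], normalize_embeddings=True)
    return float(q @ a)  # unit vectors, so dot product = cosine similarity
```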

Demonstration 

To demonstrate RAG evaluation, we set up a sample configuration for a RAG system using synthetic data. This setup is inspired by this great notebook (https://huggingface.co/learn/cookbook/en/rag_evaluation) from Hugging Face on preparing synthetic data for RAG systems, benchmarking, and evaluation. We then use the Ragas framework to implement the four metrics described above. Below is an example drawn from the results.

Question: What is the name of the convolutional neural architecture that incorporates residual connections and is used for adversarial training?

Answer: Inception-ResNet-v2 is the name of the convolutional neural architecture that incorporates residual connections. (Source: Document 0) 

Contexts:

**Inception-ResNet-v2** is a convolutional neural architecture that builds on the Inception family of architectures but incorporates [residual connections](https://paperswithcode.com/method/residual-connection) (replacing the filter concatenation stage of the Inception architecture). 

Ground Truth: Inception-ResNet-v2
Context Precision: 1
Context Recall: 1
Faithfulness: 1
Answer Relevance: 0.86
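For reference, a minimal sketch of computing these four scores with Ragas might look like the following. Exact imports and dataset column names vary across Ragas versions, and an OpenAI API key is assumed, since Ragas relies on an LLM judge by default.

```python
# Minimal Ragas evaluation sketch; column names follow the classic
# Ragas schema and may differ in newer releases.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

sample = Dataset.from_dict({
    "question": ["What is the name of the convolutional neural architecture "
                 "that incorporates residual connections and is used for "
                 "adversarial training?"],
    "answer": ["Inception-ResNet-v2 is the name of the convolutional neural "
               "architecture that incorporates residual connections."],
    "contexts": [["**Inception-ResNet-v2** is a convolutional neural "
                  "architecture that builds on the Inception family of "
                  "architectures but incorporates residual connections."]],
    "ground_truth": ["Inception-ResNet-v2"],
})

scores = evaluate(
    sample,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(scores)
```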

In this example, both the retrieval and generation components achieve high scores, giving us confidence in the accuracy and relevance of the generated answer. For further examples, please check out our resulting dataset at https://huggingface.co/datasets/tanquangduong/eval-rag-results-gpt4o 

 

Conclusion 

To wrap up, this article began by highlighting the importance of building a robust knowledge retrieval system for companies that rely on document databases to make informed decisions. We then introduced the core components of a Retrieval-Augmented Generation (RAG) system. Next, we explored various aspects of RAG evaluation, including the RAG Triad framework, different libraries and frameworks for evaluation, and the key metrics used for assessing a RAG system. Finally, we presented an example to demonstrate how to interpret RAG evaluation results. 

We hope this article serves as a quick guide to the key concepts and metrics needed to build and evaluate your RAG system effectively.
