Evaluating RAG
Introduction¶
Evaluation is a critical component in the development and optimization of Retrieval-Augmented Generation (RAG) systems. It involves assessing the performance, accuracy, and quality of various aspects of the RAG pipeline, from retrieval effectiveness to the relevance and faithfulness of generated responses.
Importance of Evaluation in RAG¶
Effective evaluation of RAG systems is essential because it:
- Helps identify strengths and weaknesses in the retrieval and generation processes.
- Guides improvements and optimizations across the RAG pipeline.
- Ensures the system meets quality standards and user expectations.
- Facilitates comparison between different RAG implementations or configurations.
- Helps detect issues such as hallucinations, biases, or irrelevant responses.
Key Evaluation Metrics¶
RAGAS Metrics¶
- Faithfulness: Measures how well the generated response aligns with the retrieved context.
- Answer Relevancy: Assesses the relevance of the response to the query.
- Context Recall: Evaluates how well the retrieved chunks cover the information needed to answer the query.
- Context Precision: Measures the proportion of relevant information in the retrieved chunks.
- Context Utilization: Assesses how effectively the generated response uses the provided context.
- Context Entity Recall: Evaluates the coverage of important entities from the context in the response.
- Noise Sensitivity: Measures the system's robustness to irrelevant or noisy information.
- Summarization Score: Assesses the quality of summarization in the response.
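To make these concrete, below is a minimal sketch of scoring a single hand-written sample with a few RAGAS metrics. The toy question, answer, and context are made up for illustration; it uses the same ragas `evaluate` API that the RAGAS section later in this notebook relies on, and assumes ragas is installed and an OpenAI key is configured, since the metrics call an LLM judge.
# A toy evaluation sample; the real run later uses the synthetic test set
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

toy_dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris is the capital of France."],
})

scores = evaluate(
    dataset=toy_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)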
DeepEval Metrics¶
- G-Eval: A general evaluation metric for text generation tasks.
- Summarization: Assesses the quality of text summarization.
- Answer Relevancy: Measures how well the response answers the query.
- Faithfulness: Evaluates the accuracy of the response with respect to the source information.
- Contextual Recall and Precision: Measures the effectiveness of context retrieval.
- Hallucination: Detects fabricated or inaccurate information in the response.
- Toxicity: Identifies harmful or offensive content in the response.
- Bias: Detects unfair prejudice or favoritism in the generated content.
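Similarly, here is a minimal sketch of scoring one hand-written test case with DeepEval's metric API directly. The example strings are made up; the DeepEval section later in this notebook uses the LlamaIndex integration wrappers instead, and both require `deepeval` to be installed plus an LLM judge such as gpt-4o-mini.
# Score one hypothetical test case with a single DeepEval metric
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How many encoder layers does the Transformer use?",
    actual_output="The Transformer stacks six identical encoder layers.",
    retrieval_context=["The encoder is composed of a stack of N = 6 identical layers."],
)

metric = AnswerRelevancyMetric(model="gpt-4o-mini", threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)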
TruLens Metrics¶
- Context Relevance: Assesses how well the retrieved context matches the query.
- Groundedness: Measures how well the response is supported by the retrieved information.
- Answer Relevance: Evaluates how well the response addresses the query.
- Comprehensiveness: Assesses the completeness of the response.
- Harmful/Toxic Language: Identifies potentially offensive or dangerous content.
- User Sentiment: Analyzes the emotional tone of user interactions.
- Language Mismatch: Detects inconsistencies in language use between query and response.
- Fairness and Bias: Evaluates the system for equitable treatment across different groups.
- Custom Feedback Functions: Allows for tailored evaluation metrics specific to use cases.
Best Practices for RAG Evaluation¶
- Comprehensive Evaluation: Use a combination of metrics to assess different aspects of the RAG system.
- Regular Benchmarking: Continuously evaluate the system as changes are made to the pipeline.
- Human-in-the-Loop: Incorporate human evaluation alongside automated metrics for a holistic assessment.
- Domain-Specific Metrics: Develop custom metrics relevant to your specific use case or domain.
- Error Analysis: Investigate patterns in low-scoring responses to identify areas for improvement.
- Comparative Evaluation: Benchmark your RAG system against baseline models and alternative implementations.
Conclusion¶
A robust evaluation framework is crucial for developing and maintaining high-quality RAG systems. By leveraging a diverse set of metrics and following best practices, developers can ensure their RAG systems deliver accurate, relevant, and trustworthy responses while continuously improving performance.
Setting up the Environment¶
!pip install llama-index
!pip install llama-index-vector-stores-qdrant
!pip install llama-index-readers-file
!pip install llama-index-embeddings-fastembed
!pip install llama-index-llms-openai
!pip install llama-index-llms-groq
!pip install -U qdrant_client fastembed
!pip install python-dotenv
# Standard library imports
import logging
import sys
import os
# Third-party imports
from dotenv import load_dotenv
from IPython.display import Markdown, display
# Qdrant client import
import qdrant_client
# LlamaIndex core imports
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
# LlamaIndex vector store import
from llama_index.vector_stores.qdrant import QdrantVectorStore
# Embedding model imports
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
# LLM import
from llama_index.llms.openai import OpenAI
# from llama_index.llms.groq import Groq
# Load environment variables
load_dotenv()
# Get OpenAI API key from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
# Setting up Base LLM
Settings.llm = OpenAI(
model="gpt-4o-mini", temperature=0.1, max_tokens=1024, streaming=True
)
# Settings.llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)
# Set the embedding model
# Option 1: FastEmbed with the BAAI/bge-base-en-v1.5 model (commented out)
# Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")
# Option 2: OpenAI's embedding model (used in this notebook)
Settings.embed_model = OpenAIEmbedding(embed_batch_size=10, api_key=OPENAI_API_KEY)
# Qdrant configuration (commented out)
# If you're using Qdrant, uncomment and set these variables:
# QDRANT_CLOUD_ENDPOINT = os.getenv("QDRANT_CLOUD_ENDPOINT")
# QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
# Note: Remember to add QDRANT_CLOUD_ENDPOINT and QDRANT_API_KEY to your .env file if using Qdrant Hosted version
Load the Data¶
# Let's load the documents using SimpleDirectoryReader
print("🔃 Loading Data")
from llama_index.core import Document
reader = SimpleDirectoryReader("../data/", recursive=True)
documents = reader.load_data(show_progress=True)
Setting up Vector Database¶
We will be using Qdrant as the vector database. There are four ways to initialize a Qdrant client:
- In-memory
client = qdrant_client.QdrantClient(location=":memory:")
- On-disk
client = qdrant_client.QdrantClient(path="./data")
- Self-hosted or Docker
client = qdrant_client.QdrantClient(
    # url="http://<host>:<port>"
    host="localhost", port=6333
)
- Qdrant Cloud
client = qdrant_client.QdrantClient(
    url=QDRANT_CLOUD_ENDPOINT,
    api_key=QDRANT_API_KEY,
)
For this notebook we will be using a locally hosted Qdrant instance running on localhost:6333; the in-memory, on-disk, and cloud options are left commented out in the cell below.
# Creating a Qdrant client instance
client = qdrant_client.QdrantClient(
    # You can use :memory: mode for fast, lightweight experiments;
    # it does not require Qdrant to be deployed anywhere,
    # but it requires qdrant-client >= 1.1.1
    # location=":memory:",
    # Or point at a Qdrant Cloud instance:
    # url=QDRANT_CLOUD_ENDPOINT,
    # api_key=QDRANT_API_KEY,
    # Or use on-disk storage:
    # path="./db/",
    # Here we connect to a locally hosted (Docker) instance:
    host="localhost",
    port=6333,
)
vector_store = QdrantVectorStore(client=client, collection_name="01_RAG_Evaluation")
Ingest Data into vector DB¶
## Ingesting data into the vector database
## Let's set up an ingestion pipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(
transformations=[
# MarkdownNodeParser(include_metadata=True),
# TokenTextSplitter(chunk_size=500, chunk_overlap=20),
SentenceSplitter(chunk_size=1024, chunk_overlap=20),
# SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95 , embed_model=Settings.embed_model),
Settings.embed_model,
],
vector_store=vector_store,
)
# Ingest directly into a vector db
nodes = pipeline.run(documents=documents, show_progress=True)
print("Number of chunks added to vector DB:", len(nodes))
Setting Up Index¶
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
Modifying Prompts and Prompt Tuning¶
from llama_index.core import ChatPromptTemplate
qa_prompt_str = (
"Context information is below.\n"
"---------------------\n"
"{context_str}\n"
"---------------------\n"
"Given the context information and not prior knowledge, "
"answer the question: {query_str}\n"
)
refine_prompt_str = (
"We have the opportunity to refine the original answer "
"(only if needed) with some more context below.\n"
"------------\n"
"{context_msg}\n"
"------------\n"
"Given the new context, refine the original answer to better "
"answer the question: {query_str}. "
"If the context isn't useful, output the original answer again.\n"
"Original Answer: {existing_answer}"
)
# Text QA Prompt
chat_text_qa_msgs = [
("system","You are a AI assistant who is well versed with answering questions from the provided context"),
("user", qa_prompt_str),
]
text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)
# Refine Prompt
chat_refine_msgs = [
("system","Always answer the question, even if the context isn't helpful.",),
("user", refine_prompt_str),
]
refine_template = ChatPromptTemplate.from_messages(chat_refine_msgs)
Examples of Retrievers¶
- Query Engine
- Chat Engine
# Setting up Query Engine
BASE_RAG_QUERY_ENGINE = index.as_query_engine(
    similarity_top_k=5,
    text_qa_template=text_qa_template,
    refine_template=refine_template,
)
response = BASE_RAG_QUERY_ENGINE.query("How many encoders are stacked in the encoder?")
display(Markdown(str(response)))
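To confirm the custom templates are actually attached, you can inspect the prompts registered on the query engine. This is just a quick check; the exact prompt keys come from LlamaIndex's response synthesizer and may differ across versions.
# List the prompt templates the query engine will use
prompts_dict = BASE_RAG_QUERY_ENGINE.get_prompts()
for key in prompts_dict:
    print(key)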
# Setting up Chat Engine
BASE_RAG_CHAT_ENGINE = index.as_chat_engine()
response = BASE_RAG_CHAT_ENGINE.chat("How many encoders are stacked in the encoder?")
display(Markdown(str(response)))
Setup Observability¶
!pip install arize-phoenix
!pip install openinference-instrumentation-llama-index
!pip install openinference-instrumentation-langchain
!pip install -U llama-index-callbacks-arize-phoenix
import phoenix as px
# (session := px.launch_app()).view()
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import SpanLimits, TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = TracerProvider(span_limits=SpanLimits(max_attributes=100_000))
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))
LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
# LangChainInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
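The span exporter above points at a Phoenix collector on 127.0.0.1:6006. If you don't already have Phoenix running there (for example via Docker), you can launch an in-process instance first, as the commented-out line above suggests.
# Launch a local Phoenix instance and print the UI URL
session = px.launch_app()
print(session.url)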
Generating Test Dataset¶
Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%.
!pip install ragas
import os
import pandas as pd
from phoenix.trace import using_project
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator
TEST_SIZE = 5
CACHE_FILE = "eval_testset.csv"
def generate_and_save_testset():
    # Generator with OpenAI models
    generator = TestsetGenerator.with_openai()

    # Set question type distribution
    distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

    # Generate testset
    with using_project("ragas-testset"):
        testset = generator.generate_with_llamaindex_docs(
            documents, test_size=TEST_SIZE, distributions=distribution
        )

    test_df = (
        testset.to_pandas()
        .sort_values("question")
        .drop_duplicates(subset=["question"], keep="first")
    )

    # Save the dataset locally
    test_df.to_csv(CACHE_FILE, index=False)
    print(f"Test dataset saved to {CACHE_FILE}")
    return test_df

def load_or_generate_testset():
    if os.path.exists(CACHE_FILE):
        print(f"Loading existing test dataset from {CACHE_FILE}")
        test_df = pd.read_csv(CACHE_FILE)
    else:
        print("Generating new test dataset...")
        test_df = generate_and_save_testset()
    return test_df
# Main execution
test_df = load_or_generate_testset()
print(test_df.head(5))
You are free to change the question type distribution according to your needs. Since we now have our test dataset ready, let’s move on and build a simple RAG pipeline using LlamaIndex.
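For example, to skew the synthetic questions toward harder, multi-hop cases, you could weight the reasoning and multi-context evolutions more heavily. This is just a sketch; the weights need to sum to 1.0, and the commented-out call mirrors the generation step inside generate_and_save_testset above.
# A heavier-reasoning question mix (weights must sum to 1.0)
harder_distribution = {simple: 0.2, reasoning: 0.4, multi_context: 0.4}
# testset = generator.generate_with_llamaindex_docs(
#     documents, test_size=TEST_SIZE, distributions=harder_distribution
# )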
RAGAS¶
from phoenix.trace.dsl.helpers import SpanQuery
client = px.Client()

# NOTE: this only returns rows if document ingestion was traced under a project
# named "indexing"; in this notebook the ingestion ran before instrumentation
# was set up, so the frame may be empty.
corpus_df = client.query_spans(
    SpanQuery().explode(
        "embedding.embeddings",
        text="embedding.text",
        vector="embedding.vector",
    ),
    project_name="indexing",
)
corpus_df.head(2)
import pandas as pd
from datasets import Dataset
from phoenix.trace import using_project
from tqdm.auto import tqdm
def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }

def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].values
    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds
with using_project("llama-index"):
ragas_eval_dataset = generate_ragas_dataset(BASE_RAG_QUERY_ENGINE, test_df)
ragas_evals_df = pd.DataFrame(ragas_eval_dataset)
ragas_evals_df.head(2)
from phoenix.trace.dsl.helpers import SpanQuery
# dataset containing embeddings for visualization
query_embeddings_df = px.Client().query_spans(
SpanQuery().explode("embedding.embeddings", text="embedding.text", vector="embedding.vector"),
project_name="llama-index",
)
query_embeddings_df.head(2)
from phoenix.session.evaluation import get_qa_with_reference
# dataset containing span data for evaluation with Ragas
spans_dataframe = get_qa_with_reference(client, project_name="llama-index")
spans_dataframe.head(2)
Ragas uses LangChain to run its evaluations. If you enable the LangChain instrumentation above (the commented-out LangChainInstrumentor line), you can see what Ragas is doing under the hood when it evaluates your LLM application data.
from phoenix.trace import using_project
from ragas import evaluate
from ragas.metrics import (
answer_correctness,
context_precision,
context_recall,
faithfulness,
)
# Log the traces to the project "ragas-evals" just to view
# how Ragas works under the hood
with using_project("ragas-evals"):
evaluation_result = evaluate(
dataset=ragas_eval_dataset,
metrics=[faithfulness, answer_correctness, context_recall, context_precision],
)
eval_scores_df = pd.DataFrame(evaluation_result.scores)
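Before attaching the scores to spans, a quick aggregate view helps: averaging each metric over the test questions with plain pandas on the scores dataframe built above.
# Per-metric mean score across the synthetic test set
print(eval_scores_df.mean(numeric_only=True))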
# Assign span ids to your ragas evaluation scores (needed so Phoenix knows where to attach the spans).
span_questions = (
spans_dataframe[["input"]]
.sort_values("input")
.drop_duplicates(subset=["input"], keep="first")
.reset_index()
.rename({"input": "question"}, axis=1)
)
ragas_evals_df = ragas_evals_df.merge(span_questions, on="question").set_index("context.span_id")
test_df = test_df.merge(span_questions, on="question").set_index("context.span_id")
eval_data_df = pd.DataFrame(evaluation_result.dataset)
eval_data_df = eval_data_df.merge(span_questions, on="question").set_index("context.span_id")
eval_scores_df.index = eval_data_df.index
query_embeddings_df = (
query_embeddings_df.sort_values("text")
.drop_duplicates(subset=["text"])
.rename({"text": "question"}, axis=1)
.merge(span_questions, on="question")
.set_index("context.span_id")
)
from phoenix.trace import SpanEvaluations
# Log the evaluations to Phoenix under the project "llama-index"
# This will allow you to visualize the scores alongside the spans in the UI
for eval_name in eval_scores_df.columns:
    evals_df = eval_scores_df[[eval_name]].rename(columns={eval_name: "score"})
    evals = SpanEvaluations(eval_name, evals_df)
    px.Client().log_evaluations(evals)
Deep Eval¶
!pip install deepeval
from deepeval.integrations.llama_index import (
DeepEvalAnswerRelevancyEvaluator,
DeepEvalFaithfulnessEvaluator,
DeepEvalContextualRelevancyEvaluator,
DeepEvalSummarizationEvaluator,
DeepEvalBiasEvaluator,
DeepEvalToxicityEvaluator,
)
# Example inputs to your RAG application
test_questions = test_df["question"].values

# Create a list of all evaluators (instantiated once, reused for every question)
evaluators = [
    DeepEvalAnswerRelevancyEvaluator(model="gpt-4o-mini"),
    DeepEvalFaithfulnessEvaluator(model="gpt-4o-mini"),
    DeepEvalContextualRelevancyEvaluator(model="gpt-4o-mini"),
    DeepEvalSummarizationEvaluator(model="gpt-4o-mini"),
    DeepEvalBiasEvaluator(model="gpt-4o-mini"),
    DeepEvalToxicityEvaluator(model="gpt-4o-mini"),
]

for q in tqdm(test_questions):
    # LlamaIndex returns a response object that contains
    # both the output string and the retrieved nodes
    response_object = BASE_RAG_QUERY_ENGINE.query(q)

    # Evaluate the response using all evaluators
    for evaluator in evaluators:
        evaluation_result = evaluator.evaluate_response(
            query=q, response=response_object
        )
        print(f"{evaluator.__class__.__name__} Result:")
        print(evaluation_result)
        print("\n" + "=" * 50 + "\n")