Evaluating RAG
Introduction¶
Evaluation is a critical component in the development and optimization of Retrieval-Augmented Generation (RAG) systems. It involves assessing the performance, accuracy, and quality of various aspects of the RAG pipeline, from retrieval effectiveness to the relevance and faithfulness of generated responses.
Importance of Evaluation in RAG¶
Effective evaluation of RAG systems is essential because it:
- Helps identify strengths and weaknesses in the retrieval and generation processes.
- Guides improvements and optimizations across the RAG pipeline.
- Ensures the system meets quality standards and user expectations.
- Facilitates comparison between different RAG implementations or configurations.
- Helps detect issues such as hallucinations, biases, or irrelevant responses.
Key Evaluation Metrics¶
RAGAS Metrics¶
- Faithfulness: Measures how well the generated response aligns with the retrieved context.
- Answer Relevancy: Assesses the relevance of the response to the query.
- Context Recall: Evaluates how well the retrieved chunks cover the information needed to answer the query.
- Context Precision: Measures the proportion of relevant information in the retrieved chunks.
- Context Utilization: Assesses how effectively the generated response uses the provided context.
- Context Entity Recall: Evaluates the coverage of important entities from the context in the response.
- Noise Sensitivity: Measures the system's robustness to irrelevant or noisy information.
- Summarization Score: Assesses the quality of summarization in the response.
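To make these concrete, below is a minimal sketch of scoring a single hand-written sample with a few RAGAS metrics. The toy question, answer, and context are made up for illustration; it uses the same ragas `evaluate` API that the RAGAS section later in this notebook relies on, and assumes ragas is installed and an OpenAI key is configured, since the metrics call an LLM judge.
# A toy evaluation sample; the real run later uses the synthetic test set
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

toy_dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris is the capital of France."],
})

scores = evaluate(
    dataset=toy_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)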
DeepEval Metrics¶
- G-Eval: A general evaluation metric for text generation tasks.
- Summarization: Assesses the quality of text summarization.
- Answer Relevancy: Measures how well the response answers the query.
- Faithfulness: Evaluates the accuracy of the response with respect to the source information.
- Contextual Recall and Precision: Measures the effectiveness of context retrieval.
- Hallucination: Detects fabricated or inaccurate information in the response.
- Toxicity: Identifies harmful or offensive content in the response.
- Bias: Detects unfair prejudice or favoritism in the generated content.
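Similarly, here is a minimal sketch of scoring one hand-written test case with DeepEval's metric API directly. The example strings are made up; the DeepEval section later in this notebook uses the LlamaIndex integration wrappers instead, and both require `deepeval` to be installed plus an LLM judge such as gpt-4o-mini.
# Score one hypothetical test case with a single DeepEval metric
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How many encoder layers does the Transformer use?",
    actual_output="The Transformer stacks six identical encoder layers.",
    retrieval_context=["The encoder is composed of a stack of N = 6 identical layers."],
)

metric = AnswerRelevancyMetric(model="gpt-4o-mini", threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)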
TruLens Metrics¶
- Context Relevance: Assesses how well the retrieved context matches the query.
- Groundedness: Measures how well the response is supported by the retrieved information.
- Answer Relevance: Evaluates how well the response addresses the query.
- Comprehensiveness: Assesses the completeness of the response.
- Harmful/Toxic Language: Identifies potentially offensive or dangerous content.
- User Sentiment: Analyzes the emotional tone of user interactions.
- Language Mismatch: Detects inconsistencies in language use between query and response.
- Fairness and Bias: Evaluates the system for equitable treatment across different groups.
- Custom Feedback Functions: Allows for tailored evaluation metrics specific to use cases.
Best Practices for RAG Evaluation¶
- Comprehensive Evaluation: Use a combination of metrics to assess different aspects of the RAG system.
- Regular Benchmarking: Continuously evaluate the system as changes are made to the pipeline.
- Human-in-the-Loop: Incorporate human evaluation alongside automated metrics for a holistic assessment.
- Domain-Specific Metrics: Develop custom metrics relevant to your specific use case or domain.
- Error Analysis: Investigate patterns in low-scoring responses to identify areas for improvement.
- Comparative Evaluation: Benchmark your RAG system against baseline models and alternative implementations.
Conclusion¶
A robust evaluation framework is crucial for developing and maintaining high-quality RAG systems. By leveraging a diverse set of metrics and following best practices, developers can ensure their RAG systems deliver accurate, relevant, and trustworthy responses while continuously improving performance.
Setting up the Environment¶
!pip install llama-index
!pip install llama-index-vector-stores-qdrant
!pip install llama-index-readers-file
!pip install llama-index-embeddings-fastembed
!pip install llama-index-llms-openai
!pip install llama-index-llms-groq
!pip install -U qdrant_client fastembed
!pip install python-dotenv
# Standard library imports
import logging
import sys
import os
# Third-party imports
from dotenv import load_dotenv
from IPython.display import Markdown, display
# Qdrant client import
import qdrant_client
# LlamaIndex core imports
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
# LlamaIndex vector store import
from llama_index.vector_stores.qdrant import QdrantVectorStore
# Embedding model imports
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
# LLM import
from llama_index.llms.openai import OpenAI
# from llama_index.llms.groq import Groq
# Load environment variables
load_dotenv()
# Get OpenAI API key from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
# Setting up Base LLM
Settings.llm = OpenAI(
model="gpt-4o-mini", temperature=0.1, max_tokens=1024, streaming=True
)
# Settings.llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)
# Set the embedding model
# Option 1: FastEmbed with the BAAI/bge-base-en-v1.5 model (commented out)
# Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")
# Option 2: OpenAI's embedding model (used in this notebook)
Settings.embed_model = OpenAIEmbedding(embed_batch_size=10, api_key=OPENAI_API_KEY)
# Qdrant configuration (commented out)
# If you're using Qdrant, uncomment and set these variables:
# QDRANT_CLOUD_ENDPOINT = os.getenv("QDRANT_CLOUD_ENDPOINT")
# QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
# Note: Remember to add QDRANT_CLOUD_ENDPOINT and QDRANT_API_KEY to your .env file if using Qdrant Hosted version
Load the Data¶
# Let's load the documents using SimpleDirectoryReader
print("🔃 Loading Data")
from llama_index.core import Document
reader = SimpleDirectoryReader("../data/", recursive=True)
documents = reader.load_data(show_progress=True)
Setting up Vector Database¶
We will be using Qdrant as the vector database. There are four ways to initialize a Qdrant client:
- In-memory
client = qdrant_client.QdrantClient(location=":memory:")
- On-disk
client = qdrant_client.QdrantClient(path="./data")
- Self-hosted or Docker
client = qdrant_client.QdrantClient(
    # url="http://<host>:<port>"
    host="localhost", port=6333
)
- Qdrant Cloud
client = qdrant_client.QdrantClient(
    url=QDRANT_CLOUD_ENDPOINT,
    api_key=QDRANT_API_KEY,
)
For this notebook we will be using a locally hosted Qdrant instance running on localhost:6333; the in-memory, on-disk, and cloud options are left commented out in the cell below.
# Creating a Qdrant client instance
client = qdrant_client.QdrantClient(
    # You can use :memory: mode for fast, lightweight experiments;
    # it does not require Qdrant to be deployed anywhere,
    # but it requires qdrant-client >= 1.1.1
    # location=":memory:",
    # Or point at a Qdrant Cloud instance:
    # url=QDRANT_CLOUD_ENDPOINT,
    # api_key=QDRANT_API_KEY,
    # Or use on-disk storage:
    # path="./db/",
    # Here we connect to a locally hosted (Docker) instance:
    host="localhost",
    port=6333,
)
vector_store = QdrantVectorStore(client=client, collection_name="01_RAG_Evaluation")
Ingest Data into vector DB¶
## Ingesting data into the vector database
## Let's set up an ingestion pipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(
transformations=[
# MarkdownNodeParser(include_metadata=True),
# TokenTextSplitter(chunk_size=500, chunk_overlap=20),
SentenceSplitter(chunk_size=1024, chunk_overlap=20),
# SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95 , embed_model=Settings.embed_model),
Settings.embed_model,
],
vector_store=vector_store,
)
# Ingest directly into a vector db
nodes = pipeline.run(documents=documents, show_progress=True)
print("Number of chunks added to vector DB:", len(nodes))
Setting Up Index¶
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
Modifying Prompts and Prompt Tuning¶
from llama_index.core import ChatPromptTemplate
qa_prompt_str = (
"Context information is below.\n"
"---------------------\n"
"{context_str}\n"
"---------------------\n"
"Given the context information and not prior knowledge, "
"answer the question: {query_str}\n"
)
refine_prompt_str = (
"We have the opportunity to refine the original answer "
"(only if needed) with some more context below.\n"
"------------\n"
"{context_msg}\n"
"------------\n"
"Given the new context, refine the original answer to better "
"answer the question: {query_str}. "
"If the context isn't useful, output the original answer again.\n"
"Original Answer: {existing_answer}"
)
# Text QA Prompt
chat_text_qa_msgs = [
("system","You are a AI assistant who is well versed with answering questions from the provided context"),
("user", qa_prompt_str),
]
text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)
# Refine Prompt
chat_refine_msgs = [
("system","Always answer the question, even if the context isn't helpful.",),
("user", refine_prompt_str),
]
refine_template = ChatPromptTemplate.from_messages(chat_refine_msgs)
Examples of Retrievers¶
- Query Engine
- Chat Engine
# Setting up Query Engine
BASE_RAG_QUERY_ENGINE = index.as_query_engine(
    similarity_top_k=5,
    text_qa_template=text_qa_template,
    refine_template=refine_template,
)
response = BASE_RAG_QUERY_ENGINE.query("How many encoders are stacked in the encoder?")
display(Markdown(str(response)))
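To confirm the custom templates are actually attached, you can inspect the prompts registered on the query engine. This is just a quick check; the exact prompt keys come from LlamaIndex's response synthesizer and may differ across versions.
# List the prompt templates the query engine will use
prompts_dict = BASE_RAG_QUERY_ENGINE.get_prompts()
for key in prompts_dict:
    print(key)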
# Setting up Chat Engine
BASE_RAG_CHAT_ENGINE = index.as_chat_engine()
response = BASE_RAG_CHAT_ENGINE.chat("How many encoders are stacked in the encoder?")
display(Markdown(str(response)))
Setup Observability¶
!pip install arize-phoenix
!pip install openinference-instrumentation-llama-index
!pip install openinference-instrumentation-langchain
!pip install -U llama-index-callbacks-arize-phoenix
import phoenix as px
# (session := px.launch_app()).view()
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import SpanLimits, TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
endpoint = "http://127.0.0.1:6006/v1/traces"
tracer_provider = TracerProvider(span_limits=SpanLimits(max_attributes=100_000))
tracer_provider.add_span_processor(SimpleSpanProcessor(OTLPSpanExporter(endpoint)))
LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
# LangChainInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)
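The span exporter above points at a Phoenix collector on 127.0.0.1:6006. If you don't already have Phoenix running there (for example via Docker), you can launch an in-process instance first, as the commented-out line above suggests.
# Launch a local Phoenix instance and print the UI URL
session = px.launch_app()
print(session.url)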
Generating Test Dataset¶
Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%.
!pip install ragas
import os
import pandas as pd
from phoenix.trace import using_project
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator
TEST_SIZE = 5
CACHE_FILE = "eval_testset.csv"
def generate_and_save_testset():
    # Generator with OpenAI models
    generator = TestsetGenerator.with_openai()

    # Set question type distribution
    distribution = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

    # Generate testset
    with using_project("ragas-testset"):
        testset = generator.generate_with_llamaindex_docs(
            documents, test_size=TEST_SIZE, distributions=distribution
        )

    test_df = (
        testset.to_pandas()
        .sort_values("question")
        .drop_duplicates(subset=["question"], keep="first")
    )

    # Save the dataset locally
    test_df.to_csv(CACHE_FILE, index=False)
    print(f"Test dataset saved to {CACHE_FILE}")
    return test_df

def load_or_generate_testset():
    if os.path.exists(CACHE_FILE):
        print(f"Loading existing test dataset from {CACHE_FILE}")
        test_df = pd.read_csv(CACHE_FILE)
    else:
        print("Generating new test dataset...")
        test_df = generate_and_save_testset()
    return test_df
# Main execution
test_df = load_or_generate_testset()
print(test_df.head(5))
You are free to change the question type distribution according to your needs. Since we now have our test dataset ready, let’s move on and build a simple RAG pipeline using LlamaIndex.
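For example, to skew the synthetic questions toward harder, multi-hop cases, you could weight the reasoning and multi-context evolutions more heavily. This is just a sketch; the weights need to sum to 1.0, and the commented-out call mirrors the generation step inside generate_and_save_testset above.
# A heavier-reasoning question mix (weights must sum to 1.0)
harder_distribution = {simple: 0.2, reasoning: 0.4, multi_context: 0.4}
# testset = generator.generate_with_llamaindex_docs(
#     documents, test_size=TEST_SIZE, distributions=harder_distribution
# )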
RAGAS¶
from phoenix.trace.dsl.helpers import SpanQuery
client = px.Client()

# NOTE: this only returns rows if document ingestion was traced under a project
# named "indexing"; in this notebook the ingestion ran before instrumentation
# was set up, so the frame may be empty.
corpus_df = client.query_spans(
    SpanQuery().explode(
        "embedding.embeddings",
        text="embedding.text",
        vector="embedding.vector",
    ),
    project_name="indexing",
)
corpus_df.head(2)
import pandas as pd
from datasets import Dataset
from phoenix.trace import using_project
from tqdm.auto import tqdm
def generate_response(query_engine, question):
    response = query_engine.query(question)
    return {
        "answer": response.response,
        "contexts": [c.node.get_content() for c in response.source_nodes],
    }

def generate_ragas_dataset(query_engine, test_df):
    test_questions = test_df["question"].values
    responses = [generate_response(query_engine, q) for q in tqdm(test_questions)]

    dataset_dict = {
        "question": test_questions,
        "answer": [response["answer"] for response in responses],
        "contexts": [response["contexts"] for response in responses],
        "ground_truth": test_df["ground_truth"].values.tolist(),
    }
    ds = Dataset.from_dict(dataset_dict)
    return ds
with using_project("llama-index"):
ragas_eval_dataset = generate_ragas_dataset(BASE_RAG_QUERY_ENGINE, test_df)
ragas_evals_df = pd.DataFrame(ragas_eval_dataset)
ragas_evals_df.head(2)
from phoenix.trace.dsl.helpers import SpanQuery
# dataset containing embeddings for visualization
query_embeddings_df = px.Client().query_spans(
SpanQuery().explode("embedding.embeddings", text="embedding.text", vector="embedding.vector"),
project_name="llama-index",
)
query_embeddings_df.head(2)
from phoenix.session.evaluation import get_qa_with_reference
# dataset containing span data for evaluation with Ragas
spans_dataframe = get_qa_with_reference(client, project_name="llama-index")
spans_dataframe.head(2)
Ragas uses LangChain to run its evaluations. If you enable the LangChain instrumentation above (the commented-out LangChainInstrumentor line), you can see what Ragas is doing under the hood when it evaluates your LLM application data.
from phoenix.trace import using_project
from ragas import evaluate
from ragas.metrics import (
answer_correctness,
context_precision,
context_recall,
faithfulness,
)
# Log the traces to the project "ragas-evals" just to view
# how Ragas works under the hood
with using_project("ragas-evals"):
evaluation_result = evaluate(
dataset=ragas_eval_dataset,
metrics=[faithfulness, answer_correctness, context_recall, context_precision],
)
eval_scores_df = pd.DataFrame(evaluation_result.scores)
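Before attaching the scores to spans, a quick aggregate view helps: averaging each metric over the test questions with plain pandas on the scores dataframe built above.
# Per-metric mean score across the synthetic test set
print(eval_scores_df.mean(numeric_only=True))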
# Assign span ids to your ragas evaluation scores (needed so Phoenix knows where to attach the spans).
span_questions = (
spans_dataframe[["input"]]
.sort_values("input")
.drop_duplicates(subset=["input"], keep="first")
.reset_index()
.rename({"input": "question"}, axis=1)
)
ragas_evals_df = ragas_evals_df.merge(span_questions, on="question").set_index("context.span_id")
test_df = test_df.merge(span_questions, on="question").set_index("context.span_id")
eval_data_df = pd.DataFrame(evaluation_result.dataset)
eval_data_df = eval_data_df.merge(span_questions, on="question").set_index("context.span_id")
eval_scores_df.index = eval_data_df.index
query_embeddings_df = (
query_embeddings_df.sort_values("text")
.drop_duplicates(subset=["text"])
.rename({"text": "question"}, axis=1)
.merge(span_questions, on="question")
.set_index("context.span_id")
)
from phoenix.trace import SpanEvaluations
# Log the evaluations to Phoenix under the project "llama-index"
# This will allow you to visualize the scores alongside the spans in the UI
for eval_name in eval_scores_df.columns:
    evals_df = eval_scores_df[[eval_name]].rename(columns={eval_name: "score"})
    evals = SpanEvaluations(eval_name, evals_df)
    px.Client().log_evaluations(evals)
Deep Eval¶
!pip install deepeval
from deepeval.integrations.llama_index import (
DeepEvalAnswerRelevancyEvaluator,
DeepEvalFaithfulnessEvaluator,
DeepEvalContextualRelevancyEvaluator,
DeepEvalSummarizationEvaluator,
DeepEvalBiasEvaluator,
DeepEvalToxicityEvaluator,
)
# Example inputs to your RAG application
test_questions = test_df["question"].values

# Create a list of all evaluators (instantiated once, reused for every question)
evaluators = [
    DeepEvalAnswerRelevancyEvaluator(model="gpt-4o-mini"),
    DeepEvalFaithfulnessEvaluator(model="gpt-4o-mini"),
    DeepEvalContextualRelevancyEvaluator(model="gpt-4o-mini"),
    DeepEvalSummarizationEvaluator(model="gpt-4o-mini"),
    DeepEvalBiasEvaluator(model="gpt-4o-mini"),
    DeepEvalToxicityEvaluator(model="gpt-4o-mini"),
]

for q in tqdm(test_questions):
    # LlamaIndex returns a response object that contains
    # both the output string and the retrieved nodes
    response_object = BASE_RAG_QUERY_ENGINE.query(q)

    # Evaluate the response using all evaluators
    for evaluator in evaluators:
        evaluation_result = evaluator.evaluate_response(
            query=q, response=response_object
        )
        print(f"{evaluator.__class__.__name__} Result:")
        print(evaluation_result)
        print("\n" + "=" * 50 + "\n")