Basic RAG from Scratch¶

This notebook implements a basic Retrieval-Augmented Generation (RAG) system from scratch, using only a handful of low-level building blocks (a PDF parser, NumPy, and an LLM/embedding API client) rather than a dedicated RAG framework. The focus is on demonstrating the core concepts of RAG with straightforward Python code.
Core Steps:
- Data Extraction: Read and extract text from a PDF file.
- Chunking: Split the text into smaller, overlapping chunks.
- Embedding Creation: Convert each chunk into a numerical vector using an embedding model.
- Semantic Search: Retrieve the chunks most relevant to a query using cosine similarity.
- Response Generation: Pass the retrieved chunks and the query to an LLM to produce a grounded answer.
- Evaluation: Compare the generated response against a reference answer from a validation set.
Setting Up the Environment¶
We begin by importing the necessary libraries: PyMuPDF (fitz) for PDF parsing, NumPy for vector math, json and os from the standard library, and litellm for the embedding and chat-completion API calls.
import fitz
import numpy as np
import json
import os
from litellm import completion, embedding
# The plain OpenAI client can also be used:
# from openai import OpenAI
# # Initialize the OpenAI client
# client = OpenAI(
#     api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
# )
# We use litellm here because it lets us easily switch between different LLM providers
# while exposing the same OpenAI-compatible API.
# Configure API keys (replace with your actual keys)
os.environ['OPENAI_API_KEY'] = "" # Replace with your OpenAI API key
os.environ['ANTHROPIC_API_KEY'] = "" # Replace with your Anthropic API key
os.environ['GROQ_API_KEY'] = "" # Replace with your Groq API key
Extracting Text from a PDF¶
We extract text from a PDF file using PyMuPDF. This process involves opening the PDF, reading its contents, and converting them into a format suitable for further processing.
def extract_text_from_pdf(pdf_path):
    """
    Extracts and consolidates text from all pages of a PDF file. This is the first step in the RAG pipeline,
    where we acquire the raw textual data that will later be processed, embedded, and retrieved against.

    Args:
        pdf_path (str): Path to the PDF file to be processed.

    Returns:
        str: Complete extracted text from all pages of the PDF, concatenated into a single string.
             This raw text will be further processed in subsequent steps of the RAG pipeline.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text
Chunking the Extracted Text¶
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.
def chunk_text(text, n, overlap):
    """
    Divides text into smaller, overlapping chunks for more effective processing and retrieval.

    Chunking is a critical step in RAG systems as it:
    1. Makes large documents manageable for embedding models that have token limits
    2. Enables more precise retrieval of relevant information
    3. Allows for contextual understanding within reasonable boundaries

    The overlap between chunks helps maintain context continuity and reduces the risk of
    splitting important information across chunk boundaries.

    Args:
        text (str): The complete text to be chunked.
        n (int): The maximum number of characters in each chunk.
        overlap (int): The number of overlapping characters between consecutive chunks.
                       Higher overlap improves context preservation but increases redundancy.

    Returns:
        List[str]: A list of text chunks that will be individually embedded and used for retrieval.
    """
    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
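To make the overlap behavior concrete, here is a quick sanity check on a toy string (the chunk size and overlap here are purely illustrative):
# Toy example: chunks of 10 characters with an overlap of 4, so each chunk starts 6 characters after the previous one
sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(sample, 10, 4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz', 'yz']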
Extracting and Chunking Text from a PDF File¶
Now, we load the PDF, extract text, and split it into chunks.
# Define the path to the PDF file
pdf_path = "data/AI_Information.pdf"
# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)
# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)
# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))
# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])
Creating Embeddings for Text Chunks¶
Embeddings transform text into numerical vectors, which allow for efficient similarity search.
def create_embeddings(text, model="text-embedding-ada-002"):
    """
    Transforms text into dense vector representations (embeddings) using a neural network model.

    Embeddings are the cornerstone of modern RAG systems because they:
    1. Capture semantic meaning in a numerical format that computers can process
    2. Enable similarity-based retrieval beyond simple keyword matching
    3. Allow for efficient indexing and searching of large document collections

    In RAG, both document chunks and user queries are embedded in the same vector space,
    allowing us to find the most semantically relevant chunks for a given query.

    Args:
        text (str or List[str]): The input text(s) to be embedded. Can be a single string or a list of strings.
        model (str): The embedding model to use. Default is OpenAI's "text-embedding-ada-002".
                     Different models offer various tradeoffs between quality, speed, and cost.

    Returns:
        dict: The response from the API containing the embeddings, which are high-dimensional
              vectors representing the semantic content of the input text(s).
    """
    # Create embeddings for the input text using the specified model
    response = embedding(model=model, input=text)
    return response  # Return the response containing the embeddings
# Create embeddings for the text chunks
response = create_embeddings(text_chunks)
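To verify that the call succeeded, we can inspect the response. The snippet below assumes the OpenAI-compatible response shape returned by litellm (the same .data[...].embedding access pattern used in semantic_search further down); text-embedding-ada-002 produces 1536-dimensional vectors.
# Inspect the embedding response (assumes an OpenAI-compatible response object)
print("Number of embeddings:", len(response.data))
print("Embedding dimension:", len(response.data[0].embedding))  # 1536 for text-embedding-ada-002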
Performing Semantic Search¶
We implement cosine similarity to find the most relevant text chunks for a user query.
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, which measures the cosine of the angle between them.

    Cosine similarity is particularly well-suited for RAG systems because:
    1. It measures semantic similarity independent of vector magnitude (document length)
    2. It ranges from -1 (completely opposite) to 1 (exactly the same), making it easy to interpret
    3. It works well in high-dimensional spaces like those used for text embeddings
    4. It's computationally efficient compared to some other similarity metrics

    Args:
        vec1 (np.ndarray): The first embedding vector.
        vec2 (np.ndarray): The second embedding vector.

    Returns:
        float: The cosine similarity score between the two vectors, ranging from -1 to 1.
               Higher values indicate greater semantic similarity between the original texts.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search to find the most relevant text chunks for a given query.

    This is the core retrieval component of the RAG system, responsible for:
    1. Finding the most semantically relevant information from the knowledge base
    2. Filtering out irrelevant content to improve generation quality
    3. Providing the context that will be used by the LLM for response generation

    The quality of retrieval directly impacts the quality of the final generated response,
    as the LLM can only work with the context it's provided.

    Args:
        query (str): The user's question or query text.
        text_chunks (List[str]): The corpus of text chunks to search through.
        embeddings (List[dict]): Pre-computed embeddings for each text chunk.
        k (int): The number of top relevant chunks to retrieve. This parameter balances:
                 - Too low: May miss relevant information
                 - Too high: May include irrelevant information and exceed context limits

    Returns:
        List[str]: The top k most semantically relevant text chunks for the query,
                   which will be used as context for the LLM to generate a response.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query).data[0].embedding

    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]

    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]
Running a Query on Extracted Chunks¶
# Load the validation data from a JSON file
with open('data/val.json') as f:
    data = json.load(f)
# Extract the first query from the validation data
query = data[0]['question']
# Perform semantic search to find the top 2 most relevant text chunks for the query
top_chunks = semantic_search(query, text_chunks, response.data, k=2)
# Print the query
print("Query:", query)
# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")
Generating a Response Based on Retrieved Chunks¶
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"
def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    """
    Generates a contextually informed response using an LLM with the retrieved information.

    This is the 'Generation' part of Retrieval-Augmented Generation, where:
    1. The retrieved context is combined with the user query
    2. The LLM synthesizes this information to produce a coherent, accurate response
    3. The system prompt guides the model to stay faithful to the provided context

    By using retrieved information as context, the RAG system can:
    - Provide up-to-date information beyond the LLM's training data
    - Cite specific sources for its claims
    - Reduce hallucination by grounding responses in retrieved facts
    - Answer domain-specific questions with greater accuracy

    Args:
        system_prompt (str): Instructions that guide the AI's behavior and response style.
                             In RAG, this typically instructs the model to use only the provided context.
        user_message (str): The combined context and query to be sent to the LLM.
                            This includes both the retrieved text chunks and the original user question.
        model (str): The LLM to use for response generation. Default is "meta-llama/Llama-3.2-3B-Instruct".

    Returns:
        dict: The complete response from the LLM, containing the generated answer based on
              the retrieved context and original query.
    """
    response = completion(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )

    # Equivalent call using the plain OpenAI client:
    # response = client.chat.completions.create(
    #     model=model,
    #     temperature=0,
    #     messages=[
    #         {"role": "system", "content": system_prompt},
    #         {"role": "user", "content": user_message}
    #     ]
    # )

    return response
# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"
# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
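Finally, we can read the generated answer from the response; litellm returns an OpenAI-compatible object, so the text is available at choices[0].message.content. The evaluation sketch below is a minimal string-overlap check and assumes the validation file stores a reference answer under a hypothetical key such as 'ideal_answer'; adjust the key to match your val.json. Note also that, depending on your litellm setup, the model argument may need a provider prefix (e.g., "openai/gpt-4o-mini").
# Print the generated answer (OpenAI-compatible response shape)
answer_text = ai_response.choices[0].message.content
print("AI response:\n", answer_text)

# Minimal evaluation sketch: crude word-overlap with a reference answer from the validation set.
# NOTE: 'ideal_answer' is an assumed key name; change it to whatever your val.json actually uses.
reference = data[0].get('ideal_answer', "")
overlap_words = set(answer_text.lower().split()) & set(reference.lower().split())
print("Words shared with the reference answer:", len(overlap_words))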