Basic RAG from Scratch¶

This notebook implements a basic Retrieval-Augmented Generation (RAG) system from scratch, using only a handful of low-level building blocks (a PDF parser, NumPy, and an LLM/embedding API client) rather than a dedicated RAG framework. The focus is on demonstrating the core concepts of RAG with straightforward Python code.
Core Steps:
- Data Extraction: Read and extract text from a PDF file.
- Chunking: Split the text into smaller, overlapping chunks.
- Embedding Creation: Convert each chunk into a numerical vector using an embedding model.
- Semantic Search: Retrieve the chunks most relevant to a query using cosine similarity.
- Response Generation: Pass the retrieved chunks and the query to an LLM to produce a grounded answer.
- Evaluation: Compare the generated response against a reference answer from a validation set.
Setting Up the Environment¶
We begin by importing the necessary libraries: PyMuPDF (fitz) for PDF parsing, NumPy for vector math, json and os from the standard library, and litellm for the embedding and chat-completion API calls.
import fitz
import numpy as np
import json
import os
from litellm import completion, embedding
# The plain OpenAI client can also be used:
# from openai import OpenAI
# # Initialize the OpenAI client
# client = OpenAI(
#     api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
# )
# We use litellm here because it lets us easily switch between different LLM providers
# while exposing the same OpenAI-compatible API.
# Configure API keys (replace with your actual keys)
os.environ['OPENAI_API_KEY'] = "" # Replace with your OpenAI API key
os.environ['ANTHROPIC_API_KEY'] = "" # Replace with your Anthropic API key
os.environ['GROQ_API_KEY'] = "" # Replace with your Groq API key
Extracting Text from a PDF¶
We extract text from a PDF file using PyMuPDF. This process involves opening the PDF, reading its contents, and converting them into a format suitable for further processing.
def extract_text_from_pdf(pdf_path):
    """
    Extracts and consolidates text from all pages of a PDF file. This is the first step in the RAG pipeline,
    where we acquire the raw textual data that will later be processed, embedded, and retrieved against.

    Args:
        pdf_path (str): Path to the PDF file to be processed.

    Returns:
        str: Complete extracted text from all pages of the PDF, concatenated into a single string.
             This raw text will be further processed in subsequent steps of the RAG pipeline.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text
Chunking the Extracted Text¶
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.
def chunk_text(text, n, overlap):
    """
    Divides text into smaller, overlapping chunks for more effective processing and retrieval.

    Chunking is a critical step in RAG systems as it:
    1. Makes large documents manageable for embedding models that have token limits
    2. Enables more precise retrieval of relevant information
    3. Allows for contextual understanding within reasonable boundaries

    The overlap between chunks helps maintain context continuity and reduces the risk of
    splitting important information across chunk boundaries.

    Args:
        text (str): The complete text to be chunked.
        n (int): The maximum number of characters in each chunk.
        overlap (int): The number of overlapping characters between consecutive chunks.
                       Higher overlap improves context preservation but increases redundancy.

    Returns:
        List[str]: A list of text chunks that will be individually embedded and used for retrieval.
    """
    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
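To make the overlap behavior concrete, here is a quick sanity check on a toy string (the chunk size and overlap here are purely illustrative):
# Toy example: chunks of 10 characters with an overlap of 4, so each chunk starts 6 characters after the previous one
sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(sample, 10, 4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz', 'yz']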
Extracting and Chunking Text from a PDF File¶
Now, we load the PDF, extract text, and split it into chunks.
# Define the path to the PDF file
pdf_path = "data/AI_Information.pdf"
# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)
# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)
# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))
# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])
Creating Embeddings for Text Chunks¶
Embeddings transform text into numerical vectors, which allow for efficient similarity search.
def create_embeddings(text, model="text-embedding-ada-002"):
    """
    Transforms text into dense vector representations (embeddings) using a neural network model.

    Embeddings are the cornerstone of modern RAG systems because they:
    1. Capture semantic meaning in a numerical format that computers can process
    2. Enable similarity-based retrieval beyond simple keyword matching
    3. Allow for efficient indexing and searching of large document collections

    In RAG, both document chunks and user queries are embedded in the same vector space,
    allowing us to find the most semantically relevant chunks for a given query.

    Args:
        text (str or List[str]): The input text(s) to be embedded. Can be a single string or a list of strings.
        model (str): The embedding model to use. Default is OpenAI's "text-embedding-ada-002".
                     Different models offer various tradeoffs between quality, speed, and cost.

    Returns:
        dict: The response from the API containing the embeddings, which are high-dimensional
              vectors representing the semantic content of the input text(s).
    """
    # Create embeddings for the input text using the specified model
    response = embedding(model=model, input=text)
    return response  # Return the response containing the embeddings
# Create embeddings for the text chunks
response = create_embeddings(text_chunks)
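To verify that the call succeeded, we can inspect the response. The snippet below assumes the OpenAI-compatible response shape returned by litellm (the same .data[...].embedding access pattern used in semantic_search further down); text-embedding-ada-002 produces 1536-dimensional vectors.
# Inspect the embedding response (assumes an OpenAI-compatible response object)
print("Number of embeddings:", len(response.data))
print("Embedding dimension:", len(response.data[0].embedding))  # 1536 for text-embedding-ada-002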
Performing Semantic Search¶
We implement cosine similarity to find the most relevant text chunks for a user query.
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors, which measures the cosine of the angle between them.

    Cosine similarity is particularly well-suited for RAG systems because:
    1. It measures semantic similarity independent of vector magnitude (document length)
    2. It ranges from -1 (completely opposite) to 1 (exactly the same), making it easy to interpret
    3. It works well in high-dimensional spaces like those used for text embeddings
    4. It's computationally efficient compared to some other similarity metrics

    Args:
        vec1 (np.ndarray): The first embedding vector.
        vec2 (np.ndarray): The second embedding vector.

    Returns:
        float: The cosine similarity score between the two vectors, ranging from -1 to 1.
               Higher values indicate greater semantic similarity between the original texts.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search to find the most relevant text chunks for a given query.

    This is the core retrieval component of the RAG system, responsible for:
    1. Finding the most semantically relevant information from the knowledge base
    2. Filtering out irrelevant content to improve generation quality
    3. Providing the context that will be used by the LLM for response generation

    The quality of retrieval directly impacts the quality of the final generated response,
    as the LLM can only work with the context it's provided.

    Args:
        query (str): The user's question or query text.
        text_chunks (List[str]): The corpus of text chunks to search through.
        embeddings (List[dict]): Pre-computed embeddings for each text chunk.
        k (int): The number of top relevant chunks to retrieve. This parameter balances:
                 - Too low: May miss relevant information
                 - Too high: May include irrelevant information and exceed context limits

    Returns:
        List[str]: The top k most semantically relevant text chunks for the query,
                   which will be used as context for the LLM to generate a response.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query).data[0].embedding

    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]

    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]
Running a Query on Extracted Chunks¶
# Load the validation data from a JSON file
with open('data/val.json') as f:
    data = json.load(f)
# Extract the first query from the validation data
query = data[0]['question']
# Perform semantic search to find the top 2 most relevant text chunks for the query
top_chunks = semantic_search(query, text_chunks, response.data, k=2)
# Print the query
print("Query:", query)
# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")
Generating a Response Based on Retrieved Chunks¶
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"
def generate_response(system_prompt, user_message, model="meta-llama/Llama-3.2-3B-Instruct"):
    """
    Generates a contextually informed response using an LLM with the retrieved information.

    This is the 'Generation' part of Retrieval-Augmented Generation, where:
    1. The retrieved context is combined with the user query
    2. The LLM synthesizes this information to produce a coherent, accurate response
    3. The system prompt guides the model to stay faithful to the provided context

    By using retrieved information as context, the RAG system can:
    - Provide up-to-date information beyond the LLM's training data
    - Cite specific sources for its claims
    - Reduce hallucination by grounding responses in retrieved facts
    - Answer domain-specific questions with greater accuracy

    Args:
        system_prompt (str): Instructions that guide the AI's behavior and response style.
                             In RAG, this typically instructs the model to use only the provided context.
        user_message (str): The combined context and query to be sent to the LLM.
                            This includes both the retrieved text chunks and the original user question.
        model (str): The LLM to use for response generation. Default is "meta-llama/Llama-3.2-3B-Instruct".

    Returns:
        dict: The complete response from the LLM, containing the generated answer based on
              the retrieved context and original query.
    """
    response = completion(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )

    # Equivalent call using the plain OpenAI client:
    # response = client.chat.completions.create(
    #     model=model,
    #     temperature=0,
    #     messages=[
    #         {"role": "system", "content": system_prompt},
    #         {"role": "user", "content": user_message}
    #     ]
    # )

    return response
# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"
# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
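Finally, we can read the generated answer from the response; litellm returns an OpenAI-compatible object, so the text is available at choices[0].message.content. The evaluation sketch below is a minimal string-overlap check and assumes the validation file stores a reference answer under a hypothetical key such as 'ideal_answer'; adjust the key to match your val.json. Note also that, depending on your litellm setup, the model argument may need a provider prefix (e.g., "openai/gpt-4o-mini").
# Print the generated answer (OpenAI-compatible response shape)
answer_text = ai_response.choices[0].message.content
print("AI response:\n", answer_text)

# Minimal evaluation sketch: crude word-overlap with a reference answer from the validation set.
# NOTE: 'ideal_answer' is an assumed key name; change it to whatever your val.json actually uses.
reference = data[0].get('ideal_answer', "")
overlap_words = set(answer_text.lower().split()) & set(reference.lower().split())
print("Words shared with the reference answer:", len(overlap_words))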