Implementation
BM25 RAG
Introduction
BM25 Retrieval-Augmented Generation (BM25 RAG) is an advanced technique that combines the power of the BM25 (Best Matching 25) algorithm for information retrieval with large language models for text generation. This approach enhances the accuracy and relevance of generated responses by grounding them in specific, retrieved information using a proven probabilistic retrieval model.
This notebook aims to provide a clear and concise introduction to BM25 RAG, suitable for both beginners and experienced practitioners who want to understand and implement this technology.
Motivation
Traditional RAG systems often use dense vector embeddings for retrieval, which can be computationally expensive and may not always capture the nuances of term importance. BM25 RAG addresses these limitations by using a probabilistic retrieval model that considers term frequency, inverse document frequency, and document length. This approach can lead to more accurate and interpretable retrieval, especially for queries requiring specific or rare information.
Method Details
Document Preprocessing and Indexing
Document Chunking: The knowledge base documents are preprocessed and split into manageable chunks to create a searchable corpus.
Tokenization and Indexing: Each chunk is tokenized, and an inverted index is created. The BM25 algorithm calculates term frequencies and inverse document frequencies.
BM25 Retrieval-Augmented Generation Workflow
Query Input: A user provides a query that needs to be answered.
Retrieval Step: The query is tokenized, and relevant documents are retrieved using the BM25 scoring algorithm. This step considers term frequency, inverse document frequency, and document length to find the most relevant chunks.
Generation Step: The retrieved document chunks are passed to a large language model as additional context. The model uses this context to generate a more accurate and relevant response; a minimal sketch of this retrieve-then-generate loop follows below.
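To make the loop concrete before the LlamaIndex implementation later in this notebook, here is a minimal, library-agnostic sketch. It assumes the rank_bm25 package (pip install rank-bm25, not otherwise used in this notebook) for scoring, and call_llm is a hypothetical placeholder standing in for whichever LLM client you use.
# Minimal retrieve-then-generate sketch (illustrative only)
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 ranks documents using term frequency and inverse document frequency.",
    "Dense retrieval embeds queries and documents into a shared vector space.",
    "The Transformer architecture relies entirely on attention mechanisms.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "How does BM25 rank documents?"
top_chunks = bm25.get_top_n(query.lower().split(), corpus, n=2)  # retrieval step

# Generation step: pass the retrieved chunks to an LLM as grounding context.
prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
# answer = call_llm(prompt)  # hypothetical placeholder for OpenAI, Groq, etc.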
Key Features of BM25 RAG
Probabilistic Retrieval: BM25 uses a probabilistic model to rank documents, providing a theoretically sound basis for retrieval.
Term Frequency Saturation: BM25 accounts for diminishing returns from repeated terms, improving retrieval quality (see the scoring sketch after this list).
Document Length Normalization: The algorithm considers document length, reducing bias towards longer documents.
No Embedding Required: Unlike vector-based approaches, BM25 doesn't require document embeddings, which makes indexing and retrieval computationally cheaper.
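For reference, a toy version of the Okapi BM25 scoring function makes the saturation and length-normalization behaviour explicit. This is a simplified sketch, not the exact implementation used by LlamaIndex's BM25Retriever.
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # corpus: list of tokenized documents; doc_terms: the tokenized document being scored
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n_t = sum(1 for d in corpus if term in d)           # documents containing the term
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)   # rarer terms weigh more
        tf = doc_terms.count(term)
        # k1 caps the gain from repeated terms (saturation);
        # b controls how strongly long documents are penalized.
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score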
Benefits of this Approach
Improved Accuracy: Combines the strengths of probabilistic retrieval and neural text generation.
Interpretability: BM25 scoring provides a more interpretable retrieval process compared to dense vector retrieval methods.
Effective for Long-tail Queries: Particularly good at handling queries requiring specific or rare information.
Conclusion
BM25 Retrieval-Augmented Generation represents a powerful fusion of classic information retrieval techniques and modern language models. By leveraging the strengths of the BM25 algorithm, this approach offers improved accuracy, interpretability, and efficiency in various natural language processing tasks. As AI continues to evolve, BM25 RAG stands out as a robust method for building more reliable and context-sensitive AI systems, especially in domains where precise information retrieval is crucial.
Prerequisites
- Preferably Python 3.11
- Jupyter Notebook or JupyterLab
- LLM API Key
- You can use any LLM of your choice; in this notebook, we use OpenAI's GPT models
With these steps, you can implement a BM25 RAG system to enhance the capabilities of language models by incorporating efficient, probabilistic information retrieval, improving their effectiveness in various applications.
# !pip install llama-index
# !pip install llama-index-retrievers-bm25
# !pip install llama-index-vector-stores-qdrant
# !pip install llama-index-readers-file
# !pip install llama-index-embeddings-fastembed
# !pip install llama-index-llms-openai
# !pip install llama-index-llms-groq
# !pip install -U qdrant_client fastembed
# !pip install python-dotenv
# !pip install matplotlib
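The Stemmer module imported later in this notebook is provided by the PyStemmer package; it may already be pulled in as a dependency of llama-index-retrievers-bm25, but if the import fails, install it explicitly:
# !pip install PyStemmer  # provides the `Stemmer` module used by BM25Retriever below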
# Standard library imports
import logging
import sys
import os
# Third-party imports
from dotenv import load_dotenv
from IPython.display import Markdown, display
# Qdrant client import
import qdrant_client
# LlamaIndex core imports
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
# LlamaIndex vector store import
from llama_index.vector_stores.qdrant import QdrantVectorStore
# Embedding model imports
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
# LLM import
from llama_index.llms.openai import OpenAI
from llama_index.llms.groq import Groq
# Load environment variables
load_dotenv()
# Get OpenAI API key from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# GROQ_API_KEY = os.getenv("GROQ_API_KEY")
# Setting up Base LLM
Settings.llm = OpenAI(
model="gpt-4o-mini", temperature=0.1, max_tokens=8096, streaming=True
)
# Settings.llm = Groq(model="llama3-70b-8192", api_key=GROQ_API_KEY)
# Set the embedding model
# Option 1: FastEmbed with the BAAI/bge-base-en-v1.5 model (commented out)
# Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")
# Option 2: OpenAI's embedding model (used in this notebook)
Settings.embed_model = OpenAIEmbedding(embed_batch_size=10, api_key=OPENAI_API_KEY)
# Qdrant configuration (commented out)
# If you're using Qdrant, uncomment and set these variables:
# QDRANT_CLOUD_ENDPOINT = os.getenv("QDRANT_CLOUD_ENDPOINT")
# QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
# Note: Remember to add QDRANT_CLOUD_ENDPOINT and QDRANT_API_KEY to your .env file if using Qdrant Hosted version
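If you do enable Qdrant, a minimal setup using the client and vector store imported above might look like the following, kept commented out here; the collection name is just an illustrative example.
# Optional: wire up Qdrant as the vector store (collection name is an arbitrary example)
# client = qdrant_client.QdrantClient(url=QDRANT_CLOUD_ENDPOINT, api_key=QDRANT_API_KEY)
# vector_store = QdrantVectorStore(client=client, collection_name="BM25_RAG")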
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader("../data", recursive=True).load_data(show_progress=True)
Loading files: 100%|██████████| 1/1 [00:00<00:00, 6.11file/s]
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(
transformations=[
MarkdownNodeParser(include_metadata=True),
# TokenTextSplitter(chunk_size=500, chunk_overlap=20),
# SentenceSplitter(chunk_size=1024, chunk_overlap=20),
# SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95 , embed_model=Settings.embed_model),
Settings.embed_model,
],
)
# Run the ingestion pipeline to parse and embed the documents into nodes
nodes = pipeline.run(documents=documents, show_progress=True)
print("Number of Nodes:", len(nodes))
Parsing nodes: 100%|██████████| 58/58 [00:00<00:00, 14822.67it/s] Generating embeddings: 100%|██████████| 58/58 [00:01<00:00, 31.58it/s]
Number of Nodes: 58
# initialize a docstore to store nodes
# also available are mongodb, redis, postgres, etc for docstores
import asyncio
from llama_index.core.storage.docstore import SimpleDocumentStore
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
docstore.persist(persist_path="./docstore.json")
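In a later session the persisted docstore can be reloaded instead of re-running ingestion; a small sketch using the same SimpleDocumentStore API, kept commented out:
# Reload previously persisted nodes instead of re-ingesting
# docstore = SimpleDocumentStore.from_persist_path("./docstore.json")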
## MongoDB as Document Store
# !pip install llama-index-storage-index-store-mongodb
# !pip install llama-index-storage-docstore-mongodb
# from llama_index.storage.docstore.mongodb import MongoDocumentStore
# from llama_index.storage.kvstore.mongodb import MongoDBKVStore
# from pymongo import MongoClient
# from motor.motor_asyncio import AsyncIOMotorClient
# MONGO_URI = os.getenv("MONGO_URI")
# kv_store = MongoDBKVStore(mongo_client=MongoClient(MONGO_URI) , mongo_aclient=AsyncIOMotorClient(MONGO_URI))
# docstore = MongoDocumentStore(namespace="BM25_RAG", mongo_kvstore=kv_store)
# docstore.add_documents(nodes)
# !pip install llama-index-storage-docstore-redis
# !pip install llama-index-storage-index-store-redis
# from llama_index.storage.docstore.redis import RedisDocumentStore
# REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
# REDIS_PORT = os.getenv("REDIS_PORT", 6379)
# docstore=RedisDocumentStore.from_host_and_port(
# host=REDIS_HOST, port=REDIS_PORT, namespace="BM25_RAG"
# )
# docstore.add_documents(nodes)
from llama_index.retrievers.bm25 import BM25Retriever
import Stemmer
# We can pass in the index, docstore, or list of nodes to create the retriever
bm25_retriever = BM25Retriever.from_defaults(
docstore=docstore,
similarity_top_k=4,
# Optional: We can pass in the stemmer and set the language for stopwords
# This is important for removing stopwords and stemming the query + text
# The default is english for both
stemmer=Stemmer.Stemmer("english"),
language="english",
)
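As the comment above notes, the retriever can also be built directly from the in-memory nodes produced by the ingestion pipeline, skipping the docstore entirely; an equivalent variant, kept commented out:
# Alternative: construct the retriever straight from the ingested nodes
# bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=4)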
from llama_index.core.response.notebook_utils import display_source_node
# retrieve the chunks most relevant to the query from the indexed paper
retrieved_nodes = bm25_retriever.retrieve(
"Who are the Authors of this paper"
)
for node in retrieved_nodes:
display_source_node(node, source_length=5000)
Node ID: 04328457-baaf-4ee5-be1b-70f604c2fe05
Similarity: 1.7577731609344482
Text: Authors
Ashish Vaswani*
Noam Shazeer*
Niki Parmar*
Jakob Uszkoreit*
Google Brain
avaswani@google.com
noam@google.com
nikip@google.com
usz@google.com
Llion Jones*
Aidan N. Gomez* †
Łukasz Kaiser*
Google Research
University of Toronto
llion@google.com
aidan@cs.toronto.edu
lukaszkaiser@google.com
Illia Polosukhin* ‡
illia.polosukhin@gmail.com
Node ID: a2d32369-caf0-4529-ab20-dfd7b49a7705
Similarity: 1.4452868700027466
Text: 5.2 Hardware and Schedule
We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models, (described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
retrieved_nodes = bm25_retriever.retrieve("What is Attention mechanism")
for node in retrieved_nodes:
display_source_node(node, source_length=5000)
Node ID: e3519af3-3040-4fb5-84d3-485887327d61
Similarity: 2.1414361000061035
Text: What we are missing
In my opinion...
Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.
15
Node ID: 0fd6220c-d85e-4ede-8f33-c1237de9709b
Similarity: 1.9467532634735107
Text: Input-Input Layer 5
The law will never be perfect, but its application should be just.
This is what we are missing, in my opinion.
Node ID: 403d39f5-3650-4caa-bce7-cda0bbd11496
Similarity: 1.4834907054901123
Text: Attention Visualizations
It is this spirit that a majority of American governments have passed new laws since 2009.
Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.
Voting process more difficult.
---
Node ID: 65456aec-30fc-4363-84f1-d1caa5fcaa28
Similarity: 1.2847681045532227
Text: 2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode
response_synthesizer = get_response_synthesizer(
response_mode=ResponseMode.COMPACT_ACCUMULATE
)
BM25_QUERY_ENGINE = RetrieverQueryEngine(
    retriever=bm25_retriever,
    response_synthesizer=response_synthesizer,
)
response = BM25_QUERY_ENGINE.query("How many encoders are stacked in the transformer?")
display(Markdown(str(response)))