BM25 RAG (Retrieval-Augmented Generation)
Introduction
BM25 Retrieval-Augmented Generation (BM25 RAG) is an advanced technique that combines the power of the BM25 (Best Matching 25) algorithm for information retrieval with large language models for text generation. This approach enhances the accuracy and relevance of generated responses by grounding them in specific, retrieved information using a proven probabilistic retrieval model.
BM25 RAG Workflow
```mermaid
flowchart TD
    subgraph "1. Document Processing"
        A[Documents] --> B[Split Text into Chunks]
        B --> C1[Chunk-1]
        B --> C2[Chunk-2]
        B --> C3[Chunk-n]
    end
    subgraph "2. Indexing"
        C1 & C2 & C3 --> D[Tokenization]
        D --> E[TF-IDF Calculation]
        E --> F[(Inverted Index)]
    end
    subgraph "3. Query Processing"
        G[Query] --> H[Tokenization]
        H --> I[Query Terms]
    end
    subgraph "4. Retrieval"
        I -->|Term Matching| F
        F -->|BM25 Scoring| J[Relevant Chunks]
    end
    subgraph "5. Context Formation"
        J --> K[Query + Relevant Chunks]
    end
    subgraph "6. Generation"
        K --> L[LLM]
        L --> M[Response]
    end
    G --> K
```
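The notebook and chat application in this repository implement this workflow end to end. As a rough, self-contained sketch of the same six steps, the snippet below uses the rank_bm25 package for scoring and leaves the LLM call as a placeholder; the chunk size, the whitespace tokenizer, and the generate_answer stub are illustrative assumptions, not the exact implementation used here.

```python
from rank_bm25 import BM25Okapi

# 1. Document processing: split raw documents into fixed-size word chunks (size is an assumption)
def chunk(text, size=200):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

documents = [
    "BM25 is a probabilistic ranking function that scores documents using term frequency, "
    "inverse document frequency, and document length normalization.",
    "Retrieval-augmented generation grounds a language model's answer in documents "
    "retrieved for the user's query.",
]
chunks = [c for doc in documents for c in chunk(doc)]

# 2. Indexing: tokenize each chunk and build the BM25 index (simple whitespace tokenizer here)
tokenized_chunks = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenized_chunks)

# 3-4. Query processing and retrieval: tokenize the query, score every chunk, keep the best matches
query = "How does BM25 handle document length?"
top_chunks = bm25.get_top_n(query.lower().split(), chunks, n=3)

# 5. Context formation: combine the query with the retrieved chunks into a single prompt
context = "\n\n".join(top_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)

# 6. Generation: send the prompt to the LLM of your choice
# answer = generate_answer(prompt)  # hypothetical helper wrapping your LLM API
print(prompt)
```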
Getting Started
Notebook
You can run the Jupyter notebook provided in this repository to explore BM25 RAG in detail.
Chat Application
- Install dependencies:
- Run the application:
- To ingest data on the go:
Server
Run the server with:
The server has two endpoints (example calls are sketched below):
- /api/ingest: for ingesting new documents
- /api/query: for querying the BM25 RAG system
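As a rough illustration, both endpoints can be exercised over HTTP once the server is running. The snippet below is a minimal sketch that assumes the server listens on http://localhost:8000 and accepts simple JSON payloads; the actual port and the request/response field names may differ, so check the server code for the exact schema.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host and port; adjust to your server configuration

# Ingest a new document (the "text" field is an assumed payload shape, not the actual schema)
ingest_resp = requests.post(
    f"{BASE_URL}/api/ingest",
    json={"text": "BM25 is a probabilistic ranking function used in information retrieval."},
)
print(ingest_resp.status_code)

# Query the BM25 RAG system (the "query" field is likewise an assumption)
query_resp = requests.post(
    f"{BASE_URL}/api/query",
    json={"query": "What is BM25?"},
)
print(query_resp.json())
```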
Key Features of BM25 RAG
- Probabilistic Retrieval: BM25 uses a probabilistic model to rank documents, providing a theoretically sound basis for retrieval.
- Term Frequency Saturation: BM25 accounts for diminishing returns from repeated terms, improving retrieval quality.
- Document Length Normalization: The algorithm considers document length, reducing bias towards longer documents; both effects appear explicitly in the scoring formula shown after this list.
- Contextual Relevance: By grounding responses in retrieved information, BM25 RAG produces more accurate and relevant answers.
- Scalability: The BM25 retrieval step can handle large document collections efficiently.
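For reference, the scoring function behind these properties is the standard BM25 formula (implementations vary slightly in the exact IDF term):

$$
\text{score}(D, Q) = \sum_{t \in Q} \operatorname{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1 \left(1 - b + b \, \frac{|D|}{\text{avgdl}}\right)}
$$

where $f(t, D)$ is the frequency of term $t$ in chunk $D$, $|D|$ is the chunk length, $\text{avgdl}$ is the average chunk length in the collection, and the free parameters $k_1$ (commonly 1.2 to 2.0) and $b$ (commonly 0.75) control term-frequency saturation and length normalization respectively.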
Benefits of BM25 RAG
- Improved Accuracy: Combines the strengths of probabilistic retrieval and neural text generation.
- Interpretability: BM25 scoring provides a more interpretable retrieval process compared to dense vector retrieval methods.
- Handling Long-tail Queries: Particularly effective for queries requiring specific or rare information.
- No Embedding Required: Unlike vector-based RAG, BM25 doesn't require document embeddings, reducing computational overhead.
Prerequisites
- Python 3.7+
- Jupyter Notebook or JupyterLab (for running the notebook)
- Required Python packages (see requirements.txt)
- API key for the chosen Language Model (e.g., OpenAI API key)
Contributing
We welcome contributions! Please see our Contributing Guidelines for more details.
License
This project is licensed under the MIT License.
Acknowledgments
- AI Engineering Academy for supporting this project
- All contributors and community members
For more information, visit AI Engineering Academy.