RAGatouille Retriever Llama Pack¶
RAGatouille is a library that makes it easy to use SOTA retrieval models such as ColBERT in your RAG pipeline. You can use it either to run inference with ColBERT, or to train/fine-tune ColBERT-style models.
This LlamaPack shows you an easy way to bundle RAGatouille into your RAG pipeline. We use RAGatouille to index a corpus of documents (by default using colbertv2.0), and then we combine it with LlamaIndex query modules to synthesize an answer with an LLM.
%pip install llama-index-llms-openai
%pip install llama-index-packs-ragatouille-retriever
# Option: if developing with the llama_hub package
from llama_index.packs.ragatouille_retriever import RAGatouilleRetrieverPack
# Option: download_llama_pack
from llama_index.core.llama_pack import download_llama_pack
# RAGatouilleRetrieverPack = download_llama_pack(
# "RAGatouilleRetrieverPack",
# "./ragatouille_pack",
# skip_load=True,
# # leave the below line commented out if using the notebook on main
# # llama_hub_url="https://raw.githubusercontent.com/run-llama/llama-hub/jerry/add_llm_compiler_pack/llama_hub"
# )
Load Documents¶
Here we load the original ColBERT paper: https://arxiv.org/pdf/2004.12832.pdf.
!wget "https://arxiv.org/pdf/2004.12832.pdf" -O colbertv1.pdf
--2024-01-04 16:02:16--  https://arxiv.org/pdf/2004.12832.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4918165 (4.7M) [application/pdf]
Saving to: ‘colbertv1.pdf’

colbertv1.pdf       100%[===================>]   4.69M  --.-KB/s    in 0.1s

2024-01-04 16:02:16 (34.6 MB/s) - ‘colbertv1.pdf’ saved [4918165/4918165]
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
reader = SimpleDirectoryReader(input_files=["colbertv1.pdf"])
docs = reader.load_data()
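As an optional sanity check, you can confirm that the PDF was parsed into LlamaIndex Document objects before indexing (a small illustrative snippet, not part of the pack):
# optional: inspect what was loaded
print(f"Loaded {len(docs)} document(s)")
print(docs[0].text[:200])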
Create Pack¶
index_name = "my_index"
ragatouille_pack = RAGatouilleRetrieverPack(
docs, llm=OpenAI(model="gpt-3.5-turbo"), index_name=index_name, top_k=5
)
/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
[Jan 04, 16:02:19] #> Note: Output directory .ragatouille/colbert/indexes/my_index already exists
[Jan 04, 16:02:19] #> Will delete 10 files already at .ragatouille/colbert/indexes/my_index in 20 seconds...
#> Starting...
[Jan 04, 16:02:42] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling. warnings.warn( 0%| | 0/2 [00:00<?, ?it/s]/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling warnings.warn(
[Jan 04, 16:02:43] [0] #> Encoding 90 passages..
50%|█████ | 1/2 [00:03<00:03, 3.87s/it]/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling warnings.warn( 100%|██████████| 2/2 [00:05<00:00, 2.64s/it] WARNING clustering 14894 points to 1024 centroids: please provide at least 39936 training points
[Jan 04, 16:02:48] [0] avg_doclen_est = 174.1888885498047 len(local_sample) = 90 [Jan 04, 16:02:48] [0] Creating 1,024 partitions. [Jan 04, 16:02:48] [0] *Estimated* 15,676 embeddings. [Jan 04, 16:02:48] [0] #> Saving the indexing plan to .ragatouille/colbert/indexes/my_index/plan.json .. Clustering 14894 points in 128D to 1024 clusters, redo 1 times, 20 iterations Preprocessing in 0.00 s [0.037, 0.037, 0.033, 0.033, 0.033, 0.035, 0.035, 0.035, 0.032, 0.036, 0.032, 0.031, 0.035, 0.036, 0.035, 0.036, 0.034, 0.037, 0.033, 0.034, 0.036, 0.036, 0.035, 0.035, 0.033, 0.036, 0.036, 0.033, 0.037, 0.035, 0.035, 0.037, 0.036, 0.033, 0.037, 0.031, 0.035, 0.036, 0.035, 0.042, 0.037, 0.037, 0.037, 0.036, 0.036, 0.033, 0.034, 0.037, 0.036, 0.032, 0.034, 0.036, 0.038, 0.038, 0.035, 0.034, 0.039, 0.035, 0.036, 0.034, 0.035, 0.038, 0.035, 0.037, 0.035, 0.036, 0.04, 0.033, 0.034, 0.034, 0.038, 0.034, 0.038, 0.036, 0.038, 0.035, 0.037, 0.04, 0.036, 0.04, 0.037, 0.037, 0.037, 0.037, 0.034, 0.036, 0.034, 0.037, 0.032, 0.039, 0.037, 0.036, 0.034, 0.038, 0.035, 0.033, 0.039, 0.036, 0.035, 0.035, 0.039, 0.038, 0.034, 0.035, 0.037, 0.033, 0.033, 0.031, 0.035, 0.035, 0.035, 0.038, 0.036, 0.033, 0.035, 0.035, 0.038, 0.035, 0.035, 0.036, 0.036, 0.039, 0.036, 0.039, 0.034, 0.038, 0.038, 0.034] [Jan 04, 16:02:48] [0] #> Encoding 90 passages..
0it [00:00, ?it/s] 0%| | 0/2 [00:00<?, ?it/s] 50%|█████ | 1/2 [00:03<00:03, 3.32s/it] 100%|██████████| 2/2 [00:04<00:00, 2.34s/it] 1it [00:04, 4.72s/it] 100%|██████████| 1/1 [00:00<00:00, 5322.72it/s] 100%|██████████| 1024/1024 [00:00<00:00, 331171.82it/s]
[Jan 04, 16:02:53] #> Optimizing IVF to store map from centroids to list of pids..
[Jan 04, 16:02:53] #> Building the emb2pid mapping..
[Jan 04, 16:02:53] len(emb2pid) = 15677
[Jan 04, 16:02:53] #> Saved optimized IVF to .ragatouille/colbert/indexes/my_index/ivf.pid.pt
#> Joined...
Done indexing!
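At this point the ColBERT index has been written to disk under .ragatouille/colbert/indexes/my_index (see the logs above). Before diving into the next section, you can list the modules the pack exposes (a quick sketch; the keys used below are "retriever" and "RAG"):
modules = ragatouille_pack.get_modules()
print(list(modules.keys()))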
Try out Pack¶
We try out the individual modules in the pack as well as running it end-to-end!
from llama_index.core.response.notebook_utils import display_source_node
retriever = ragatouille_pack.get_modules()["retriever"]
nodes = retriever.retrieve("How does ColBERTv2 compare with other BERT models?")
for node in nodes:
display_source_node(node)
New index_name received! Updating current index_name (my_index) to my_index
Loading searcher for index my_index for the first time... This may take a few seconds
[Jan 04, 16:02:55] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 04, 16:02:56] #> Loading codec...
[Jan 04, 16:02:56] #> Loading IVF...
[Jan 04, 16:02:56] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling. warnings.warn(
[Jan 04, 16:02:56] #> Loading doclens...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5555.37it/s]
[Jan 04, 16:02:56] #> Loading codes and residuals...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 521.10it/s]
[Jan 04, 16:02:56] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jan 04, 16:02:56] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . How does ColBERTv2 compare with SPLADEv2?, True, None
#> Output IDs: torch.Size([32]), tensor([ 101, 1, 2129, 2515, 23928, 2615, 2475, 12826, 2007, 11867, 27266, 6777, 2475, 1029, 102, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103])
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling warnings.warn(
Node ID: 5e4028f7-fbb5-4440-abd0-0d8270cc8979
Similarity: 17.003997802734375
Text: While highly competitive in effec-
tiveness, ColBERT is orders of magnitude cheaper than BERT base...
Node ID: d6240a29-0a5e-458f-86f1-abe570e13200
Similarity: 16.764663696289062
Text: Note that any BERT-based model
must incur the computational cost of processing each document
at l...
Node ID: d19c0fe7-bdb7-4a51-ae89-00cd746b2d3a
Similarity: 16.70589828491211
Text: For instance,
its Recall@50 actually exceeds the official BM25’s Recall@1000 and
even all but docTT...
Node ID: 38e84e5b-4345-4b08-a7fd-de2de4fa645a
Similarity: 16.577777862548828
Text: /T_his layer serves to control the dimension
of ColBERT’s embeddings, producing m-dimensional emb...
Node ID: c82df506-412a-40c2-baf3-df51ab43e434
Similarity: 16.252092361450195
Text: For instance, at k=10, BERT requires nearly
180more FLOPs than ColBERT; at k=1000, BERT’s overhe...
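The retriever behaves like any other LlamaIndex retriever, so you can also plug it into a query engine of your own rather than the one bundled with the pack. A minimal sketch using the standard RetrieverQueryEngine (the variable names here are just illustrative):
from llama_index.core.query_engine import RetrieverQueryEngine

# build a custom query engine around the RAGatouille-backed retriever
custom_query_engine = RetrieverQueryEngine.from_args(
    retriever, llm=OpenAI(model="gpt-3.5-turbo")
)
custom_response = custom_query_engine.query("What is late interaction in ColBERT?")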
# try out the RAG module directly
RAG = ragatouille_pack.get_modules()["RAG"]
results = RAG.search(
"How does ColBERTv2 compare with other BERT models?", index_name=index_name, k=4
)
results
/Users/jerryliu/Programming/llama-hub/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling warnings.warn(
[{'content': 'While highly competitive in effec-\ntiveness, ColBERT is orders of magnitude cheaper than BERT base,\nin particular, by over 170 \x02in latency and 13,900 \x02in FLOPs. /T_his\nhighlights the expressiveness of our proposed late interaction mech-\nanism, particularly when coupled with a powerful pre-trained LM\nlike BERT. While ColBERT’s re-ranking latency is slightly higher\nthan the non-BERT re-ranking models shown (i.e., by 10s of mil-\nliseconds), this difference is explained by the time it takes to gather,\nstack, and transfer the document embeddings to the GPU. In partic-\nular, the query encoding and interaction in ColBERT consume only\n13 milliseconds of its total execution time. We note that ColBERT’s\nlatency and FLOPs can be considerably reduced by padding queries\nto a shorter length, using smaller vector dimensions (the MRR@10\nof which is tested in §4.5), employing quantization of the document\n6h/t_tps://github.com/mit-han-lab/torchpro/f_ile', 'score': 17.003997802734375, 'rank': 1}, {'content': 'Note that any BERT-based model\nmust incur the computational cost of processing each document\nat least once. While ColBERT encodes each document with BERT\nexactly once, existing BERT-based rankers would repeat similar\ncomputations on possibly hundreds of documents for each query.\nSe/t_ting Dimension( m) Bytes/Dim Space(GiBs) MRR@10\nRe-rank Cosine 128 4 286 34.9\nEnd-to-end L2 128 2 154 36.0\nRe-rank L2 128 2 143 34.8\nRe-rank Cosine 48 4 54 34.4\nRe-rank Cosine 24 2 27 33.9\nTable 4: Space Footprint vs MRR@10 (Dev) on MS MARCO.\nTable 4 reports the space footprint of ColBERT under various\nse/t_tings as we reduce the embeddings dimension and/or the bytes\nper dimension.', 'score': 16.764663696289062, 'rank': 2}, {'content': 'For instance,\nits Recall@50 actually exceeds the official BM25’s Recall@1000 and\neven all but docTTTTTquery’s Recall@200, emphasizing the value\nof end-to-end retrieval (instead of just re-ranking) with ColBERT.\n4.4 Ablation Studies\n0.220.240.260.280.300.320.340.36\nMRR@10BERT [CLS]-based dot-product (5-layer) [A]\nColBERT via average similarity (5-layer) [B]\nColBERT without query augmentation (5-layer) [C]\nColBERT (5-layer) [D]\nColBERT (12-layer) [E]\nColBERT + e2e retrieval (12-layer) [F]\nFigure 5: Ablation results on MS MARCO (Dev). Between\nbrackets is the number of BERT layers used in each model.\n/T_he results from §4.2 indicate that ColBERT is highly effective\ndespite the low cost and simplicity of its late interaction mechanism.', 'score': 16.70589828491211, 'rank': 3}, {'content': '/T_his layer serves to control the dimension\nof ColBERT’s embeddings, producing m-dimensional embeddings\nfor the layer’s output size m. As we discuss later in more detail,\nwe typically /f_ix mto be much smaller than BERT’s /f_ixed hidden\ndimension.\nWhile ColBERT’s embedding dimension has limited impact on\nthe efficiency of query encoding, this step is crucial for controlling\nthe space footprint of documents, as we show in §4.5. In addition, it\ncan have a signi/f_icant impact on query execution time, particularly\nthe time taken for transferring the document representations onto\nthe GPU from system memory (where they reside before processing\na query). In fact, as we show in §4.2, gathering, stacking, and\ntransferring the embeddings from CPU to GPU can be the most\nexpensive step in re-ranking with ColBERT. 
Finally, the output\nembeddings are normalized so each has L2 norm equal to one.\n/T_he result is that the dot-product of any two embeddings becomes\nequivalent to their cosine similarity, falling in the »\x001;1¼range.\nDocument Encoder.', 'score': 16.577777862548828, 'rank': 4}]
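Each result is a plain dict with content, score, and rank keys, so you can post-process the raw search output directly, for example:
# print a short preview of each retrieved passage
for result in results:
    print(result["rank"], round(result["score"], 2), result["content"][:80])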
# run pack e2e, which includes the full query engine with OpenAI LLMs
response = ragatouille_pack.run("How does ColBERTv2 compare with other BERT models?")
print(str(response))
ColBERTv2, which employs late interaction over BERT base, performs no worse than the original adaptation of BERT base for ranking. It is only marginally less effective than BERT large and our training of BERT base. While highly competitive in effectiveness, ColBERTv2 is orders of magnitude cheaper than BERT base, particularly in terms of latency and FLOPs.
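Assuming run returns a standard LlamaIndex Response object (it wraps the output of the pack's query engine), you can also inspect the source nodes that were retrieved for the answer:
# show which passages the LLM saw when synthesizing the answer
for source_node in response.source_nodes:
    display_source_node(source_node)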