RAG: Probabilistic Information Retrieval From Scratch
Introduction
All the decisions we make throughout our lives are shaped by information. Whether directly or indirectly, our tastes, beliefs, and perceptions of the world are deeply influenced by context. As Seneca once said, "There is no favorable wind for the sailor who doesn’t know where to go"—in other words, information only becomes relevant when we have a clear goal, a context in which it makes sense. And I’m not talking about superficial information, like 30-second videos that inject us with dopamine, but rather true and useful information.
What I want to highlight is that generative AI is only precise and effective within a given context. In this article, I will explain how I implemented a ranking function from scratch and demonstrate the importance of this process in building more precise systems adapted to real-world usage contexts.
RAG
As I mentioned at the beginning, throughout our journey, we make decisions based on information, and it is context that gives value to this information. Isolated or disconnected information is not enough; a clear focus and direction are necessary. Retrieval-Augmented Generation (RAG) is precisely that: a technique that enhances the output of large language models by allowing them to access a reliable external knowledge base before generating a response.
Large language models (LLMs) are trained on vast amounts of data and use billions of parameters to generate responses for tasks like answering questions, translating text, or completing sentences. Their true potential, however, expands when we combine them with external, domain-specific, and contextually relevant sources. With RAG, there is no need to retrain the model for it to adapt to a specific domain or access an organization's internal knowledge base. This approach not only makes the process more efficient but also keeps the model relevant and useful, enhancing its capabilities without compromising accuracy.
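Conceptually, the flow is: retrieve relevant documents, prepend them to the prompt, and only then generate. The sketch below illustrates the idea; the function and parameter names are purely illustrative and not part of any library.
// A minimal sketch of the RAG flow described above. All names are illustrative.
async function answerWithRag(
  query: string,
  retrieve: (q: string) => string[],             // retrieval: find relevant documents
  generate: (prompt: string) => Promise<string>  // generation: call the LLM
): Promise<string> {
  const context = retrieve(query).join("\n");    // 1. retrieve external knowledge
  const prompt = `Context:\n${context}\n\nQuestion: ${query}`; // 2. augment the prompt
  return generate(prompt);                       // 3. generate an answer grounded in that context
}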
BM25
The function we will explore and reproduce in this article is BM25, or Best Matching 25. It is based on the probabilistic information retrieval model, which assumes there is a probability distribution that determines how relevant a document is to a given query. In practice:
$$\mathrm{score}(d, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgDL}}\right)}$$

$$\mathrm{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

In the BM25 formula, score(d, Q) represents the relevance score of document d with respect to the query Q, and the summation iterates over each term q_i present in the query. The first factor is the Inverse Document Frequency (IDF), which measures the importance of the term in the document collection: N is the total number of documents, while n(q_i) is the number of documents containing the term q_i. This expression ensures that common terms receive less weight, whereas rare terms gain more importance.

The second factor is a smoothed version of the term's frequency in the document. The numerator amplifies the term count in document d, where f(q_i, d) is the number of occurrences of term q_i in d. The denominator normalizes this value by document length: |d| is the total number of words in document d, and avgDL is the average document length in the collection. The parameter k_1 controls term-frequency saturation, with common values between 1.2 and 2.0, while b adjusts the influence of document length and is usually set to 0.75.
Overall, this formula balances term importance by considering its global presence in the collection, local frequency in the document, and the document's relative length compared to others.
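To make this concrete, here is a small worked example with purely illustrative numbers. Suppose the collection has N = 5 documents, a query term q_i appears in n(q_i) = 1 of them, it occurs once in document d (f(q_i, d) = 1), d has exactly the average length (|d| = avgDL), and we use k_1 = 1.5 and b = 0.75. Then

$$\mathrm{IDF}(q_i) = \ln\left(\frac{5 - 1 + 0.5}{1 + 0.5} + 1\right) = \ln(4) \approx 1.39, \qquad \frac{1 \cdot (1.5 + 1)}{1 + 1.5 \cdot (1 - 0.75 + 0.75 \cdot 1)} = \frac{2.5}{2.5} = 1,$$

so the term contributes about 1.39 to score(d, Q). If the same term instead appeared in 4 of the 5 documents, its IDF would drop to ln(1.5/4.5 + 1) ≈ 0.29 and its contribution would shrink accordingly: rare terms dominate the score, while common terms barely move it.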
Let's proceed with the implementation of BM25, considering the previous context.
class BM25 {
  private k1: number; // term-frequency saturation parameter (typical values: 1.2–2.0)
  private b: number;  // document-length normalization parameter (typically 0.75)
  private corpus: string[];

  constructor(corpus: string[], k1: number = 1.5, b: number = 0.75) {
    this.k1 = k1;
    this.b = b;
    this.corpus = corpus;
  }

  // Inverse Document Frequency: rare terms weigh more than common ones.
  // Note: `includes` is a simple substring check, kept here for simplicity.
  private idf(term: string): number {
    const docCount = this.corpus.length; // N
    const docWithTerm = this.corpus.filter((doc) => doc.includes(term)).length; // n(q_i)
    return Math.log((docCount - docWithTerm + 0.5) / (docWithTerm + 0.5) + 1.0);
  }

  // f(q_i, d): whole-word occurrences of `term` in `doc` (terms are assumed to be plain words).
  private termFrequency(term: string, doc: string): number {
    return (doc.match(new RegExp(`\\b${term}\\b`, "g")) || []).length;
  }

  // avgDL: average document length (in words) across the corpus.
  private avgDocumentLength(): number {
    const totalLength = this.corpus.reduce(
      (sum, doc) => sum + doc.split(" ").length,
      0
    );
    return totalLength / this.corpus.length;
  }

  // BM25 score of a single document for the given query.
  public score(query: string, doc: string): number {
    const terms = query.split(" ");
    const docLength = doc.split(" ").length;
    const avgLength = this.avgDocumentLength();
    let score = 0;
    for (const term of terms) {
      const tf = this.termFrequency(term, doc);
      const idf = this.idf(term);
      const termScore =
        idf *
        ((tf * (this.k1 + 1)) /
          (tf + this.k1 * (1 - this.b + (this.b * docLength) / avgLength)));
      score += termScore;
    }
    return score;
  }

  // Rank the whole corpus by descending BM25 score for the query.
  public rank(query: string): string[] {
    const scores = this.corpus.map((doc) => ({
      doc,
      score: this.score(query, doc),
    }));
    return scores.sort((a, b) => b.score - a.score).map((s) => s.doc);
  }
}
The code above is a direct implementation of the formula shown earlier.
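As a quick sanity check (assuming the BM25 class above is in scope), you can score a couple of toy documents against a single-word query and confirm that only documents containing the term receive a positive score:
const demoCorpus = [
  "the cat sat on the mat",
  "dogs are loyal animals",
  "the cat chased the dog",
];
const demoBm25 = new BM25(demoCorpus);
console.log(demoBm25.score("cat", demoCorpus[0])); // positive: "cat" occurs in this document
console.log(demoBm25.score("cat", demoCorpus[1])); // 0: "cat" does not occur here
console.log(demoBm25.rank("cat dog"));             // corpus reordered by descending BM25 score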
Ollama
To make things easier, I created a small abstraction over Ollama's local HTTP API.
// The HTTP client used below is `request` from undici (install with `npm i undici`).
import { request } from "undici";

class OllamaLocal {
  private apiUrl: string;
  private model: string;

  constructor(model: string, apiUrl: string = "http://localhost:11434/api/generate") {
    this.apiUrl = apiUrl;
    this.model = model;
  }

  // Send the retrieved documents plus the question to Ollama's /api/generate endpoint.
  public async generateResponse(query: string, context: string[]): Promise<string> {
    try {
      const contextText = context.join("\n");
      const { body } = await request(this.apiUrl, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: this.model,
          prompt: `Context: \n${contextText}\n\nQuestion: ${query}\nAnswer directly and only as necessary based on the provided context.`,
          stream: false, // ask for a single JSON response instead of a token stream
        }),
      });
      const data = (await body.json()) as any;
      return data.response?.trim() || "No response generated.";
    } catch (error) {
      console.error("Error generating response with Ollama:", error);
      return "Error generating response.";
    }
  }
}
// Glue class: BM25 handles retrieval, Ollama handles generation.
class BM25WithOllama {
  private bm25: BM25;
  private ollama: OllamaLocal;

  constructor(bm25: BM25, ollama: OllamaLocal) {
    this.bm25 = bm25;
    this.ollama = ollama;
  }

  // Retrieval only: returns the corpus ranked by BM25 (no text generation happens here).
  public async rankAndGenerateAnswer(query: string): Promise<string[]> {
    const rankedDocs = this.bm25.rank(query);
    return rankedDocs;
  }

  // Full RAG step: rank the documents, then pass them to the LLM as context.
  public async generateAnswer(query: string): Promise<string> {
    const rankedDocs = this.bm25.rank(query);
    return await this.ollama.generateResponse(query, rankedDocs);
  }
}
Deepseek
Now testing the implementation:
(async () => {
  // A tiny FAQ corpus: each entry is a "question | answer" pair.
  const corpus = [
    "How can I contact customer support? | You can contact customer support by emailing support@yourcompany.com or calling (123) 456-7890.",
    "What are your business hours? | Our business hours are Monday to Friday, from 9:00 AM to 6:00 PM.",
    "How do I track my order? | To track your order, visit our website and use the tracking number provided in your confirmation email.",
    "Do you offer international shipping? | Yes, we offer international shipping to over 50 countries. Shipping fees vary based on destination.",
    "How can I return a product? | You can return a product within 30 days of purchase by visiting our returns page on the website or contacting customer support."
  ];

  const bm25 = new BM25(corpus);
  const ollama = new OllamaLocal("deepseek-r1:8b");
  const bm25WithOllama = new BM25WithOllama(bm25, ollama);

  const query = "Are you open at 8 AM?";

  // Retrieval step: show how BM25 orders the FAQ entries for this query.
  const rankedDocs = await bm25WithOllama.rankAndGenerateAnswer(query);
  console.log("Ranked FAQs:");
  rankedDocs.forEach((doc, index) => {
    const [faq, response] = doc.split(" | ");
    console.log(`${index + 1}: ${faq} - ${response}`);
  });

  // Generation step: deepseek-r1 includes its reasoning in the output,
  // so keep only the last line as the final answer.
  const ollamaResponse = await bm25WithOllama.generateAnswer(query);
  const lines = ollamaResponse.split("\n");
  console.log("AI response:", lines[lines.length - 1]);
})();
You should see something like:
1: Do you offer international shipping? - Yes, we offer international shipping to over 50 countries. Shipping fees vary based on destination.
2: How can I contact customer support? - You can contact customer support by emailing support@yourcompany.com or calling (123) 456-7890.
3: What are your business hours? - Our business hours are Monday to Friday, from 9:00 AM to 6:00 PM.
4: How do I track my order? - To track your order, visit our website and use the tracking number provided in your confirmation email.
5: How can I return a product? - You can return a product within 30 days of purchase by visiting our returns page on the website or contacting customer support.
AI response: No, the company does not open at 8 AM. Their business hours are Monday to Friday, from 9:00 AM to 6:00 PM.
The true power of this tool lies in how it organizes information. With each new query, the ranking dynamically adjusts to surface the most relevant results, helping the generated response stay aligned with the given context.
Try modifying the query and observe how the ranking updates in real time. With every iteration, we reevaluate the available options and select the most suitable responses based on the new query. Experiment with different approaches and see how the results adapt!
For this article, I’m using the deepseek-r1:8b model, but feel free to choose the model that best suits your needs. Explore different options and compare the results to see what works best for you.
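Swapping models is a one-line change. For example (the tag below is only an illustration; use any model you have already pulled with ollama pull):
const ollama = new OllamaLocal("llama3.2:3b"); // illustrative tag; any locally available Ollama model works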
Conclusion
I hope this article has given you a clear understanding of how a ranking system works and how it can enhance the contextualization of an LLM, ultimately improving its accuracy in generating relevant responses. The ability to dynamically adjust rankings based on queries is a powerful technique that can significantly optimize interactions with large language models.
This is just the beginning of a series of publications I plan to share, where I will dive deeper into various AI and statistical concepts.