The Query Loop: Retrieve, Augment, Generate

Your knowledge base is built and saved. Now comes the satisfying part: asking it questions. This lesson assembles the final piece, the query loop, where a question gets answered using your own documents. This is the moment all three letters of RAG come together: Retrieve, Augment, Generate.

What You'll Learn

The three steps every RAG query runs through
How to retrieve the most relevant chunks from Chroma
How to build a prompt that "augments" the question with that context
How to send everything to your local model and get a grounded answer

The Three Steps

Every question your system answers follows the same three-step loop.

Retrieve: Find best chunks in Chroma
Augment: Add chunks to the prompt
Generate: Local model writes the answer

Let's build each step. Create a new file called ask.py.

Step 1: Retrieve

First we reconnect to the same database and collection we built in the last lesson. Because Chroma is persistent, the data is already there waiting; we do not rebuild it.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="db")
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name="nomic-embed-text",
)
collection = client.get_or_create_collection(
    name="my_notes",
    embedding_function=ollama_ef,
)

Now we ask Chroma for the chunks most relevant to a question. Chroma embeds the question (using the same model) and runs the similarity search for us, returning the closest chunks.

question = "What are the main causes of inflation?"

results = collection.query(
    query_texts=[question],
    n_results=3,          # return the 3 closest chunks
)
retrieved_chunks = results["documents"][0]

n_results=3 asks for the three best-matching chunks. Three is a good starting number: enough context to answer well, not so much that you overwhelm a small model. retrieved_chunks is now a short list of the most relevant pieces of your documents.
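To make "closest chunks" concrete, here is a minimal sketch of what a similarity search does, written in plain Python. It is an illustration only: the vectors are made-up three-number stand-ins for real embeddings, the function names are hypothetical, and cosine similarity is just one common measure. Chroma's own index and distance metric are more sophisticated and may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
chunks = ["inflation and money supply", "photosynthesis", "demand-pull inflation", "volcanoes"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1], [0.1, 0.9, 0.2]]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question

best = top_k(query_vec, chunk_vecs, k=2)
print([chunks[i] for i in best])
```

With these toy numbers, the two inflation chunks score highest because their vectors point in nearly the same direction as the query vector.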

Step 2: Augment

"Augment" simply means we build a prompt that combines the retrieved context with the question, plus a clear instruction. The instruction tells the model to answer from the context and to admit when the answer is not there. This is what keeps answers grounded in your documents instead of made up.

context = "\n\n".join(retrieved_chunks)

prompt = f"""Use ONLY the context below to answer the question.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer:"""

That phrase "use ONLY the context" is doing important work. It steers the model toward your material and away from guessing. Telling it to say "I don't know" when the context lacks the answer is what reduces made-up answers (often called hallucinations).
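If you expect to reuse the same prompt in several places, one option is to wrap the augment step in a small function. The sketch below assumes a hypothetical helper named build_prompt; numbering the chunks is an optional extra (not part of the lesson's prompt) that makes it easier to see where one chunk ends and the next begins.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one grounded prompt."""
    # Number each chunk so the boundaries between chunks are visible.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Use ONLY the context below to answer the question.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

# Placeholder chunks stand in for whatever Chroma returned.
example = build_prompt(
    "What are the main causes of inflation?",
    ["Placeholder chunk one.", "Placeholder chunk two."],
)
print(example)
```

Printing the prompt before sending it is a quick way to check exactly what the model will see.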

Step 3: Generate

Finally we send the augmented prompt to the local llama3.2 model through Ollama and print the answer. We will use the small ollama Python library, which talks to the same local service. Install it once with pip install ollama, then add:

import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": prompt}],
)

print(response["message"]["content"])

ollama.chat sends the prompt to the model running on your machine and returns its reply. We pull the text out of the response and print it. Run python ask.py and you will see an answer written from your own documents.
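To see what "talks to the same local service" means, here is a rough sketch of the equivalent HTTP request built with only the standard library. It assumes Ollama is listening at its default local address and exposes a chat endpoint at /api/chat that accepts a JSON body with model, messages, and stream fields; the function names are hypothetical, and the ollama library remains the simpler choice.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default local address

def build_chat_request(prompt, model="llama3.2"):
    """Build (but do not send) a POST request for Ollama's chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON reply instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, model="llama3.2"):
    """Send the request and return the reply text. Needs Ollama running locally."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        return json.loads(resp.read())["message"]["content"]

# Building the request does not contact the server, so this runs anywhere.
req = build_chat_request("Say hello.")
print(req.get_method(), req.full_url)
```

Calling ask(prompt) only works while the Ollama service is running; if it is not, urlopen raises a connection error.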

The Whole Query Loop

Here is ask.py in full, the complete retrieve-augment-generate loop in under 40 lines:

import chromadb
from chromadb.utils import embedding_functions
import ollama

# Reconnect to the saved knowledge base
client = chromadb.PersistentClient(path="db")
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings", model_name="nomic-embed-text",
)
collection = client.get_or_create_collection(
    name="my_notes", embedding_function=ollama_ef,
)

question = "What are the main causes of inflation?"

# 1. RETRIEVE
results = collection.query(query_texts=[question], n_results=3)
context = "\n\n".join(results["documents"][0])

# 2. AUGMENT
prompt = f"""Use ONLY the context below to answer the question.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}

Answer:"""

# 3. GENERATE
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": prompt}],
)
print(response["message"]["content"])

That is a complete, working, fully local RAG system. Everything (the embedding, the storage, the search, and the answer) happens on your machine.

Make It Interactive

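To ask many questions without editing the file each time, replace the hard-coded question and everything after it with a loop that reads from your keyboard. The loop keeps the "say you don't know" instruction in the prompt so that answers stay grounded:

while True:
    question = input("\nAsk a question (or type 'quit'): ").strip()
    if question.lower() == "quit":
        break
    if not question:
        continue  # ignore empty input
    # Retrieve the closest chunks and join them into one context string
    results = collection.query(query_texts=[question], n_results=3)
    context = "\n\n".join(results["documents"][0])
    # Same instructions as before, so the model admits when it lacks the answer
    prompt = (
        "Use ONLY the context below to answer the question.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    )
    response = ollama.chat(model="llama3.2",
        messages=[{"role": "user", "content": prompt}])
    print(response["message"]["content"])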

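Now you have a private question-and-answer session with your own notes. Ask as many questions as you like, all offline and free. Note that each question is answered on its own: the loop does not remember earlier questions, so phrase each one so that it makes sense without the previous ones.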

Why the Answers Stay Grounded

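Because the prompt contains only the chunks Chroma retrieved and tells the model to rely on them, its answers are anchored to your material. If you ask about something not in your documents and your instruction is doing its job, the model should tell you it does not know rather than invent an answer. This is not a guarantee, since a small model can still fall back on what it learned in training, but it makes made-up answers much less likely. That honesty is a feature: it is what makes a study helper trustworthy.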
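One way to check that an answer really is grounded is to print the retrieved chunks next to it. The sketch below uses a hypothetical helper, format_sources, and a hand-written stand-in for the query result. It assumes the result contains "ids" and "distances" lists alongside "documents", each holding one inner list per question, and that a smaller distance means a closer match. Check your own results object before relying on those keys.

```python
def format_sources(results, max_chars=80):
    """Return one line per retrieved chunk: id, distance, and a short preview."""
    lines = []
    # Each field holds one inner list per query; [0] selects our single question.
    rows = zip(results["ids"][0], results["distances"][0], results["documents"][0])
    for chunk_id, distance, text in rows:
        preview = text[:max_chars].replace("\n", " ")
        lines.append(f"{chunk_id}  distance={distance:.3f}  {preview}")
    return lines

# Hand-written stand-in shaped like a query result; the values are made up.
sample_results = {
    "ids": [["chunk-4", "chunk-9"]],
    "distances": [[0.21, 0.35]],
    "documents": [["Placeholder text of the closest chunk.",
                   "Placeholder text of the next chunk."]],
}

for line in format_sources(sample_results):
    print(line)
```

In ask.py you would pass the real results from collection.query and print these lines after the answer, so you can read the source text and judge whether the answer follows from it.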

Key Takeaways

A RAG query runs three steps: Retrieve the best chunks, Augment the prompt with them, Generate the answer.
collection.query(query_texts=[question], n_results=3) embeds your question and returns the closest chunks automatically.
The prompt should instruct the model to answer only from the context and to say "I don't know" when the answer is missing, which reduces made-up answers.
ollama.chat(model="llama3.2", ...) sends the augmented prompt to your local model and returns a grounded answer.
The entire loop runs on your machine, so it is private, free, and works offline.
