The Query Loop: Retrieve, Augment, Generate
Your knowledge base is built and saved. Now comes the satisfying part: asking it questions. This lesson assembles the final piece, the query loop, where a question gets answered using your own documents. This is the moment all three letters of RAG come together: Retrieve, Augment, Generate.
What You'll Learn
- The three steps every RAG query runs through
- How to retrieve the most relevant chunks from Chroma
- How to build a prompt that "augments" the question with that context
- How to send everything to your local model and get a grounded answer
The Three Steps
Every question your system answers follows the same three-step loop.
- Retrieve: find the best chunks in Chroma
- Augment: add the chunks to the prompt
- Generate: the local model writes the answer
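Before wiring in the real tools, it can help to see the shape of the loop on its own. The sketch below is only an illustration: answer_question, fake_retrieve, and fake_generate are hypothetical names, and the two "fake" functions are toy stand-ins for Chroma and the local model so the example runs with nothing installed.

```python
from typing import Callable, List


def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Run one pass of the loop: retrieve, augment, generate."""
    chunks = retrieve(question)                     # 1. Retrieve
    context = "\n\n".join(chunks)
    prompt = (                                      # 2. Augment
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                         # 3. Generate


# Toy stand-ins (assumptions for illustration, not the real Chroma or Ollama calls).
def fake_retrieve(question: str) -> List[str]:
    return ["Inflation can rise when demand outpaces supply."]


def fake_generate(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"


print(answer_question("What causes inflation?", fake_retrieve, fake_generate))
```

The rest of this lesson replaces each stand-in with the real thing: Chroma for retrieval and the local model for generation.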
Let's build each step. Create a new file called ask.py.
Step 1: Retrieve
First we reconnect to the same database and collection we built in the last lesson. Because Chroma is persistent, the data is already there waiting; we do not rebuild it.
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="db")
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name="nomic-embed-text",
)
collection = client.get_or_create_collection(
    name="my_notes",
    embedding_function=ollama_ef,
)
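One thing worth guarding against: get_or_create_collection quietly creates a new, empty collection if the name or path does not match what you saved earlier. A small check like the one below can catch that early. This is a sketch, not part of the original ask.py; check_collection and FakeCollection are hypothetical names, and the only Chroma call it relies on is the collection's count() method.

```python
def check_collection(collection) -> int:
    """Return the number of stored chunks, or raise if the collection is empty."""
    count = collection.count()  # number of items stored in the collection
    if count == 0:
        raise RuntimeError(
            "Collection is empty. Check the database path and collection name, "
            "then re-run the indexing script from the previous lesson."
        )
    return count


# Stand-in object so the sketch runs without a database;
# in ask.py you would pass the real `collection` instead.
class FakeCollection:
    def __init__(self, n: int) -> None:
        self.n = n

    def count(self) -> int:
        return self.n


print(check_collection(FakeCollection(42)))
```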
Now we ask Chroma for the chunks most relevant to a question. Chroma embeds the question (using the same model) and runs the similarity search for us, returning the closest chunks.
question = "What are the main causes of inflation?"
results = collection.query(
    query_texts=[question],
    n_results=3,  # return the 3 closest chunks
)
retrieved_chunks = results["documents"][0]
n_results=3 asks for the three best-matching chunks. Three is a good starting number: enough context to answer well, not so much that you overwhelm a small model. retrieved_chunks is now a short list of the most relevant pieces of your documents.
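If you are curious what "closest" means, here is a toy version of a similarity search in plain Python. It is a conceptual sketch only: the three-number vectors are made up, cosine_similarity and top_k are hypothetical helper names, and Chroma itself uses a real index and a configurable distance metric rather than this brute-force loop.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], chunks: List[Tuple[str, List[float]]], k: int = 3) -> List[str]:
    """Return the texts of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up embeddings for illustration; real ones have hundreds of dimensions.
toy_chunks = [
    ("Inflation and money supply", [0.9, 0.1, 0.0]),
    ("Photosynthesis in plants", [0.0, 0.2, 0.9]),
    ("Demand-pull inflation", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

print(top_k(query_vec, toy_chunks, k=2))
# → ['Inflation and money supply', 'Demand-pull inflation']
```

The two inflation chunks point in nearly the same direction as the query vector, so they rank above the unrelated one. That ranking is the whole idea behind the retrieve step.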
Step 2: Augment
"Augment" simply means we build a prompt that combines the retrieved context with the question, plus a clear instruction. The instruction tells the model to answer from the context and to admit when the answer is not there. This is what keeps answers grounded in your documents instead of made up.
context = "\n\n".join(retrieved_chunks)
prompt = f"""Use ONLY the context below to answer the question.
If the answer is not in the context, say you don't know.
Context:
{context}
Question: {question}
Answer:"""
That phrase "use ONLY the context" is doing important work. It steers the model toward your material and away from guessing. Telling it to say "I don't know" when the context lacks the answer is what reduces made-up answers (often called hallucinations).
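As the script grows, you may prefer to keep the prompt in one function. The sketch below is one possible way to do that, not the lesson's required form: build_prompt is a hypothetical helper, and numbering the chunks and using a placeholder for empty context are assumptions added for illustration.

```python
from typing import List


def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine retrieved chunks and the question into one grounded prompt."""
    if chunks:
        # Number each chunk so the boundaries between them stay visible.
        context = "\n\n".join(
            f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
        )
    else:
        # Assumption: make an empty retrieval explicit instead of leaving a blank.
        context = "(no context retrieved)"
    return (
        "Use ONLY the context below to answer the question.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_prompt(
    "What are the main causes of inflation?",
    ["Demand can outpace supply.", "Production costs can rise."],
))
```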
Step 3: Generate
Finally we send the augmented prompt to the local llama3.2 model through Ollama and print the answer. We will use the small ollama Python library, which talks to the same local service. Install it once with pip install ollama, then add:
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": prompt}],
)
print(response["message"]["content"])
ollama.chat sends the prompt to the model running on your machine and returns its reply. We pull the text out of the response and print it. Run python ask.py and you will see an answer written from your own documents.
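If the Ollama service is not running, the call above fails with an error. One way to soften that is a small wrapper like the sketch below. The names generate, chat_fn, and fake_chat are hypothetical; the chat_fn parameter exists only so the example can run without Ollama, and setting the temperature to 0 through options is an assumption meant to make answers more repeatable, not something the lesson requires.

```python
from typing import Callable, Optional


def generate(prompt: str, model: str = "llama3.2",
             chat_fn: Optional[Callable] = None) -> str:
    """Send the prompt to a chat function and return the reply text."""
    if chat_fn is None:
        import ollama  # imported lazily so this sketch loads without the library
        chat_fn = ollama.chat
    try:
        response = chat_fn(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            options={"temperature": 0},  # assumption: favor repeatable answers
        )
    except Exception as exc:  # e.g. the local service is not running
        return f"Could not reach the model: {exc}"
    return response["message"]["content"].strip()


# Stand-in for ollama.chat so the example runs anywhere.
def fake_chat(model, messages, options=None):
    return {"message": {"content": " I don't know. "}}


print(generate("Question: ...", chat_fn=fake_chat))
```

In ask.py you would call generate(prompt) with no chat_fn, which falls back to the real ollama.chat.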
The Whole Query Loop
Here is ask.py in full, the complete retrieve-augment-generate loop in about 25 lines:
import chromadb
from chromadb.utils import embedding_functions
import ollama

# Reconnect to the saved knowledge base
client = chromadb.PersistentClient(path="db")
ollama_ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings", model_name="nomic-embed-text",
)
collection = client.get_or_create_collection(
    name="my_notes", embedding_function=ollama_ef,
)

question = "What are the main causes of inflation?"

# 1. RETRIEVE
results = collection.query(query_texts=[question], n_results=3)
context = "\n\n".join(results["documents"][0])

# 2. AUGMENT
prompt = f"""Use ONLY the context below to answer the question.
If the answer is not in the context, say you don't know.
Context:
{context}
Question: {question}
Answer:"""

# 3. GENERATE
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": prompt}],
)
print(response["message"]["content"])
That is a complete, working, fully local RAG system. Everything (the embedding, the storage, the search, and the answer) happens on your machine.
Make It Interactive
To ask many questions without editing the file each time, wrap the last part in a loop that reads from your keyboard:
while True:
    question = input("\nAsk a question (or type 'quit'): ")
    if question.strip().lower() == "quit":
        break
    # Retrieve the closest chunks, then build the same grounded prompt as before
    results = collection.query(query_texts=[question], n_results=3)
    context = "\n\n".join(results["documents"][0])
    prompt = (
        "Use ONLY this context to answer. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\n\nAnswer:"
    )
    response = ollama.chat(model="llama3.2",
                           messages=[{"role": "user", "content": prompt}])
    print(response["message"]["content"])
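Now you have a private question-and-answer session with your own notes. Ask as many questions as you like, all offline and free. Note that this loop keeps no conversation history: each question is retrieved and answered on its own, so phrase every question so it makes sense without the previous one.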
Why the Answers Stay Grounded
Because the prompt contains only the chunks Chroma retrieved and tells the model to stick to them, its answers are anchored to your material. If you ask about something not in your documents and your instruction is doing its job, the model should tell you it does not know rather than inventing an answer. Small models do not follow this perfectly every time, so spot-check answers against your notes. That honesty is a feature: it is what makes a study helper trustworthy.
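You can also add a guard on the retrieval side. Chroma's query results include a "distances" list alongside "documents", where a smaller number means a closer match. The sketch below drops chunks that are too far from the question, so you can skip the model call when nothing relevant was found. Treat it as an illustration: filter_by_distance is a hypothetical helper, fake_results mimics the shape of a real result, and the 0.8 cutoff is a made-up value that you would need to tune for your own embedding model and documents.

```python
from typing import Dict, List


def filter_by_distance(results: Dict, max_distance: float) -> List[str]:
    """Keep only chunks whose distance to the question is within the cutoff."""
    documents = results["documents"][0]  # results for the first (only) query
    distances = results["distances"][0]  # smaller distance = closer match
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


# Made-up data in the same shape collection.query() returns.
fake_results = {
    "documents": [["Chunk about inflation", "Chunk about gardening"]],
    "distances": [[0.35, 1.40]],
}

kept = filter_by_distance(fake_results, max_distance=0.8)  # 0.8 is an assumed cutoff
print(kept if kept else "Nothing relevant found; skip the model call.")
# → ['Chunk about inflation']
```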
Key Takeaways
- A RAG query runs three steps: Retrieve the best chunks, Augment the prompt with them, Generate the answer.
- collection.query(query_texts=[question], n_results=3) embeds your question and returns the closest chunks automatically.
- The prompt should instruct the model to answer only from the context and to say "I don't know" when the answer is missing, which reduces made-up answers.
- ollama.chat(model="llama3.2", ...) sends the augmented prompt to your local model and returns a grounded answer.
- The entire loop runs on your machine, so it is private, free, and works offline.

