Run a Local AI Model with Ollama
Before we can ask questions about your documents, we need an AI model running on your own machine. The easiest, most beginner-friendly way to do that is a free tool called Ollama. In this lesson you will install it, download a small model, and have your first conversation with an AI that runs entirely offline.
What You'll Learn
- What Ollama is and why it makes local AI simple
- How to install Ollama on your computer
- How to download (pull) a model and chat with it
- How to download an embedding model you will need later
- How to check that Ollama is running in the background
What Is Ollama?
Ollama is a free, open-source program that downloads AI models and runs them on your computer. Think of it as a simple manager for local models: you tell it which model you want, it downloads the file, and then you can chat with that model from your terminal or from your own code. There is no account to create and nothing is sent to the cloud.
Once Ollama is running, it quietly listens in the background at the address http://localhost:11434. That address only exists on your machine. Later in the course, our Python code will send questions to that local address instead of to any external service. That is the heart of what makes this "local."
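Once Ollama is installed (Step 1 below), that address can be probed from Python. The following is a minimal sketch using only the standard library; the helper name ollama_is_reachable and the 2-second timeout are illustrative choices, not part of Ollama itself.

```python
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # the local address described above


def ollama_is_reachable(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: nothing usable is listening.
        return False


if __name__ == "__main__":
    if ollama_is_reachable():
        print("Something is answering on", OLLAMA_URL)
    else:
        print("Nothing is answering on", OLLAMA_URL, "(is Ollama running?)")
```

The function returns False instead of raising, so a script can print a friendly hint when the service is not up yet.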
Step 1: Install Ollama
Go to ollama.com and download the installer for your operating system (Windows, macOS, or Linux). Run it like any normal application. The installer sets everything up for you.
To confirm it worked, open your terminal (Terminal on macOS/Linux, or PowerShell/Command Prompt on Windows) and type:
ollama --version
If you see a version number printed back, Ollama is installed and ready.
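The same check can be scripted. This sketch assumes only that the ollama executable is on your PATH; the function name ollama_version is hypothetical, and the exact text the command prints may differ between releases.

```python
import shutil
import subprocess
from typing import Optional


def ollama_version() -> Optional[str]:
    """Return the text printed by `ollama --version`, or None if unavailable."""
    exe = shutil.which("ollama")  # look for the executable on PATH
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Prefer stdout, fall back to stderr, and treat empty output as "unknown".
    return result.stdout.strip() or result.stderr.strip() or None


if __name__ == "__main__":
    version = ollama_version()
    print(version if version else "Ollama was not found on PATH.")
```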
Step 2: Pull and Run Your First Model
A "model" is the actual AI brain. We will use Llama 3.2, a small, capable model from Meta that runs comfortably on a normal laptop. The default version has about 3 billion parameters and is roughly a 2 GB download, so the first run may take a few minutes.
Run this single command:
ollama run llama3.2
The first time, Ollama downloads the model. After that, it opens an interactive chat prompt right in your terminal. Try typing a question:
>>> Explain photosynthesis in two sentences.
The model replies, generated entirely on your computer. To leave the chat, type /bye and press Enter.
That is a complete, private, offline AI assistant already. The rest of this course teaches it to answer using your documents.
Tip: If your laptop is older or has limited memory, you can pull an even smaller version:
ollama pull llama3.2:1b
It is faster and lighter, with slightly simpler answers. The 3B default is a good balance for most machines.
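The terminal chat has a counterpart in code. Below is a minimal sketch that sends one prompt to the local service, assuming Ollama's REST endpoint /api/generate with streaming turned off; the helper names build_generate_request and ask are illustrative, and the course's later lessons may use a different client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_generate_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a POST request asking the local model to answer one prompt."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.2") -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    try:
        print(ask("Explain photosynthesis in two sentences."))
    except OSError as error:
        # Covers connection failures and HTTP errors (for example, model not pulled).
        print("Could not get an answer from Ollama:", error)
```

With "stream" set to False the service returns a single JSON object, which keeps the example short; streaming replies would need to be read line by line instead.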
Step 3: Pull the Embedding Model
Generating answers is only half of RAG. To find the right pieces of your documents, we need a second, special kind of model called an embedding model. We cover what embeddings are in the next lesson; for now, just download the one we will use:
ollama pull nomic-embed-text
This is a small download. nomic-embed-text is built specifically to turn text into the numbers that make search possible. We will not chat with it; our code will call it behind the scenes.
After this command, you have both models you need for the whole course:
- llama3.2 to generate answers
- nomic-embed-text to embed (search) your documents
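To see what "calling it behind the scenes" might look like, here is a hedged sketch that asks nomic-embed-text for the vector of one sentence. It assumes a recent Ollama release that exposes the /api/embed endpoint (older releases used a different embeddings endpoint), and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build a POST request asking the embedding model to vectorize `text`."""
    payload = {"model": model, "input": text}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Return the embedding (a list of floats) for a single piece of text."""
    with urllib.request.urlopen(build_embed_request(text, model), timeout=60) as response:
        # The reply holds one vector per input; we sent one input, so take the first.
        return json.loads(response.read())["embeddings"][0]


if __name__ == "__main__":
    try:
        vector = embed("Ollama runs models on your own machine.")
        print("Got a vector with", len(vector), "numbers.")
    except OSError as error:
        print("Could not get an embedding from Ollama:", error)
```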
How the Pieces Fit Together
Here is the role each Ollama model plays in the system you are building.
Two models, two jobs: one finds the right context, the other writes the answer.
| Criteria | llama3.2 | nomic-embed-text |
|---|---|---|
| Job | Writes the final answer | Turns text into searchable numbers |
| Used when | You ask a question | Storing docs and matching your query |
| You chat with it? | Yes, directly | No, the code calls it |
| Pulled with | ollama run llama3.2 | ollama pull nomic-embed-text |
Step 4: Confirm Ollama Is Listening
Once Ollama is installed, it runs a small background service. Later, our Python code will talk to it at localhost:11434. You can confirm it is alive with this command:
ollama list
This prints the models you have downloaded. The command gets that list from the background service, so if it works at all, the service is running and ready for our code to connect. If you see llama3.2 and nomic-embed-text in the list, you are fully set up.
If you ever restart your computer and the code cannot reach the model, just make sure Ollama is open. On most systems it starts automatically; to start it manually, run:
ollama serve
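The same confirmation can be done from Python. This sketch assumes Ollama's /api/tags endpoint, which returns the downloaded models as JSON; the helper names and the REQUIRED tuple are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"
REQUIRED = ("llama3.2", "nomic-embed-text")  # the two models from this lesson


def missing_models(installed, required=REQUIRED):
    """Return the required models not present in `installed`.

    Tags such as ':latest' or ':1b' are ignored, so 'llama3.2:1b'
    counts as having llama3.2.
    """
    base_names = {name.split(":", 1)[0] for name in installed}
    return [name for name in required if name not in base_names]


def installed_models():
    """Ask the local service which models have been downloaded."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=5) as response:
        return [entry["name"] for entry in json.loads(response.read())["models"]]


if __name__ == "__main__":
    try:
        missing = missing_models(installed_models())
        print("All set." if not missing else f"Still need to pull: {missing}")
    except OSError as error:
        print("Could not reach Ollama:", error)
```

Keeping the comparison in its own function means it can be checked without a running service, for example missing_models(["llama3.2:latest"]) returns ["nomic-embed-text"].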
Why This Matters for Privacy
Every model you just downloaded now lives as a file on your disk. When you ask a question, the text goes from your code to localhost (your own machine) and back. It never travels over the internet. That single fact is what gives local RAG its privacy guarantee, and it is why this approach is so appealing for sensitive documents.
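One way to keep that property honest in your own scripts is to check that the address you are about to call really points at your machine. The sketch below is an optional safeguard, not something Ollama requires; is_local_url is a hypothetical helper name.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_url(url: str) -> bool:
    """Return True if url points at this machine (localhost or a loopback IP)."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is some other hostname.
        return False


if __name__ == "__main__":
    for url in ("http://localhost:11434", "https://api.example.com"):
        print(url, "->", "local" if is_local_url(url) else "NOT local")
```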
Key Takeaways
- Ollama is a free tool that downloads and runs AI models locally; install it from ollama.com.
- ollama run llama3.2 downloads and chats with a small, capable model. Type /bye to exit.
- ollama pull nomic-embed-text downloads the embedding model used to search your documents.
- Ollama listens at http://localhost:11434, an address that only exists on your machine, so nothing is uploaded.
- Use ollama list to confirm both models are downloaded and the service is running.

