Generators and Iterators for Big Data
When a dataset is too big to fit in memory, or when you only need values one at a time, generators are the answer. They produce values lazily, on demand, instead of building a giant list up front. This is one of the most practical advanced-Python skills for data and AI work, where files routinely have millions of rows.
What You'll Learn
- What "lazy evaluation" means and why it matters
- How to write a generator with yield
- The difference between an iterable and an iterator
- How to chain generators into a memory-light pipeline
The Problem: Loading Everything at Once
Reading a huge file into a list loads every row into memory at the same time:
def read_all(path):
    rows = []
    with open(path) as f:
        for line in f:
            rows.append(line.strip())
    return rows  # the entire file now lives in memory
If the file is 5 GB, this can exhaust the available memory and crash or stall the program. A generator processes one line at a time and never needs to hold more than the current line in memory.
Writing a Generator with yield
Replace the list-building and return with yield and the function becomes a generator function. Each yield hands back one value and pauses the function until the next value is requested.
def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

for line in read_lines("big.log"):
    process(line)  # one line in memory at a time
Calling read_lines(...) does not run the body immediately. It returns a generator object. The code runs a little at a time, each time the loop asks for the next value. That is lazy evaluation.
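To make the "runs a little at a time" behaviour visible, here is a minimal sketch that records when the generator body actually executes. The names count_up and trace are hypothetical and used only for illustration; they are not part of the lesson's code.

```python
trace = []  # records what the generator body has done so far


def count_up(limit):
    trace.append("body started")       # runs on the first next(), not at call time
    for n in range(limit):
        trace.append(f"yielding {n}")
        yield n                        # pause here until the next value is requested


gen = count_up(2)                      # creates the generator object only
trace_after_call = list(trace)         # still empty: the body has not started

first = next(gen)                      # runs the body up to the first yield
trace_after_first = list(trace)

print(trace_after_call)
print(first, trace_after_first)
```

A for loop does the same thing behind the scenes: it calls next() repeatedly until the generator finishes.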
Iterable vs Iterator
This distinction clears up a lot of confusion:
A generator is an iterator: it yields each value once, then it's spent.
| Criteria | Iterable | Iterator |
|---|---|---|
| What it is | Something you can loop over | The thing that produces values one by one |
| Examples | list, tuple, dict, str, range | result of iter(...), a generator, an open file object |
| Can exhaust | No, loop it again | Yes, once consumed it's empty |
A key gotcha: a generator is consumed once. After you loop over it, it is empty. If you need the values twice, either re-create the generator or collect them into a list with list(gen).
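The difference is easy to check interactively. The sketch below uses a small list and a hypothetical squares generator (both chosen only for illustration) to show that an iterable can hand out fresh iterators, while an iterator or generator is spent after one pass.

```python
numbers = [1, 2, 3]             # a list is an iterable

it = iter(numbers)              # iter() returns a fresh iterator over the list
first = next(it)                # take one value
rest = list(it)                 # consume what is left
leftover = list(it)             # the iterator is now exhausted
again = list(numbers)           # the list itself can be looped again


def squares(n):
    for i in range(n):
        yield i * i


gen = squares(3)
first_pass = list(gen)          # consumes the generator
second_pass = list(gen)         # nothing left on the second pass
fresh_pass = list(squares(3))   # re-creating the generator starts over

print(first, rest, leftover, again)
print(first_pass, second_pass, fresh_pass)
print(iter(gen) is gen)         # an iterator returns itself from iter()
```

Keep in mind that list(gen) materializes every value, so it gives up the memory savings; it fits best when the result is known to be small.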
Building a Pipeline
The real power is chaining generators. Each stage stays lazy, so even a multi-step transform over a massive file uses almost no memory.
def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

def only_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

def to_upper(lines):
    for line in lines:
        yield line.upper()

# Nothing runs until the loop below starts consuming the pipeline:
pipeline = to_upper(only_errors(read_lines("app.log")))
for line in pipeline:
    print(line)
- read_lines: one line at a time
- only_errors: keep ERROR lines
- to_upper: transform
- consume: print / sum / save
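To try the pipeline without a real log file, the sketch below feeds it from an in-memory stream. This is an assumption made for the demo: read_lines here takes an already-open stream instead of a path, and the log content is invented. It also shows a second way to consume a pipeline, counting matches with sum.

```python
import io


def read_lines(stream):
    # Variant of read_lines that takes an open stream instead of a path
    for line in stream:
        yield line.strip()


def only_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line


def to_upper(lines):
    for line in lines:
        yield line.upper()


# Hypothetical log content standing in for app.log
sample = "INFO start\nERROR disk full\nINFO retry\nERROR timeout\n"

pipeline = to_upper(only_errors(read_lines(io.StringIO(sample))))
results = list(pipeline)        # fine here because the sample is tiny
print(results)

# Consuming with sum instead: count ERROR lines without storing them
error_count = sum(1 for _ in only_errors(read_lines(io.StringIO(sample))))
print(error_count)
```

Because every stage pulls one line at a time from the stage before it, the same code would work unchanged on a stream far larger than memory, as long as the final consumer does not collect everything into a list.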
Generator Expressions Recap
You met these last lesson. The parenthesized form is just an inline generator:
squares = (x * x for x in range(1_000_000)) # lazy, no big list
total = sum(squares)
Use the def ... yield form when the logic needs multiple lines or its own name; use the expression form for short one-liners.
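As a rough illustration of the memory difference, the sketch below compares a list comprehension with the equivalent generator expression. Note that sys.getsizeof reports only the size of the container object itself (for a list, its array of references), not the integers it points to, so the real gap is larger than the numbers printed. The size N is an arbitrary choice for the demo.

```python
import sys

N = 100_000

as_list = [x * x for x in range(N)]   # builds every value up front
as_gen = (x * x for x in range(N))    # builds nothing until consumed

list_size = sys.getsizeof(as_list)    # grows with N
gen_size = sys.getsizeof(as_gen)      # small and independent of N

total_from_list = sum(as_list)
total_from_gen = sum(as_gen)          # same result, one value at a time

print(list_size, gen_size)
print(total_from_list == total_from_gen)
```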
Try It
Run this. The countdown generator yields values lazily. Then add a second generator that doubles each value before printing.
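A minimal sketch of what this exercise might look like, assuming countdown takes a starting number and yields down to 1. The exact signature, and the doubled helper shown as one possible solution, are assumptions for illustration.

```python
def countdown(start):
    # Yield start, start - 1, ..., 1 lazily
    n = start
    while n > 0:
        yield n
        n -= 1


def doubled(values):
    # One possible second generator: double each incoming value
    for value in values:
        yield value * 2


for value in countdown(3):
    print(value)

doubled_values = list(doubled(countdown(3)))
print(doubled_values)
```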
Key Takeaways
- Generators produce values lazily with yield, holding one value at a time instead of a whole list.
- Calling a generator function returns a generator object; the body runs only as values are requested.
- An iterable can be looped repeatedly; an iterator (including a generator) is consumed once.
- Chaining generators builds memory-light data pipelines over huge files.
- Use def ... yield for multi-line logic and a generator expression (...) for short cases.

