Generators and Iterators for Big Data

When a dataset is too big to fit in memory, or when you only need values one at a time, generators are the answer. They produce values lazily, on demand, instead of building a giant list up front. This is one of the most practical advanced-Python skills for data and AI work, where files routinely have millions of rows.

What You'll Learn

What "lazy evaluation" means and why it matters
How to write a generator with yield
The difference between an iterable and an iterator
How to chain generators into a memory-light pipeline

The Problem: Loading Everything at Once

Reading a huge file into a list loads every row into memory at the same time:

def read_all(path):
    rows = []
    with open(path) as f:
        for line in f:
            rows.append(line.strip())
    return rows           # the entire file now lives in memory

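If the file is 5 GB, the list needs at least that much memory, and more in practice because every Python string carries per-object overhead. On a typical machine the program may fail with a MemoryError or slow to a crawl as the system starts swapping. A generator processes one line at a time, so your code holds only the current line rather than the whole file.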
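To get a rough feel for the difference, the sketch below compares a fully built list with a generator over the same range. It relies on sys.getsizeof, which reports only the container's own size and not the objects it refers to, so the list figure is a lower bound. The names N, as_list and as_gen are illustrative and do not come from the lesson.

```python
import sys

N = 1_000_000

as_list = [n for n in range(N)]   # every element exists right now
as_gen = (n for n in range(N))    # nothing has been computed yet

list_bytes = sys.getsizeof(as_list)  # container only, not the int objects
gen_bytes = sys.getsizeof(as_gen)    # small, and does not grow with N

print(f"list:      {list_bytes:,} bytes")
print(f"generator: {gen_bytes:,} bytes")
```

The exact numbers vary by Python version and platform, but the generator's size stays roughly constant while the list's grows with N.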

Writing a Generator with `yield`

Use yield instead of return and the function becomes a generator function. Each yield hands back one value and pauses the function, with its local variables intact, until the next value is requested.

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

for line in read_lines("big.log"):
    process(line)         # one line in memory at a time

Calling read_lines(...) does not run the body immediately. It returns a generator object. The code runs a little at a time, each time the loop asks for the next value. That is lazy evaluation.
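One way to see this for yourself is to record when the body actually runs. The sketch below is a made-up example (the numbered function and the events list are not part of the lesson) that logs each step, then checks the log after the call and after the first next().

```python
events = []   # records what the generator body has done so far

def numbered(limit):
    events.append("body started")        # runs on the first next(), not on the call
    for n in range(limit):
        events.append(f"yielding {n}")
        yield n
    events.append("body finished")

gen = numbered(2)                   # creates the generator object only
events_after_call = list(events)    # [] because the body has not started

first = next(gen)                   # runs the body up to the first yield
events_after_first = list(events)   # ["body started", "yielding 0"]

rest = list(gen)                    # drains the remaining values: [1]

print(events_after_call)
print(events_after_first)
print(events)
```

A for loop does the same thing behind the scenes: it calls next() repeatedly and stops when the generator is finished.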

Iterable vs Iterator

This distinction clears up a lot of confusion:

An iterable is anything you can loop over; an iterator is the object that actually produces the values, one at a time. A generator is an iterator: it yields each value once, then it's spent.

Criteria	Iterable	Iterator
What it is	Something you can loop over	The thing that produces values one by one
Examples	list, tuple, dict, str, range	result of iter(...), a generator, an open file object
Can exhaust	No, each loop starts over	Yes, once consumed it's empty

Every iterator is also iterable, which is why a generator works directly in a for loop. The reverse is not true: a list is iterable but is not itself an iterator.

A key gotcha: a generator is consumed once. After you loop over it, it is empty. If you need the values twice, either re-create the generator or collect them into a list with list(gen).
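The short sketch below shows the difference in practice. The squares function and the variable names are illustrative; iter() and next() are the standard built-ins.

```python
numbers = [1, 2, 3]                  # a list is an iterable

first_pass = list(numbers)           # [1, 2, 3]
second_pass = list(numbers)          # [1, 2, 3] again: each pass gets a fresh iterator

it = iter(numbers)                   # an iterator over the list
head = next(it)                      # 1
remaining = list(it)                 # [2, 3]
after_exhaustion = list(it)          # [] because the iterator is used up

def squares(values):
    for v in values:
        yield v * v

gen = squares(numbers)
once = list(gen)                     # [1, 4, 9]
twice = list(gen)                    # [] because the generator is spent

is_own_iterator = iter(gen) is gen   # True: an iterator returns itself from iter()
```

Calling squares(numbers) again creates a new generator, which is the usual fix when the values are needed a second time and are too large to keep in a list.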

Building a Pipeline

The real power is chaining generators. Each stage stays lazy, so even a multi-step transform over a massive file uses almost no memory.

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

def only_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

def to_upper(lines):
    for line in lines:
        yield line.upper()

# Building the pipeline reads nothing; work starts when the loop below consumes it:
pipeline = to_upper(only_errors(read_lines("app.log")))
for line in pipeline:
    print(line)

read_lines (one line at a time) -> only_errors (keep ERROR lines) -> to_upper (transform) -> consume (print / sum / save)
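Because each stage pulls from the one before it only when asked, stopping early also stops the reading. The sketch below swaps the file for a small in-memory stand-in, sample_lines, with made-up log lines, and records what was actually read in read_log; both names are assumptions added for the demo. itertools.islice takes the first two results and then stops.

```python
from itertools import islice

read_log = []   # records which source lines were actually pulled

def sample_lines():
    # Stand-in for read_lines(path); the data is invented for this demo.
    for line in ["INFO start", "ERROR disk full", "INFO retry",
                 "ERROR timeout", "ERROR again", "INFO done"]:
        read_log.append(line)
        yield line

def only_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

def to_upper(lines):
    for line in lines:
        yield line.upper()

pipeline = to_upper(only_errors(sample_lines()))
first_two = list(islice(pipeline, 2))   # stop after two matching lines

print(first_two)
print(len(read_log), "of 6 source lines were read")
```

Only the lines needed to produce two matches are pulled from the source, so the last two lines of the sample are never touched.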

Generator Expressions Recap

You met these in the last lesson. The parenthesized form is just an inline generator:

squares = (x * x for x in range(1_000_000))   # lazy, no big list
total = sum(squares)

Use the def ... yield form when the logic needs multiple lines or its own name; use the expression form for short one-liners.
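Generator expressions pair naturally with functions that consume an iterable, such as sum, max and any. The sketch below uses a small made-up list of log lines in place of a real file; when the expression is the only argument, the extra pair of parentheses can be dropped.

```python
lines = ["INFO start", "ERROR disk full", "INFO retry", "ERROR timeout"]

# Count matching lines without building an intermediate list.
error_count = sum(1 for line in lines if "ERROR" in line)

# Length of the longest line.
longest = max(len(line) for line in lines)

# any() stops pulling values as soon as it finds a match.
has_timeout = any("timeout" in line for line in lines)

print(error_count, longest, has_timeout)
```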

Try It

Run this. The countdown generator yields values lazily. Then add a second generator that doubles each value before printing.
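The playground's starter code is not reproduced in this text, so the following is only a plausible stand-in: a hypothetical countdown generator that yields start, start - 1, and so on down to 1. The values list is added here just to make the result easy to inspect.

```python
def countdown(start):
    """Yield start, start - 1, ..., 1 (hypothetical starter code)."""
    n = start
    while n > 0:
        yield n
        n -= 1

values = []
for value in countdown(3):
    values.append(value)
    print(value)
```

For the exercise, write a second generator that takes any iterable of numbers, and wrap countdown(3) in it the same way the pipeline stages wrap each other.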


Key Takeaways

Generators produce values lazily with yield, holding one value at a time instead of a whole list.
Calling a generator function returns a generator object; the body runs only as values are requested.
An iterable such as a list can be looped repeatedly; an iterator (including a generator) is consumed once.
Chaining generators builds memory-light data pipelines over huge files.
Use def ... yield for multi-line logic and a generator expression (...) for short cases.

