Large Files and When to Reach Beyond Pandas
Pandas loads a whole file into memory by default, which fails when the file is larger than your available RAM. This lesson covers two strategies: processing a big file in chunks so it never all sits in memory at once, and recognizing when a dataset has outgrown Pandas entirely and a tool like Polars or DuckDB is the better fit.
What You'll Learn
- How to read a large CSV in chunks with chunksize
- How to aggregate across chunks without loading everything
- How to select only the columns and rows you need
- When to consider Polars or DuckDB instead of Pandas
Reading in Chunks
Passing chunksize to read_csv returns an iterator of smaller DataFrames instead of one giant frame. You process each piece and combine the small results, so peak memory stays low. The example below simulates the pattern with an in-memory CSV string.
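A minimal sketch of that pattern, assuming a small made-up sales table. The region and amount columns and the chunksize of 2 are illustrative choices; a real file would use a chunk size in the tens or hundreds of thousands of rows.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large file on disk.
csv_text = """region,amount
north,100
south,200
north,150
east,50
south,300
east,25
"""

partial_sums = []
# chunksize=2 makes read_csv yield DataFrames of at most 2 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Reduce each chunk to a small per-region summary, then let the chunk go.
    partial_sums.append(chunk.groupby("region")["amount"].sum())

# The same region can appear in several chunks, so group the partial
# sums by their index (the region) and add them together.
totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals)
```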
Each chunk is summarized and discarded, then the small summaries are combined. The full file is never held in memory at once.
Read Only What You Need
Before reaching for chunks, reduce what you load. usecols reads only specific columns, and dtype sets compact types as the data is read, both cutting memory immediately.
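As a rough illustration, the sketch below reads two of four columns from a made-up CSV and assigns compact types while parsing. The column names and the choice of category and float32 are assumptions for this example; pick types that suit your own data.

```python
import io

import pandas as pd

# Hypothetical sample data; imagine many more rows and columns.
csv_text = """order_id,region,amount,notes
1,north,100.5,first order
2,south,200.0,gift wrap
3,north,150.25,
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    # Only these columns are parsed; order_id and notes are skipped.
    usecols=["region", "amount"],
    # Repeated strings become a category, and float32 halves the
    # per-value size compared with the default float64.
    dtype={"region": "category", "amount": "float32"},
)

print(df.dtypes)
print(df.memory_usage(deep=True))
```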
Skipping unused columns is the simplest big-file win and often makes chunking unnecessary.
Filtering While Chunking
You can combine chunking with filtering to extract just the rows you care about from a huge file, building one modest result.
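One way this might look, again with a small made-up table. The amount threshold of 150 and the chunksize of 2 are arbitrary values chosen for the example.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a huge file.
csv_text = """region,amount
north,100
south,200
north,150
east,50
south,300
east,25
"""

matches = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Keep only the rows of interest; the rest of the chunk is dropped.
    matches.append(chunk[chunk["amount"] >= 150])

# The filtered pieces are small, so concatenating them is cheap.
big_orders = pd.concat(matches, ignore_index=True)
print(big_orders)
```

This only helps when the filtered result is itself small enough to fit in memory. If most rows pass the filter, the combined frame will be nearly as large as the original file.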
When to Reach Beyond Pandas
Chunking and dtype tuning carry Pandas a long way, but some datasets are simply too large or too slow for it. Two popular tools pick up where Pandas leaves off:
Pandas and two common tools for larger or faster workloads
| Criteria | Pandas | Polars | DuckDB |
|---|---|---|---|
| Best for | Data that fits in memory; the standard ecosystem | Fast in-memory work on larger data | SQL over files larger than memory |
| Interface | DataFrame API | DataFrame API (similar feel) | SQL queries |
| Parallelism | Mostly single-core | Multi-core by default | Multi-core, out-of-core |
Signals that it is time to consider an alternative:
- The data does not fit in memory even after dtype tuning and chunking.
- Operations are too slow because Pandas mostly uses a single CPU core.
- You would rather express the work as SQL over files on disk, which is DuckDB's strength.
You do not have to abandon Pandas to use them. Both Polars and DuckDB can read the same CSV and Parquet files and hand results back to Pandas, so a common pattern is to do the heavy filtering and aggregation in DuckDB or Polars, then bring a small result into a Pandas DataFrame for the final steps.
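The hand-off could be sketched as follows, assuming the optional duckdb package is installed (the import is guarded so the script still runs without it). The sales.csv file name and its columns are made up for the example, and the file is written to a temporary directory only so the sketch is self-contained.

```python
import os
import tempfile

try:
    import duckdb  # optional dependency, assumed installed via pip
except ImportError:
    duckdb = None

# Hypothetical sample data standing in for a large file on disk.
csv_text = """region,amount
north,100.0
south,200.0
north,150.0
east,50.0
south,300.0
east,25.0
"""

summary = None
if duckdb is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sales.csv")
        with open(path, "w") as f:
            f.write(csv_text)
        # DuckDB scans the file and aggregates it; only the small
        # grouped result is converted to a Pandas DataFrame by .df().
        summary = duckdb.sql(
            "SELECT region, SUM(amount) AS total "
            f"FROM read_csv_auto('{path}') "
            "GROUP BY region ORDER BY region"
        ).df()
    # From here on, summary is an ordinary Pandas DataFrame.
    summary["share"] = summary["total"] / summary["total"].sum()
    print(summary)
```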
Exercise: Read Selected Columns
Exercise: Sum Across Chunks
Key Points
- chunksize reads a file as an iterator of small DataFrames so peak memory stays low
- Aggregate or filter each chunk, then combine the small results
- usecols and dtype cut memory at read time and often avoid chunking entirely
- Consider Polars for fast multi-core in-memory work, or DuckDB for SQL over files larger than memory
- Polars and DuckDB interoperate with Pandas, so you can offload the heavy step and finish in Pandas

