Vectorization vs Loops and apply
The single biggest speed difference in Pandas comes from how you compute a column. Looping row by row, or even using apply, runs Python code once per row. Vectorized operations push the work down into fast compiled code that processes the whole column at once. On large data, the gap is often orders of magnitude. This lesson shows the hierarchy of approaches and how to recognize when each belongs.
What You'll Learn
- Why vectorized operations are far faster than row loops
- How to rewrite a loop or apply as a vectorized expression
- How to vectorize conditional logic
- When apply is still a reasonable choice
The Speed Hierarchy
From slowest to fastest, the common ways to compute a new column are:
- iterrows loop: slowest
- apply(axis=1): faster
- Vectorized: fastest
The reason is overhead. A loop and apply both call into Python once per row, paying interpreter cost every time. A vectorized expression hands the entire column to optimized array code that runs in a single pass.
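One way to see the hierarchy is to time the three approaches on the same data. The sketch below is illustrative only: the column names a and b, the 20,000-row size, and the timed helper are assumptions made for the example, and exact timings will vary by machine and Pandas version.

```python
import time

import pandas as pd

# Hypothetical data: 20,000 rows with two integer columns.
df = pd.DataFrame({"a": range(20_000), "b": range(20_000)})


def timed(fn):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# 1. iterrows loop: builds a Series object for every row.
loop_result, loop_s = timed(
    lambda: [row["a"] * row["b"] for _, row in df.iterrows()]
)

# 2. apply(axis=1): still one Python function call per row.
apply_result, apply_s = timed(
    lambda: df.apply(lambda row: row["a"] * row["b"], axis=1)
)

# 3. Vectorized: one element-wise multiplication over whole columns.
vec_result, vec_s = timed(lambda: df["a"] * df["b"])

print(f"iterrows loop: {loop_s:.4f} s")
print(f"apply(axis=1): {apply_s:.4f} s")
print(f"vectorized:    {vec_s:.4f} s")
```

All three produce the same values; only the time taken to get there differs.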
The Slow Way: A Row Loop
A row loop with iterrows works, but it is the pattern to move away from on anything but tiny data.
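A minimal sketch of such a row loop, assuming a small DataFrame with hypothetical price and quantity columns (the names and values are placeholders, not taken from the lesson):

```python
import pandas as pd

# Hypothetical order data; column names are placeholders.
df = pd.DataFrame({"price": [10.0, 20.0, 5.5], "quantity": [3, 1, 4]})

totals = []
for _, row in df.iterrows():
    # Each iteration runs in the Python interpreter and
    # builds a Series object for the current row.
    totals.append(row["price"] * row["quantity"])

df["total"] = totals
print(df)
```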
The Fast Way: Vectorize
The same calculation can be written as a single column expression. Pandas multiplies the two columns element-wise in compiled code.
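As a sketch, again assuming hypothetical price and quantity columns, the whole computation becomes one line:

```python
import pandas as pd

# Hypothetical order data; column names are placeholders.
df = pd.DataFrame({"price": [10.0, 20.0, 5.5], "quantity": [3, 1, 4]})

# One expression over whole columns, with no explicit Python loop.
df["total"] = df["price"] * df["quantity"]
print(df)
```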
The result is identical, the code is shorter, and on a million rows it is dramatically faster.
Vectorizing Conditional Logic
A common reason people reach for apply is per-row if/else logic. You usually do not need it. np.where handles a single condition, and np.select handles several.
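The sketch below shows both functions on a hypothetical score column; the thresholds and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical exam scores.
df = pd.DataFrame({"score": [95, 72, 58, 81]})

# One condition: np.where(condition, value_if_true, value_if_false).
df["passed"] = np.where(df["score"] >= 60, "pass", "fail")

# Several conditions: np.select takes parallel lists of conditions
# and choices. The first matching condition wins, and default
# covers rows that match none of them.
conditions = [df["score"] >= 90, df["score"] >= 75, df["score"] >= 60]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, default="F")

print(df)
```

Because np.select picks the first condition that is true, order the conditions from most to least specific.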
Both run vectorized across the whole column, replacing what would otherwise be a slow per-row function.
String and Date Operations Vectorize Too
The .str and .dt accessors are vectorized. Use them instead of applying a Python function per row.
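A short sketch of both accessors, assuming hypothetical name and signup columns:

```python
import pandas as pd

# Hypothetical user data; column names and values are placeholders.
df = pd.DataFrame({
    "name": ["  alice ", "BOB", "Carol"],
    "signup": ["2024-01-15", "2024-06-30", "2023-12-01"],
})

# .str methods operate on the whole column and can be chained.
df["name_clean"] = df["name"].str.strip().str.title()

# Convert text to datetimes once, then use .dt to extract parts.
df["signup"] = pd.to_datetime(df["signup"])
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month

print(df)
```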
When apply Is Still Reasonable
apply is not forbidden. It is a fair choice when:
- The logic genuinely cannot be expressed with vectorized operations or np.where/np.select.
- The data is small enough that speed does not matter.
- You are calling an external function (like a custom parser) that only works on one value at a time.
Even then, prefer applying to a single Series (df['col'].apply(func)) over apply(axis=1) across rows, which is the slowest form.
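To illustrate the difference, the sketch below applies a one-value-at-a-time function both ways. The parse_code helper and the sku column are hypothetical stand-ins for an external parser.

```python
import pandas as pd


def parse_code(value: str) -> int:
    """Hypothetical single-value parser: 'SKU-0042' -> 42."""
    return int(value.split("-")[1])


df = pd.DataFrame({"sku": ["SKU-0042", "SKU-0007", "SKU-1300"]})

# Preferred: apply to the one column the function needs.
df["code"] = df["sku"].apply(parse_code)

# Same result, slower form: builds a row Series for every row.
df["code_rowwise"] = df.apply(lambda row: parse_code(row["sku"]), axis=1)

print(df)
```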
Exercise: Vectorize a Calculation
Exercise: Vectorize a Condition
Key Points
- Vectorized column expressions run in fast compiled code; row loops and apply run Python per row
- Rewrite arithmetic as direct column operations (df['a'] * df['b'])
- Use np.where for one condition and np.select for several instead of per-row if/else
- The .str and .dt accessors are vectorized; prefer them over apply
- Reserve apply for logic that truly cannot be vectorized, and prefer Series apply over apply(axis=1)

