Categorical Data Types
When a text column has only a handful of distinct values repeated many times, storing it as plain strings wastes memory and slows operations. The category dtype stores each unique value once and represents every row as a small integer code pointing to it. For columns like country, status, or product tier, this is one of the easiest performance wins in Pandas.
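To make the "unique values plus integer codes" idea concrete, here is a minimal sketch. The tiers Series and its values are invented for illustration; cat.categories and cat.codes are the standard accessors for the two parts of a categorical column.

```python
import pandas as pd

# Hypothetical data: five rows, two distinct values.
tiers = pd.Series(["free", "pro", "free", "pro", "free"], dtype="category")

# Each unique value is stored once...
categories = tiers.cat.categories.tolist()
# ...and every row holds a small integer pointing into that list.
codes = tiers.cat.codes.tolist()

print(categories)
print(codes)
```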
What You'll Learn
- When a column is a good candidate for the category dtype
- How to convert a column to category and measure the savings
- How ordered categories enable comparison and sorting
- The tradeoffs to keep in mind
Spotting a Good Candidate
A column is a good fit for category when it has low cardinality, meaning few unique values relative to the number of rows. Compare nunique() against the row count to judge.
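A minimal sketch of that check, assuming a hypothetical DataFrame with a status column of 18 rows (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical data: 18 rows drawn from three status values.
df = pd.DataFrame({"status": ["active", "inactive", "pending"] * 6})

n_rows = len(df)
n_unique = df["status"].nunique()
rows_per_value = n_rows / n_unique

print(n_unique)        # number of distinct statuses
print(rows_per_value)  # average rows per distinct value
```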
Six rows per unique value and only three distinct statuses make this an ideal candidate.
Converting and Measuring Savings
Convert with astype('category'). Then compare memory usage before and after with memory_usage(deep=True), which counts the full cost of Python string objects.
The category version stores three strings once and a compact array of integer codes, which is far smaller than thousands of repeated string objects.
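The comparison might look like the sketch below. The 3,000-row status column is an assumption made for illustration, and the exact byte counts depend on your pandas version and string storage, so none are quoted here.

```python
import pandas as pd

# Hypothetical data: 3,000 rows repeating three status strings.
df = pd.DataFrame({"status": ["active", "inactive", "pending"] * 1000})

# deep=True includes the cost of the string data itself.
before = df["status"].memory_usage(deep=True)

as_category = df["status"].astype("category")
after = as_category.memory_usage(deep=True)

print(f"plain strings: {before:,} bytes")
print(f"category:      {after:,} bytes")
print(as_category.cat.codes.dtype)  # integer type used for the codes
```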
Ordered Categories
Some categories have a natural order, such as size (small, medium, large) or rating. Declaring the order lets you compare and sort logically instead of alphabetically.
With an ordered category, df['size'] > 'small' compares by the declared order. A plain string column would compare alphabetically instead, which puts 'large' before 'small'.
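One way to declare the order is with pd.CategoricalDtype, as in this sketch. The size values are illustrative, and size_type is just a name chosen here.

```python
import pandas as pd

# Hypothetical data with a natural small < medium < large order.
df = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Declare the categories in their logical order.
size_type = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)
df["size"] = df["size"].astype(size_type)

# Comparison follows the declared order, not the alphabet.
bigger_than_small = df[df["size"] > "small"]["size"].tolist()

# Sorting follows the declared order too.
sorted_sizes = df.sort_values("size")["size"].tolist()

print(bigger_than_small)
print(sorted_sizes)
```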
Categories Work With GroupBy
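Grouping on a category column is fast. With observed=False, the long-standing default, the result reports every defined category even if some have no rows in the current data, which keeps reports consistent. Since pandas 2.1 that default is deprecated in favor of observed=True, so pass observed=False explicitly when you rely on this behavior.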
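A small sketch of the difference, using a hypothetical tier column in which one defined category ("enterprise") has no rows. The observed argument is passed explicitly in both calls so the result does not depend on which pandas version is installed.

```python
import pandas as pd

# Hypothetical tiers; "enterprise" is defined but has no rows.
tier_type = pd.CategoricalDtype(categories=["free", "pro", "enterprise"])
df = pd.DataFrame({
    "tier": pd.Series(["free", "pro", "free"], dtype=tier_type),
    "revenue": [0, 20, 0],
})

# observed=False keeps every defined category in the result.
all_tiers = df.groupby("tier", observed=False)["revenue"].sum()

# observed=True keeps only categories that actually appear.
seen_tiers = df.groupby("tier", observed=True)["revenue"].sum()

print(all_tiers)
print(seen_tiers)
```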
Tradeoffs to Keep in Mind
The category dtype is not free. Keep these in mind:
- It only saves memory when cardinality is low; a near-unique column (like an ID) can use more memory as a category.
- Adding a value outside the known categories requires registering it first, so highly dynamic columns are awkward.
- Some string operations need the values converted back, though most common ones work directly.
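The three tradeoffs above can be sketched as follows. All data and variable names are invented for illustration. The exception type raised for an unknown value has varied between pandas versions, so the sketch catches both TypeError and ValueError.

```python
import pandas as pd

status = pd.Series(["active", "pending", "active"], dtype="category")

# Assigning a value that is not a known category is rejected.
try:
    status.iloc[0] = "archived"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Register the new category first; then the assignment succeeds.
status = status.cat.add_categories(["archived"])
status.iloc[0] = "archived"

# String methods work through .str, but the result is not categorical.
upper = status.str.upper()

# Near-unique values: every ID is its own category, so the codes are
# stored on top of the full set of strings.
ids = pd.Series([f"user-{i:06d}" for i in range(10_000)])
ids_plain = ids.memory_usage(deep=True)
ids_category = ids.astype("category").memory_usage(deep=True)

print(rejected)
print(upper.tolist())
print(ids_plain, ids_category)
```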
Key Points
- The category dtype stores each unique value once with compact integer codes
- It shines on low-cardinality columns like status, country, or tier
- Measure savings with memory_usage(deep=True) before and after converting
- Ordered categories enable logical comparison and sorting
- Avoid category on near-unique columns; it can use more memory than plain strings

