DISTINCT - Removing Duplicates
Introduction
When querying data, you often get duplicate rows. For example:
-- Get all customer IDs from orders
SELECT customer_id FROM orders;
Result:
customer_id
-------------
1
1 ← Duplicate
1 ← Duplicate
2
3
3 ← Duplicate
If you want only unique values, use DISTINCT!
DISTINCT Syntax
SELECT DISTINCT column1, column2, ...
FROM table
WHERE condition;
Basic DISTINCT Example
-- Unique customer IDs who placed orders
SELECT DISTINCT customer_id
FROM orders;
Result:
customer_id
-------------
1
2
3
Duplicates removed!
DISTINCT on Single Column
Unique Categories
SELECT DISTINCT category
FROM products;
Result:
category
--------------
Electronics
Books
Clothing
Unique Order Dates
SELECT DISTINCT DATE(order_date) AS order_date
FROM orders
ORDER BY order_date DESC;
Result:
order_date
------------
2024-12-20
2024-12-19
2024-12-18
...
DISTINCT on Multiple Columns
When using DISTINCT on multiple columns, it considers the combination of values.
SELECT DISTINCT category, price
FROM products;
Returns unique (category, price) pairs, not unique categories and unique prices separately.
Example
CREATE TABLE products_sample (
name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10,2)
);
INSERT INTO products_sample VALUES
('Laptop A', 'Electronics', 999.99),
('Laptop B', 'Electronics', 999.99), -- Same category + price
('Mouse A', 'Electronics', 25.99),
('Book A', 'Books', 19.99),
('Book B', 'Books', 19.99); -- Same category + price
SELECT DISTINCT category, price
FROM products_sample;
Result:
category | price
--------------+--------
Electronics | 999.99 ← Only one row for (Electronics, 999.99)
Electronics | 25.99
Books | 19.99 ← Only one row for (Books, 19.99)
DISTINCT vs GROUP BY
They Can Produce the Same Result
-- Using DISTINCT
SELECT DISTINCT category
FROM products;
-- Using GROUP BY
SELECT category
FROM products
GROUP BY category;
Both return unique categories.
When to Use Each
Use DISTINCT when:
- You only need unique values
- No aggregation required
- Simpler syntax
Use GROUP BY when:
- You need aggregates (COUNT, SUM, AVG, etc.)
- Grouping with calculations
Example: COUNT with GROUP BY
-- ❌ Can't do this with DISTINCT
SELECT category, COUNT(*) AS product_count
FROM products
GROUP BY category;
-- ✅ DISTINCT doesn't support aggregates
SELECT DISTINCT category
FROM products;
-- No way to add COUNT(*) here!
COUNT(DISTINCT column)
Count unique values in a column:
-- How many unique customers placed orders?
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;
Result:
unique_customers
------------------
87
Multiple COUNT(DISTINCT)
SELECT
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
COUNT(DISTINCT DATE(order_date)) AS days_with_orders
FROM orders;
Result:
total_orders | unique_customers | days_with_orders
--------------+------------------+------------------
1000 | 87 | 124
DISTINCT with JOINs
Unique Customers Who Ordered
SELECT DISTINCT c.id, c.name
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id;
Alternative (often faster):
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders WHERE customer_id = c.id
);
Unique Products Sold
SELECT DISTINCT p.name, p.category
FROM products p
INNER JOIN order_items oi ON p.id = oi.product_id;
Caution with Multiple Joins
-- ❌ May not give expected results
SELECT DISTINCT c.name
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
INNER JOIN order_items oi ON o.id = oi.order_id;
-- Joins can create duplicates; DISTINCT removes them
-- But can be slow on large datasets!
-- ✅ Better: Use EXISTS
SELECT c.name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders WHERE customer_id = c.id
);
DISTINCT in Subqueries
Finding Customers Who Ordered
SELECT name
FROM customers
WHERE id IN (
SELECT DISTINCT customer_id
FROM orders
);
Note: DISTINCT inside IN subquery is often redundant (IN already handles duplicates).
-- These are equivalent:
WHERE id IN (SELECT DISTINCT customer_id FROM orders)
WHERE id IN (SELECT customer_id FROM orders)
DISTINCT with NULL
DISTINCT treats all NULL values as identical.
CREATE TABLE test_nulls (
value INTEGER
);
INSERT INTO test_nulls VALUES (1), (2), (NULL), (NULL), (1);
SELECT DISTINCT value
FROM test_nulls;
Result:
value
-------
1
2
NULL ← Only one NULL
DISTINCT with ORDER BY
Ordering Unique Results
SELECT DISTINCT category
FROM products
ORDER BY category;
Result:
category
--------------
Books
Clothing
Electronics
ORDER BY Non-Selected Columns
-- ❌ Error in most databases
SELECT DISTINCT name
FROM products
ORDER BY price DESC;
-- Error: ORDER BY column must appear in SELECT DISTINCT
-- ✅ Include price in SELECT
SELECT DISTINCT name, price
FROM products
ORDER BY price DESC;
Exception: PostgreSQL allows this if the table has a unique key, but it's not portable.
Performance Considerations
DISTINCT Can Be Expensive
-- Slow on large tables (requires sorting or hashing)
SELECT DISTINCT customer_id
FROM orders; -- 1 million rows
PostgreSQL uses:
- Hash aggregation (faster, more memory)
- Sort-based (slower, less memory)
DISTINCT vs EXISTS
-- ❌ Slower: DISTINCT after JOIN
SELECT DISTINCT c.name
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id;
-- ✅ Faster: EXISTS
SELECT c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id);
DISTINCT vs GROUP BY Performance
-- Usually similar performance
SELECT DISTINCT category FROM products;
SELECT category FROM products GROUP BY category;
Check with EXPLAIN ANALYZE for your specific query!
DISTINCT ON (PostgreSQL-Specific)
DISTINCT ON returns the first row for each unique value of specified columns.
Syntax
SELECT DISTINCT ON (column1, column2)
column1, column2, other_columns
FROM table
ORDER BY column1, column2, sort_column;
Example: Latest Order per Customer
SELECT DISTINCT ON (customer_id)
customer_id,
order_date,
total
FROM orders
ORDER BY customer_id, order_date DESC;
Result:
customer_id | order_date | total
-------------+------------+--------
1 | 2024-12-20 | 157.50 ← Most recent for customer 1
2 | 2024-12-18 | 87.25 ← Most recent for customer 2
3 | 2024-12-15 | 247.00 ← Most recent for customer 3
How it works:
- Rows are sorted by
customer_id, order_date DESC - For each
customer_id, only the first row (most recent) is returned
Another Example: Cheapest Product per Category
SELECT DISTINCT ON (category)
category,
name,
price
FROM products
ORDER BY category, price ASC;
Result:
category | name | price
--------------+---------------+-------
Books | SQL Basics | 9.99 ← Cheapest book
Clothing | T-Shirt | 14.99 ← Cheapest clothing
Electronics | USB Cable | 5.99 ← Cheapest electronics
Note: DISTINCT ON is PostgreSQL-specific. Other databases use ROW_NUMBER() or subqueries.
DISTINCT ON Alternative (Portable)
-- Works in all databases
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM orders
)
SELECT customer_id, order_date, total
FROM ranked
WHERE rn = 1;
Common DISTINCT Patterns
Pattern 1: Unique Values Report
-- How many unique values in each column?
SELECT
COUNT(DISTINCT category) AS unique_categories,
COUNT(DISTINCT name) AS unique_names,
COUNT(DISTINCT price) AS unique_prices
FROM products;
Pattern 2: Data Quality Check
-- Find duplicate names
SELECT name, COUNT(*)
FROM products
GROUP BY name
HAVING COUNT(*) > 1;
Pattern 3: Active Users
-- Unique users who logged in last 30 days
SELECT DISTINCT user_id
FROM login_logs
WHERE login_date >= CURRENT_DATE - INTERVAL '30 days';
Pattern 4: Product Variety per Order
-- How many distinct products in each order?
SELECT
order_id,
COUNT(DISTINCT product_id) AS unique_products
FROM order_items
GROUP BY order_id;
DISTINCT vs UNION
UNION Automatically Removes Duplicates
SELECT name FROM customers WHERE state = 'CA'
UNION
SELECT name FROM customers WHERE state = 'NY';
-- Duplicates removed automatically!
UNION ALL Keeps Duplicates
SELECT name FROM customers WHERE state = 'CA'
UNION ALL
SELECT name FROM customers WHERE state = 'NY';
-- Duplicates kept (faster than UNION)
Common Mistakes
Mistake 1: DISTINCT on All Columns When Not Needed
-- ❌ Redundant if id is unique
SELECT DISTINCT id, name, email
FROM customers;
-- ✅ Just SELECT (id is already unique)
SELECT id, name, email
FROM customers;
Mistake 2: Using DISTINCT to Fix Bad JOINs
-- ❌ Bad: Using DISTINCT to hide JOIN issues
SELECT DISTINCT c.name
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
INNER JOIN order_items oi ON o.id = oi.order_id;
-- Better to understand why duplicates occur!
-- ✅ Better: Use appropriate JOIN or EXISTS
SELECT c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id);
Mistake 3: DISTINCT with Aggregates
-- ❌ Doesn't make sense
SELECT DISTINCT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
-- GROUP BY already ensures unique customer_id!
-- ✅ No DISTINCT needed
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
Practice Exercises
Exercise 1
Find all unique categories in the products table.
Solution
SELECT DISTINCT category
FROM products
ORDER BY category;
Exercise 2
Count how many unique customers placed orders.
Solution
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;
Exercise 3
Find customers who placed orders, showing each customer only once.
Solution
-- Option 1: DISTINCT
SELECT DISTINCT c.id, c.name
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
ORDER BY c.name;
-- Option 2: EXISTS (often faster)
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)
ORDER BY c.name;
Exercise 4
For each customer, find their most recent order (PostgreSQL DISTINCT ON).
Solution
-- PostgreSQL DISTINCT ON
SELECT DISTINCT ON (customer_id)
customer_id,
order_date,
total
FROM orders
ORDER BY customer_id, order_date DESC;
-- Portable solution (all databases)
WITH ranked AS (
SELECT
customer_id,
order_date,
total,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
FROM orders
)
SELECT customer_id, order_date, total
FROM ranked
WHERE rn = 1;
Key Takeaways
- ✅ DISTINCT removes duplicate rows from result sets
- ✅ Works on single or multiple columns (combination of values)
- ✅ COUNT(DISTINCT column) counts unique values
- ✅ DISTINCT vs GROUP BY: DISTINCT for unique values, GROUP BY for aggregates
- ✅ Can be expensive on large datasets (requires sorting/hashing)
- ✅ EXISTS often faster than DISTINCT with JOINs
- ✅ DISTINCT ON (PostgreSQL) returns first row per group
- ✅ All NULLs treated as identical by DISTINCT
- ✅ UNION removes duplicates, UNION ALL keeps them
Next Steps
You've mastered DISTINCT! In the next lesson, we'll learn LIMIT and OFFSET—controlling result set size and implementing pagination for efficient data retrieval.
Almost done with Module 6!

