Collision Resolution Techniques (Chaining, Open Addressing)
In the previous topic, we learned that a collision occurs when two different keys, after being processed by the hash function, map to the same index (or "bucket") in the hash table's underlying array. While a good hash function aims to minimize collisions, they are mathematically unavoidable in most practical scenarios.
Unhandled collisions can drastically degrade a hash table's performance from its ideal O(1) average-case to O(N) in the worst-case (where N is the number of elements), essentially turning it into a linked list or an array that requires linear searching. Therefore, effective collision resolution strategies are vital for maintaining hash table efficiency.
There are two primary categories of collision resolution: Chaining and Open Addressing.
1. Chaining (Open Hashing)
Chaining is the most straightforward and commonly used method for collision resolution.
- Concept: Instead of storing a single key-value pair directly at each array index, each index (or "bucket") of the hash table stores a pointer to a linked list (or sometimes another dynamic data structure, such as a dynamic array or a balanced binary search tree, though a linked list is most common). When a collision occurs, the new key-value pair is simply added to the end of the linked list at the computed index.
- Illustration:

```
Hash Table Array:
Index 0: -> Node(Key: "apple",  Value: 10)
Index 1: -> Node(Key: "banana", Value: 20) -> Node(Key: "orange", Value: 30)   (collision: "orange" also hashed to 1)
Index 2: -> Node(Key: "grape",  Value: 40)
Index 3: -> null
```
- Operations:
  - Insertion (put/set): Calculate the hash index. Add the new key-value pair to the linked list at that index.
  - Retrieval (get): Calculate the hash index. Traverse the linked list at that index, comparing the key of each node until a match is found.
  - Deletion (delete): Calculate the hash index. Traverse the linked list at that index, find the matching key, and remove the node from the list.
- Pros:
- Simple to implement.
- Handles many collisions gracefully: The table never "fills up" as long as there's memory for new nodes. It can even hold more elements than the number of buckets.
- Less sensitive to hash function quality: While a bad hash function degrades performance, it won't cause the table to fail or require complex rehashing as frequently.
- Deletion is straightforward (standard linked list deletion).
- Cons:
- Memory Overhead: Each node in the linked list requires extra memory for its pointer.
- Worst-Case Performance: If all keys hash to the same bucket (due to a terrible hash function or malicious input), the hash table degrades to a simple linked list, making all operations O(N).
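The chaining operations described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: each bucket is a plain Python list of (key, value) pairs rather than a true linked list, which keeps the idea identical while staying concise, and the class and method names are illustrative.

```python
class ChainedHashTable:
    """Minimal hash table using chaining (separate lists per bucket)."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # empty bucket or collision: append to the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:  # scan this bucket's chain
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]            # standard list/node removal
                return
        raise KeyError(key)


t = ChainedHashTable(num_buckets=2)      # tiny table to force collisions
t.put("apple", 10)
t.put("banana", 20)
t.put("orange", 30)
print(t.get("orange"))                   # 30, found by scanning its bucket's chain
```

Note that with only two buckets and three keys, at least one bucket must chain two entries, yet every key remains retrievable.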
2. Open Addressing (Closed Hashing)
Open addressing resolves collisions by finding an alternative empty slot within the hash table itself, rather than creating an external data structure.
- Concept: If the calculated hash index is already occupied, the algorithm "probes" (searches) for the next available empty slot in the array using a predefined sequence.
- Requirement: For open addressing to work, the hash table's underlying array must always have empty slots; typically, its load factor (number of elements / number of buckets) should be less than 1 (e.g., usually kept below 0.7 or 0.8 to ensure good performance and available slots).
There are several common probing strategies:
- Linear Probing:
  - Concept: When a collision occurs at index i, it sequentially checks (i + 1) % arraySize, then (i + 2) % arraySize, and so on, until an empty slot is found.
  - Problem: Leads to primary clustering, where groups of occupied slots form long "runs." This means that even if the hash function distributes keys well, a collision can start a long sequence of probes, making subsequent insertions and lookups slower for other keys that happen to hash into or near that cluster.
- Quadratic Probing:
  - Concept: When a collision occurs at index i, it probes (i + 1^2) % arraySize, then (i + 2^2) % arraySize, (i + 3^2) % arraySize, and so on.
  - Problem: Helps alleviate primary clustering but can lead to secondary clustering, where keys that hash to the same initial index follow the same probe sequence.
- Double Hashing:
  - Concept: Uses a second, independent hash function to determine the step size for probing. If a collision occurs at index i, it probes (i + hash2(key)) % arraySize, then (i + 2 * hash2(key)) % arraySize, and so on.
  - Benefit: Greatly reduces both primary and secondary clustering, offering the best performance among open addressing schemes in many cases.
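The three probe sequences are easy to compare side by side. The sketch below computes the first few probe indices for each strategy; the function names are illustrative, and the fixed step of 4 passed to the double-hashing variant stands in for a hash2(key) value.

```python
def linear_probes(start, size, n=5):
    """First n probe indices after a collision at `start` (linear probing)."""
    return [(start + j) % size for j in range(1, n + 1)]

def quadratic_probes(start, size, n=5):
    """First n probe indices using quadratic offsets 1^2, 2^2, 3^2, ..."""
    return [(start + j * j) % size for j in range(1, n + 1)]

def double_hash_probes(start, step, size, n=5):
    """First n probe indices with a per-key step size (step = hash2(key))."""
    return [(start + j * step) % size for j in range(1, n + 1)]

size = 11
print(linear_probes(3, size))          # [4, 5, 6, 7, 8]  -- a contiguous run
print(quadratic_probes(3, size))       # [4, 7, 1, 8, 6]  -- offsets 1, 4, 9, 16, 25 mod 11
print(double_hash_probes(3, 4, size))  # [7, 0, 4, 8, 1]  -- step depends on the key
```

The contiguous run produced by linear probing is exactly the primary clustering problem; the other two sequences spread the probes across the table.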
- Illustration (Linear Probing):

```
Array (Capacity 5):        [ _ , _ , _ , _ , _ ]
Insert("apple", index 1):  [ _ , "apple" , _ , _ , _ ]
Insert("banana", index 3): [ _ , "apple" , _ , "banana" , _ ]
Insert("orange", index 1): collision at 1, probe to 2
                           [ _ , "apple" , "orange" , "banana" , _ ]
```
- Operations:
  - Insertion (put/set): Calculate the hash index. If occupied, follow the probe sequence until an empty slot is found. Insert the key-value pair there.
  - Retrieval (get): Calculate the hash index. Follow the same probe sequence. If the key is found, return its value. If an empty slot is encountered before the key, the key is not in the table.
  - Deletion (delete): This is tricky. Simply removing an element can break the probe chain for other elements. Instead, the slot is often marked as "deleted" (a tombstone marker). Such a slot can be reused for future insertions but is skipped during lookups.
- Pros:
- Better Cache Performance: Data is stored contiguously in memory, which can lead to better cache utilization.
- No Pointer Overhead: No extra memory needed for linked list nodes.
- Cons:
- Fixed Size Issue: Requires resizing (rehashing) when the load factor gets too high.
- Sensitive to Load Factor: Performance degrades rapidly as the table fills up due to increased probing.
- Complex Deletion: Requires "tombstone" markers, which adds complexity and can degrade lookup performance over time.
- Clustering: Can suffer from various forms of clustering if not handled with sophisticated probing (like double hashing).
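The insertion, lookup, and tombstone-deletion rules above can be sketched as a small linear-probing table. This is a minimal illustration under simplifying assumptions: the capacity is fixed (a real table would rehash when full), and _TOMBSTONE is an arbitrary sentinel object.

```python
_EMPTY = None
_TOMBSTONE = object()   # sentinel marking a deleted slot

class LinearProbingTable:
    """Minimal open-addressing hash table with linear probing and tombstones."""

    def __init__(self, capacity=8):
        self.slots = [_EMPTY] * capacity

    def _probe(self, key):
        """Yield slot indices in linear-probe order, starting at the hash index."""
        start = hash(key) % len(self.slots)
        for j in range(len(self.slots)):
            yield (start + j) % len(self.slots)

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:                     # end of probe chain: insert
                self.slots[i if first_free is None else first_free] = (key, value)
                return
            if slot is _TOMBSTONE:
                if first_free is None:
                    first_free = i                 # reusable, but keep probing for the key
            elif slot[0] == key:
                self.slots[i] = (key, value)       # key exists: update in place
                return
        if first_free is not None:                 # table full of entries + tombstones
            self.slots[first_free] = (key, value)
            return
        raise RuntimeError("table full; a real implementation would rehash here")

    def get(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:                     # an empty slot ends the probe chain
                break
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:
                break
            if slot is not _TOMBSTONE and slot[0] == key:
                self.slots[i] = _TOMBSTONE         # mark, don't clear: keeps chains intact
                return
        raise KeyError(key)
```

Note why `delete` writes a tombstone instead of `_EMPTY`: if two keys probed through the same slot, clearing it would make `get` stop early and wrongly report the second key as absent.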
3. Load Factor and Rehashing
Regardless of the collision resolution strategy, the load factor is a crucial metric for hash table performance.
- Definition: Load Factor (α) = Number of Elements (N) / Number of Buckets (M)
- Importance:
- A higher load factor means more elements per bucket on average, leading to more collisions and longer probe sequences (for open addressing) or longer linked lists (for chaining).
- For chaining, load factors can exceed 1, but performance starts to degrade noticeably above 0.7-1.0.
- For open addressing, load factor must be less than 1 (you need empty slots to probe into), and performance often degrades significantly above 0.5-0.7.
- Resizing (Rehashing): To maintain efficient O(1) average-case performance, when the load factor exceeds a certain threshold (e.g., 0.7 for chaining, 0.5 for open addressing), the hash table undergoes rehashing:
  - A new, larger array (typically double the size) is created.
  - All existing key-value pairs from the old table are re-inserted into the new table (because the arraySize has changed, their hash indices will likely change).
  - The old table is discarded.
Rehashing is an O(N) operation (where N is the number of elements), but since it happens relatively infrequently, the amortized average time complexity for operations remains O(1).
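The resizing steps can be sketched by extending the chaining idea with a load-factor check. This is an illustrative sketch: the 0.7 threshold and the doubling factor are the example values mentioned above, not universal constants.

```python
MAX_LOAD = 0.7   # illustrative threshold from the discussion above

class ResizingChainedTable:
    """Chained hash table that doubles its bucket array when α > MAX_LOAD."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0

    def load_factor(self):
        return self.count / len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.load_factor() > MAX_LOAD:
            self._rehash()

    def _rehash(self):
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(2 * len(old_buckets))]  # double the size
        for bucket in old_buckets:
            for key, value in bucket:
                # Re-insert every pair: since len(self.buckets) changed,
                # hash(key) % len(self.buckets) likely maps to a new index.
                self.buckets[hash(key) % len(self.buckets)].append((key, value))


t = ResizingChainedTable(num_buckets=4)
for i in range(10):
    t.put(f"key{i}", i)
print(len(t.buckets))   # 16: the table doubled twice (4 -> 8 -> 16) as α crossed 0.7
```

Each individual `put` that triggers `_rehash` costs O(N), but doubling means rehashes become exponentially rarer as the table grows, which is why the amortized cost per operation stays O(1).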
Key Takeaway: Chaining uses external lists at each bucket and is robust for high load factors. Open addressing finds alternative slots within the array via probing but is sensitive to load factor and deletion complexity. Both methods rely on careful load factor management and rehashing to ensure efficient O(1) average-case performance.

