Hash Tables
Key-value structure enabling expected O(1) operations.
1. Hash functions and properties
The foundation of hash table performance:
What is a hash function?
- Definition: Function that maps keys to array indices
- Signature:
hash(key) → integer in range [0, table_size-1]
- Goal: Distribute keys uniformly across the table
Good hash function properties:
- Deterministic: Same key always produces same hash
- Uniform distribution: Keys spread evenly across table
- Fast to compute: O(1) or O(k) where k = key length
- Minimize collisions: Different keys rarely map to same index
- Avalanche effect: Small key change → large hash change (see the sketch below)
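One way to see the avalanche property concretely is to hash two keys that differ by a single character and count how many output bits change. A small sketch using hashlib's SHA-256 purely as an illustration (table hash functions are usually far cheaper, but the idea is the same):
import hashlib
def digest_bits(s):
    # SHA-256 digest of s, rendered as a string of 256 bits
    h = hashlib.sha256(s.encode()).digest()
    return ''.join(f'{byte:08b}' for byte in h)
a, b = digest_bits("hello"), digest_bits("hellp")
flipped = sum(x != y for x, y in zip(a, b))
print(f"{flipped} of {len(a)} bits differ")  # typically close to half of the 256 bits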
Common hash functions:
1. Division method:
def hash_division(key, table_size):
return key % table_size
# Best when table_size is prime
# Example: key=42, size=13 → 42 % 13 = 3
2. Multiplication method:
def hash_multiplication(key, table_size):
A = 0.6180339887 # (√5 - 1) / 2, the reciprocal of the golden ratio (Knuth's suggested constant)
return int(table_size * ((key * A) % 1))
# Works well with any table size
3. Universal hashing:
import random
class UniversalHash:
def __init__(self, table_size):
self.table_size = table_size
self.p = 2**31 - 1 # Large prime
self.a = random.randint(1, self.p - 1)
self.b = random.randint(0, self.p - 1)
def hash(self, key):
return ((self.a * key + self.b) % self.p) % self.table_size
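Why randomize a and b? The standard universal-hashing guarantee is that for any two distinct integer keys, the probability of a collision is at most about 1/table_size, no matter how the keys were chosen. A quick usage sketch of the class above (integer keys assumed):
uh = UniversalHash(table_size=13)
print(uh.hash(42), uh.hash(1042))  # two slots in the range [0, 12]
# Creating a new instance draws fresh a and b, so the key → slot mapping changes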
4. String hashing (polynomial rolling hash):
def hash_string(s, table_size):
hash_val = 0
p = 31 # Prime base
p_pow = 1
for char in s:
hash_val = (hash_val + ord(char) * p_pow) % table_size
p_pow = (p_pow * p) % table_size
return hash_val
# Example: "hello" → hash value
5. Python’s built-in hash:
# For integers
hash(42) % table_size
# For strings
hash("hello") % table_size
# For tuples (immutable)
hash((1, 2, 3)) % table_size
# Note: Python randomizes str/bytes hashes per interpreter session for security (PYTHONHASHSEED)
Handling non-integer keys:
def hash_any_key(key, table_size):
# Convert to integer first
if isinstance(key, str):
return hash_string(key, table_size)
elif isinstance(key, int):
return key % table_size
elif isinstance(key, tuple):
# Combine hashes of elements
h = 0
for item in key:
h = (h * 31 + hash(item)) % table_size
return h
else:
return hash(key) % table_size
Bad hash function example:
# BAD: Only uses first character
def bad_hash(s, table_size):
return ord(s[0]) % table_size
# Problem: "apple", "ant", "arrow" all hash to same value
# Causes many collisions!
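To make the difference visible, the sketch below buckets a few words with both functions (assuming bad_hash and the hash_string function defined earlier are in scope) and prints how the buckets fill up:
from collections import Counter
words = ["apple", "ant", "arrow", "avocado", "banana", "cherry", "date", "fig"]
size = 8
print(dict(Counter(bad_hash(w, size) for w in words)))      # the 'a' words all pile into one slot
print(dict(Counter(hash_string(w, size) for w in words)))   # typically spread across several slots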
2. Load factor and resizing
Managing table capacity and performance:
Load factor (α):
- Definition: α = n / m where n = elements, m = table size
- Meaning: Average number of elements per slot
- Impact: Higher α → more collisions → slower operations
Typical thresholds:
- Chaining: Resize when α > 0.75
- Open addressing: Resize when α > 0.5–0.7 to avoid clustering (CPython's dict, which uses open addressing, resizes at about 2/3)
Resizing strategy:
class HashTable:
def __init__(self, initial_size=16):
self.size = initial_size
self.count = 0
self.table = [[] for _ in range(self.size)] # Chaining
def load_factor(self):
return self.count / self.size
def resize(self):
# Double the size (use next prime for better distribution)
old_table = self.table
self.size *= 2
self.table = [[] for _ in range(self.size)]
self.count = 0
# Rehash all elements
for bucket in old_table:
for key, value in bucket:
self.insert(key, value)
    def insert(self, key, value):
        if self.load_factor() > 0.75:
            self.resize()
        index = hash(key) % self.size
        bucket = self.table[index]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # Key already present: update in place
                return
        bucket.append((key, value))
        self.count += 1
Amortized analysis of resizing:
Operations: n insertions
Resizes at: 1, 2, 4, 8, 16, ..., n
Total rehashing cost: 1 + 2 + 4 + ... + n = 2n - 1 = O(n)
Amortized cost per insertion: O(n) / n = O(1)
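A rough empirical check of this argument: simulate n insertions into a table that doubles whenever it fills, and count how many elements get moved during resizes. The total stays below 2n:
def total_rehash_moves(n, initial_size=1):
    size, count, moves = initial_size, 0, 0
    for _ in range(n):
        if count == size:     # table full: double it and move every existing element
            moves += count
            size *= 2
        count += 1
    return moves
print(total_rehash_moves(10_000))  # well under 2 * 10_000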
Shrinking strategy:
    def delete(self, key):
        # Remove the key from its bucket (same lookup as in search), then:
        self.count -= 1
        # Shrink if the load factor drops too low (avoid wasting space)
        if self.count > 0 and self.load_factor() < 0.25:
            self.size //= 2
            self.rehash()   # rebuild the bucket array at the new size and reinsert every pair
Choosing table size:
- Prime numbers: Better distribution, fewer collisions
- Powers of 2: Faster modulo (use bitwise AND)
- Trade-off: Primes better for quality, powers of 2 faster
# Prime table sizes
PRIMES = [53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593]
# Power of 2 with bitmask
def hash_power_of_2(key, size):
return hash(key) & (size - 1) # Equivalent to % size when size is power of 2
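A quick sanity check that the bitmask form agrees with the modulo form whenever the size is a power of two (this holds in Python even when hash() returns a negative value):
size = 16  # power of two
for key in ["alpha", "beta", 42, (1, 2)]:
    assert hash(key) & (size - 1) == hash(key) % size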
3. Collision strategies: chaining vs open addressing
Two main approaches when multiple keys hash to same index:
Separate Chaining:
- Idea: Each table slot stores a linked list (or dynamic array) of colliding elements
- Collision: Add to list at that index
Implementation:
class HashTableChaining:
def __init__(self, size=10):
self.size = size
self.table = [[] for _ in range(size)]
def insert(self, key, value):
index = hash(key) % self.size
bucket = self.table[index]
# Update if key exists
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
# Add new key-value pair
bucket.append((key, value))
def search(self, key):
index = hash(key) % self.size
bucket = self.table[index]
for k, v in bucket:
if k == key:
return v
return None
def delete(self, key):
index = hash(key) % self.size
bucket = self.table[index]
for i, (k, v) in enumerate(bucket):
if k == key:
del bucket[i]
return True
return False
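A quick usage sketch of the class above:
ht = HashTableChaining(size=8)
ht.insert("apple", 3)
ht.insert("banana", 7)
ht.insert("apple", 5)        # updates the existing key
print(ht.search("apple"))    # 5
ht.delete("banana")
print(ht.search("banana"))   # None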
Advantages:
- ✓ Simple to implement
- ✓ Never fills up (can exceed load factor 1.0)
- ✓ Deletion is straightforward
- ✓ Less sensitive to hash function quality
Disadvantages:
- ✗ Extra memory for pointers/lists
- ✗ Poor cache locality (pointer chasing)
- ✗ Performance degrades gradually with high load
Open Addressing:
- Idea: All elements stored in table itself, no external structures
- Collision: Probe for next available slot using a sequence
General structure:
class HashTableOpenAddressing:
    def __init__(self, size=10):
        self.size = size
        self.table = [None] * size
        self.count = 0
    def probe_sequence(self, key, i):
        # Override in subclasses (linear, quadratic, double hashing)
        raise NotImplementedError
    def insert(self, key, value):
        for i in range(self.size):
            index = self.probe_sequence(key, i)
            slot = self.table[index]
            if slot is None or slot == "DELETED":
                self.table[index] = (key, value)
                self.count += 1
                return True
            if slot[0] == key:
                self.table[index] = (key, value)  # Key already present: update in place
                return True
        return False  # Table full
    def search(self, key):
        for i in range(self.size):
            index = self.probe_sequence(key, i)
            slot = self.table[index]
            if slot is None:
                return None  # Reached an empty slot: key not in table
            if slot != "DELETED" and slot[0] == key:
                return slot[1]
        return None
Advantages:
- ✓ Better cache locality (array access)
- ✓ No extra memory for pointers
- ✓ Faster for small tables (fits in cache)
Disadvantages:
- ✗ Must maintain load factor < 0.5-0.7
- ✗ Deletion is complex (need tombstones)
- ✗ Performance degrades sharply near capacity
- ✗ Clustering issues
Comparison:
| Feature | Chaining | Open Addressing |
|---|---|---|
| Memory | More (pointers) | Less (no pointers) |
| Cache locality | Poor | Good |
| Load factor | Can exceed 1.0 | Must stay < 0.7 |
| Deletion | Simple | Complex (tombstones) |
| Clustering | No | Yes |
| Implementation | Easier | Harder |
| Best for | General use | Cache-friendly, known size |
4. Linear probing, quadratic probing, double hashing
Different probe sequences for open addressing:
Linear Probing:
- Formula:
h(k, i) = (h(k) + i) % m
- Sequence: Try h(k), h(k)+1, h(k)+2, …
Implementation:
class LinearProbing(HashTableOpenAddressing):
def probe_sequence(self, key, i):
return (hash(key) + i) % self.size
Example:
Table size: 10
Insert key with hash(key) = 3
Probe sequence: 3, 4, 5, 6, 7, 8, 9, 0, 1, 2
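A short usage sketch, assuming the HashTableOpenAddressing base class from the previous section is in scope:
lp = LinearProbing(size=8)
lp.insert("apple", 1)
lp.insert("banana", 2)
lp.insert("cherry", 3)       # any collision walks forward to the next free slot
print(lp.search("banana"))   # 2
print(lp.search("missing"))  # None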
Advantages:
- ✓ Simple to implement
- ✓ Good cache locality (sequential access)
Disadvantages:
- ✗ Primary clustering: Long runs of occupied slots
- ✗ Performance degrades with high load factor
Primary clustering example:
Table: [_, _, X, X, X, X, _, _, _, _]
Cluster forms here ↑
New insertion at 2 → probes 2,3,4,5,6 → inserts at 6
Cluster grows longer!
Quadratic Probing:
- Formula:
h(k, i) = (h(k) + c₁*i + c₂*i²) % m
- Common:
h(k, i) = (h(k) + i²) % m
Implementation:
class QuadraticProbing(HashTableOpenAddressing):
def probe_sequence(self, key, i):
return (hash(key) + i * i) % self.size
Example:
Table size: 10
Insert key with hash(key) = 3
Probe sequence: 3, 4, 7, 2, 9, 8, 9, 2, ... (wraps around)
Offsets: 0, 1, 4, 9, 16, 25, ...
Advantages:
- ✓ Reduces primary clustering
- ✓ Better distribution than linear
Disadvantages:
- ✗ Secondary clustering: Keys with same hash follow same probe sequence
- ✗ May not probe every slot unless the table size is chosen carefully (e.g., a prime size with the load factor kept below 0.5)
Secondary clustering example:
Keys k1, k2 with hash(k1) = hash(k2) = 5
Both follow same probe sequence: 5, 6, 9, 4, 1, ...
Double Hashing:
- Formula:
h(k, i) = (h₁(k) + i * h₂(k)) % m
- Two hash functions: h₁ for the initial position, h₂ for the step size
Implementation:
class DoubleHashing(HashTableOpenAddressing):
def __init__(self, size=10):
super().__init__(size)
self.prime = self._prev_prime(size)
def _prev_prime(self, n):
# Find largest prime < n
for p in range(n-1, 1, -1):
if self._is_prime(p):
return p
return 2
def _is_prime(self, n):
if n < 2:
return False
for i in range(2, int(n**0.5) + 1):
if n % i == 0:
return False
return True
def hash1(self, key):
return hash(key) % self.size
def hash2(self, key):
# h2(k) must be non-zero; probing every slot also requires it to be coprime with the table size (automatic when the table size is prime)
return self.prime - (hash(key) % self.prime)
def probe_sequence(self, key, i):
return (self.hash1(key) + i * self.hash2(key)) % self.size
Example:
Table size: 10
h1(k) = 3, h2(k) = 7
Probe sequence: 3, 0, 7, 4, 1, 8, 5, 2, 9, 6
Offsets: 0, 7, 14, 21, 28, ...
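The walkthrough above is easy to reproduce; because gcd(7, 10) = 1, a step size of 7 visits every slot of a size-10 table exactly once:
size, h1, h2 = 10, 3, 7
sequence = [(h1 + i * h2) % size for i in range(size)]
print(sequence)                                # [3, 0, 7, 4, 1, 8, 5, 2, 9, 6]
assert sorted(sequence) == list(range(size))   # every slot probed exactly once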
Advantages:
- ✓ Eliminates both primary and secondary clustering
- ✓ Best distribution among open addressing methods
- ✓ Probes entire table (if h2 and m are coprime)
Disadvantages:
- ✗ More complex to implement
- ✗ Two hash function computations
- ✗ Slightly slower per probe
Comparison:
| Method | Clustering | Complexity | Distribution |
|---|---|---|---|
| Linear | Primary | Simple | Poor |
| Quadratic | Secondary | Moderate | Good |
| Double Hashing | None | Complex | Excellent |
5. Deletion nuances in open addressing
Handling deletions without breaking probe sequences:
The problem:
# Initial state
table = [None, (k1, v1), (k2, v2), None, ...]
# k1 and k2 hash to index 1, k2 probed to index 2
# Delete k1
table[1] = None # WRONG!
# Now search for k2
# Probe: index 1 → None → stop searching
# k2 is lost even though it's at index 2!
Solution: Tombstones (lazy deletion):
class HashTableWithDeletion:
DELETED = object() # Sentinel value
def __init__(self, size=10):
self.size = size
self.table = [None] * size
    def insert(self, key, value):
        first_tombstone = None
        for i in range(self.size):
            index = (hash(key) + i) % self.size
            slot = self.table[index]
            if slot is None:
                # Key is absent: reuse the earliest tombstone seen, otherwise this empty slot
                target = index if first_tombstone is None else first_tombstone
                self.table[target] = (key, value)
                return True
            if slot is self.DELETED:
                # Remember the slot but keep probing: the key may still appear further on
                if first_tombstone is None:
                    first_tombstone = index
            elif slot[0] == key:
                self.table[index] = (key, value)  # Update existing key
                return True
        if first_tombstone is not None:
            self.table[first_tombstone] = (key, value)
            return True
        return False
def search(self, key):
for i in range(self.size):
index = (hash(key) + i) % self.size
if self.table[index] is None:
return None # Not found
# Skip DELETED, continue probing
if self.table[index] is not self.DELETED:
if self.table[index][0] == key:
return self.table[index][1]
return None
def delete(self, key):
for i in range(self.size):
index = (hash(key) + i) % self.size
if self.table[index] is None:
return False # Not found
if self.table[index] is not self.DELETED:
if self.table[index][0] == key:
self.table[index] = self.DELETED
return True
return False
Tombstone issues:
- Space accumulation: DELETED markers take up space
- Performance degradation: Must probe through tombstones
- Load factor: tombstones should count toward the load factor that triggers rehashing, otherwise the table can fill with markers without ever resizing
Solutions:
def needs_rehash(self):
# Rehash if too many tombstones
active = sum(1 for slot in self.table
if slot is not None and slot is not self.DELETED)
deleted = sum(1 for slot in self.table if slot is self.DELETED)
return deleted > self.size * 0.25 or active / self.size > 0.7
def rehash(self):
old_table = self.table
self.table = [None] * self.size
for slot in old_table:
if slot is not None and slot is not self.DELETED:
key, value = slot
self.insert(key, value)
Alternative: Robin Hood hashing:
- Minimize variance in probe lengths
- When inserting, “steal” from rich (short probe) to give to poor (long probe)
- Improves worst-case lookup performance (see the sketch below)
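A minimal sketch of the Robin Hood insertion rule on top of linear probing (the table layout and function name here are illustrative, not a standard API): each stored entry carries its distance from its home slot, and an incoming entry that has travelled further than the resident entry swaps in and pushes the resident onward.
def robin_hood_insert(table, key, value):
    # table: fixed-size list holding (key, value, distance_from_home) tuples or None
    size = len(table)
    if all(slot is not None for slot in table):
        return False                     # full: a real implementation would resize first
    index = hash(key) % size
    entry = (key, value, 0)              # distance 0 at the home slot
    while True:
        slot = table[index]
        if slot is None:
            table[index] = entry
            return True
        if slot[0] == entry[0]:
            table[index] = (entry[0], entry[1], slot[2])   # same key: overwrite the value
            return True
        if slot[2] < entry[2]:
            # Resident is "richer" (closer to home): steal its slot, keep inserting the resident
            table[index], entry = entry, slot
        index = (index + 1) % size
        entry = (entry[0], entry[1], entry[2] + 1)
Lookups get a matching benefit: a probe can stop as soon as it reaches a slot whose stored distance is smaller than the distance searched so far, since the key would already have been placed there.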
6. Applications: caches, dictionaries
Real-world uses of hash tables:
Dictionaries / Maps:
# Python dict (hash table implementation)
phonebook = {
"Alice": "555-1234",
"Bob": "555-5678",
"Charlie": "555-9012"
}
# O(1) average case operations
phonebook["Alice"] # Lookup
phonebook["David"] = "555-3456" # Insert
del phonebook["Bob"] # Delete
"Charlie" in phonebook # Membership test
Caches (LRU Cache):
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.cache = OrderedDict()
self.capacity = capacity
def get(self, key):
if key not in self.cache:
return -1
# Move to end (most recently used)
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key, value):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
# Remove least recently used (first item)
self.cache.popitem(last=False)
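A quick usage sketch:
cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
print(cache.get("a"))   # 1, and "a" becomes most recently used
cache.put("c", 3)       # evicts "b", the least recently used entry
print(cache.get("b"))   # -1 (miss)
print(cache.get("c"))   # 3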
Sets (unique elements):
# Python set (hash table for keys only)
seen = set()
for item in data:
if item not in seen:
seen.add(item)
process(item)
Frequency counting:
from collections import Counter
# Count word frequencies
text = "the quick brown fox jumps over the lazy dog"
word_count = Counter(text.split())
# Counter({'the': 2, 'quick': 1, 'brown': 1, ...})
Database indexing:
# Hash index for fast lookups
class HashIndex:
def __init__(self):
self.index = {}
def insert(self, key, row_id):
if key not in self.index:
self.index[key] = []
self.index[key].append(row_id)
def lookup(self, key):
return self.index.get(key, [])
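Usage sketch (the keys and row IDs here are made up):
idx = HashIndex()
idx.insert("alice@example.com", 101)
idx.insert("alice@example.com", 207)    # non-unique index: one key maps to several rows
idx.insert("bob@example.com", 54)
print(idx.lookup("alice@example.com"))  # [101, 207]
print(idx.lookup("eve@example.com"))    # []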
Deduplication:
def remove_duplicates(arr):
seen = set()
result = []
for item in arr:
if item not in seen:
seen.add(item)
result.append(item)
return result
Two Sum problem:
def two_sum(nums, target):
seen = {}
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
return None
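For example:
print(two_sum([2, 7, 11, 15], 9))   # [0, 1], since nums[0] + nums[1] == 9
print(two_sum([3, 2, 4], 6))        # [1, 2]
print(two_sum([1, 2], 10))          # None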
7. Complexity and pitfalls
Understanding performance and common mistakes:
Time complexity:
| Operation | Average | Worst Case | Notes |
|---|---|---|---|
| Insert | O(1) | O(n) | Worst: all keys collide |
| Search | O(1) | O(n) | Worst: all keys collide |
| Delete | O(1) | O(n) | Worst: all keys collide |
| Resize | O(n) | O(n) | Amortized O(1) per insert |
Space complexity:
- Chaining: O(n + m) where n = elements, m = table size
- Open addressing: O(m), must keep load factor low
Common pitfalls:
1. Using mutable keys:
# BAD: Lists are mutable, can't be hashed
d = {[1, 2]: "value"} # TypeError!
# GOOD: Use immutable types
d = {(1, 2): "value"} # Tuples work
d = {frozenset([1, 2]): "value"} # Frozen sets work
2. Poor hash function:
# BAD: All strings of the same length hash to the same value
def bad_hash(s):
return len(s) % 10
# All 5-letter words collide!
3. Ignoring load factor:
# BAD: Never resize
class BadHashTable:
def insert(self, key, value):
# Just keep inserting...
# Performance degrades to O(n)!
pass
4. Forgetting tombstones in open addressing:
# BAD: Delete by setting to None
def delete(self, key):
index = self.find(key)
self.table[index] = None # Breaks probe sequences!
5. Hash flooding attack:
# Attacker sends keys that all hash to same value
# Causes O(n) operations, DoS attack
# Solution: Use randomized hash functions (Python 3.3+)
Performance tips:
- Choose good initial size: Avoid early resizes
- Use prime table sizes: Better distribution
- Monitor load factor: Resize before performance degrades
- Profile hash function: Ensure uniform distribution
- Consider alternatives: For ordered data, use trees
When NOT to use hash tables:
- ✗ Need ordered iteration (use BST or sorted array)
- ✗ Need range queries (use BST or B-tree)
- ✗ Need minimum/maximum (use heap)
- ✗ Memory constrained (hash tables waste space)
- ✗ Worst-case guarantees needed (use balanced trees)