Hash Tables
Key-value structure enabling expected O(1) operations.
1. Hash functions and properties
The foundation of hash table performance:
What is a hash function?
- Definition: Function that maps keys to array indices
- Signature:
hash(key) → integer in range [0, table_size-1]
- Goal: Distribute keys uniformly across the table
Good hash function properties:
- Deterministic: Same key always produces same hash
- Uniform distribution: Keys spread evenly across table
- Fast to compute: O(1) or O(k) where k = key length
- Minimize collisions: Different keys rarely map to same index
- Avalanche effect: Small key change → large hash change (see the sketch below)
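One way to see the avalanche property concretely is to hash two keys that differ by a single character and count how many output bits change. A small sketch using hashlib's SHA-256 purely as an illustration (table hash functions are usually far cheaper, but the idea is the same):
import hashlib
def digest_bits(s):
    # SHA-256 digest of s, rendered as a string of 256 bits
    h = hashlib.sha256(s.encode()).digest()
    return ''.join(f'{byte:08b}' for byte in h)
a, b = digest_bits("hello"), digest_bits("hellp")
flipped = sum(x != y for x, y in zip(a, b))
print(f"{flipped} of {len(a)} bits differ")  # typically close to half of the 256 bits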
Common hash functions:
1. Division method:
def hash_division(key, table_size):
return key % table_size
# Best when table_size is prime
# Example: key=42, size=13 → 42 % 13 = 3
2. Multiplication method:
def hash_multiplication(key, table_size):
A = 0.6180339887 # (√5 - 1) / 2, the reciprocal of the golden ratio (Knuth's suggested constant)
return int(table_size * ((key * A) % 1))
# Works well with any table size
3. Universal hashing:
import random
class UniversalHash:
def __init__(self, table_size):
self.table_size = table_size
self.p = 2**31 - 1 # Large prime
self.a = random.randint(1, self.p - 1)
self.b = random.randint(0, self.p - 1)
def hash(self, key):
return ((self.a * key + self.b) % self.p) % self.table_size
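Why randomize a and b? The standard universal-hashing guarantee is that for any two distinct integer keys, the probability of a collision is at most about 1/table_size, no matter how the keys were chosen. A quick usage sketch of the class above (integer keys assumed):
uh = UniversalHash(table_size=13)
print(uh.hash(42), uh.hash(1042))  # two slots in the range [0, 12]
# Creating a new instance draws fresh a and b, so the key → slot mapping changes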
4. String hashing (polynomial rolling hash):
def hash_string(s, table_size):
hash_val = 0
p = 31 # Prime base
p_pow = 1
for char in s:
hash_val = (hash_val + ord(char) * p_pow) % table_size
p_pow = (p_pow * p) % table_size
return hash_val
# Example: "hello" → hash value
5. Python’s built-in hash:
# For integers
hash(42) % table_size
# For strings
hash("hello") % table_size
# For tuples (immutable)
hash((1, 2, 3)) % table_size
# Note: Python randomizes str/bytes hashes per interpreter session for security (PYTHONHASHSEED)
Handling non-integer keys:
def hash_any_key(key, table_size):
# Convert to integer first
if isinstance(key, str):
return hash_string(key, table_size)
elif isinstance(key, int):
return key % table_size
elif isinstance(key, tuple):
# Combine hashes of elements
h = 0
for item in key:
h = (h * 31 + hash(item)) % table_size
return h
else:
return hash(key) % table_size
Bad hash function example:
# BAD: Only uses first character
def bad_hash(s, table_size):
return ord(s[0]) % table_size
# Problem: "apple", "ant", "arrow" all hash to same value
# Causes many collisions!
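To make the difference visible, the sketch below buckets a few words with both functions (assuming bad_hash and the hash_string function defined earlier are in scope) and prints how the buckets fill up:
from collections import Counter
words = ["apple", "ant", "arrow", "avocado", "banana", "cherry", "date", "fig"]
size = 8
print(dict(Counter(bad_hash(w, size) for w in words)))      # the 'a' words all pile into one slot
print(dict(Counter(hash_string(w, size) for w in words)))   # typically spread across several slots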
2. Load factor and resizing
Managing table capacity and performance:
Load factor (α):
- Definition: α = n / m where n = elements, m = table size
- Meaning: Average number of elements per slot
- Impact: Higher α → more collisions → slower operations
Typical thresholds:
- Chaining: Resize when α > 0.75
- Open addressing: Resize when α > 0.5–0.7 to avoid clustering (CPython's dict, which uses open addressing, resizes at about 2/3)
Resizing strategy:
class HashTable:
def __init__(self, initial_size=16):
self.size = initial_size
self.count = 0
self.table = [[] for _ in range(self.size)] # Chaining
def load_factor(self):
return self.count / self.size
def resize(self):
# Double the size (use next prime for better distribution)
old_table = self.table
self.size *= 2
self.table = [[] for _ in range(self.size)]
self.count = 0
# Rehash all elements
for bucket in old_table:
for key, value in bucket:
self.insert(key, value)
    def insert(self, key, value):
        if self.load_factor() > 0.75:
            self.resize()
        index = hash(key) % self.size
        bucket = self.table[index]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # Key already present: update in place
                return
        bucket.append((key, value))
        self.count += 1
Amortized analysis of resizing:
Operations: n insertions
Resizes at: 1, 2, 4, 8, 16, ..., n
Total rehashing cost: 1 + 2 + 4 + ... + n = 2n - 1 = O(n)
Amortized cost per insertion: O(n) / n = O(1)
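A rough empirical check of this argument: simulate n insertions into a table that doubles whenever it fills, and count how many elements get moved during resizes. The total stays below 2n:
def total_rehash_moves(n, initial_size=1):
    size, count, moves = initial_size, 0, 0
    for _ in range(n):
        if count == size:     # table full: double it and move every existing element
            moves += count
            size *= 2
        count += 1
    return moves
print(total_rehash_moves(10_000))  # well under 2 * 10_000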
Shrinking strategy:
    def delete(self, key):
        # Remove the key from its bucket (same lookup as in search), then:
        self.count -= 1
        # Shrink if the load factor drops too low (avoid wasting space)
        if self.count > 0 and self.load_factor() < 0.25:
            self.size //= 2
            self.rehash()   # rebuild the bucket array at the new size and reinsert every pair
Choosing table size:
- Prime numbers: Better distribution, fewer collisions
- Powers of 2: Faster modulo (use bitwise AND)
- Trade-off: Primes better for quality, powers of 2 faster
# Prime table sizes
PRIMES = [53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593]
# Power of 2 with bitmask
def hash_power_of_2(key, size):
return hash(key) & (size - 1) # Equivalent to % size when size is power of 2
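A quick sanity check that the bitmask form agrees with the modulo form whenever the size is a power of two (this holds in Python even when hash() returns a negative value):
size = 16  # power of two
for key in ["alpha", "beta", 42, (1, 2)]:
    assert hash(key) & (size - 1) == hash(key) % size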
3. Collision strategies: chaining vs open addressing
Two main approaches when multiple keys hash to same index:
Separate Chaining:
- Idea: Each table slot stores a linked list (or dynamic array) of colliding elements
- Collision: Add to list at that index
Implementation:
class HashTableChaining:
def __init__(self, size=10):
self.size = size
self.table = [[] for _ in range(size)]
def insert(self, key, value):
index = hash(key) % self.size
bucket = self.table[index]
# Update if key exists
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
# Add new key-value pair
bucket.append((key, value))
def search(self, key):
index = hash(key) % self.size
bucket = self.table[index]
for k, v in bucket:
if k == key:
return v
return None
def delete(self, key):
index = hash(key) % self.size
bucket = self.table[index]
for i, (k, v) in enumerate(bucket):
if k == key:
del bucket[i]
return True
return False
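A quick usage sketch of the class above:
ht = HashTableChaining(size=8)
ht.insert("apple", 3)
ht.insert("banana", 7)
ht.insert("apple", 5)        # updates the existing key
print(ht.search("apple"))    # 5
ht.delete("banana")
print(ht.search("banana"))   # None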
Advantages:
- ✓ Simple to implement
- ✓ Never fills up (can exceed load factor 1.0)
- ✓ Deletion is straightforward
- ✓ Less sensitive to hash function quality
Disadvantages:
- ✗ Extra memory for pointers/lists
- ✗ Poor cache locality (pointer chasing)
- ✗ Performance degrades gradually with high load
Open Addressing:
- Idea: All elements stored in table itself, no external structures
- Collision: Probe for next available slot using a sequence
General structure:
class HashTableOpenAddressing:
    def __init__(self, size=10):
        self.size = size
        self.table = [None] * size
        self.count = 0
    def probe_sequence(self, key, i):
        # Override in subclasses (linear, quadratic, double hashing)
        raise NotImplementedError
    def insert(self, key, value):
        for i in range(self.size):
            index = self.probe_sequence(key, i)
            slot = self.table[index]
            if slot is None or slot == "DELETED":
                self.table[index] = (key, value)
                self.count += 1
                return True
            if slot[0] == key:
                self.table[index] = (key, value)  # Key already present: update in place
                return True
        return False  # Table full
    def search(self, key):
        for i in range(self.size):
            index = self.probe_sequence(key, i)
            slot = self.table[index]
            if slot is None:
                return None  # Reached an empty slot: key not in table
            if slot != "DELETED" and slot[0] == key:
                return slot[1]
        return None
Advantages:
- ✓ Better cache locality (array access)
- ✓ No extra memory for pointers
- ✓ Faster for small tables (fits in cache)
Disadvantages:
- ✗ Must maintain load factor < 0.5-0.7
- ✗ Deletion is complex (need tombstones)
- ✗ Performance degrades sharply near capacity
- ✗ Clustering issues
Comparison:
| Feature | Chaining | Open Addressing |
|---|---|---|
| Memory | More (pointers) | Less (no pointers) |
| Cache locality | Poor | Good |
| Load factor | Can exceed 1.0 | Must stay < 0.7 |
| Deletion | Simple | Complex (tombstones) |
| Clustering | No | Yes |
| Implementation | Easier | Harder |
| Best for | General use | Cache-friendly, known size |
4. Linear probing, quadratic probing, double hashing
Different probe sequences for open addressing:
Linear Probing:
- Formula:
h(k, i) = (h(k) + i) % m
- Sequence: Try h(k), h(k)+1, h(k)+2, …
Implementation:
class LinearProbing(HashTableOpenAddressing):
def probe_sequence(self, key, i):
return (hash(key) + i) % self.size
Example:
Table size: 10
Insert key with hash(key) = 3
Probe sequence: 3, 4, 5, 6, 7, 8, 9, 0, 1, 2
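A short usage sketch, assuming the HashTableOpenAddressing base class from the previous section is in scope:
lp = LinearProbing(size=8)
lp.insert("apple", 1)
lp.insert("banana", 2)
lp.insert("cherry", 3)       # any collision walks forward to the next free slot
print(lp.search("banana"))   # 2
print(lp.search("missing"))  # None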
Advantages:
- ✓ Simple to implement
- ✓ Good cache locality (sequential access)
Disadvantages:
- ✗ Primary clustering: Long runs of occupied slots
- ✗ Performance degrades with high load factor
Primary clustering example:
Table: [_, _, X, X, X, X, _, _, _, _]
Cluster forms here ↑
New insertion at 2 → probes 2,3,4,5,6 → inserts at 6
Cluster grows longer!
Quadratic Probing:
- Formula:
h(k, i) = (h(k) + c₁*i + c₂*i²) % m
- Common:
h(k, i) = (h(k) + i²) % m
Implementation:
class QuadraticProbing(HashTableOpenAddressing):
def probe_sequence(self, key, i):
return (hash(key) + i * i) % self.size
Example:
Table size: 10
Insert key with hash(key) = 3
Probe sequence: 3, 4, 7, 2, 9, 8, 9, 2, ... (wraps around)
Offsets: 0, 1, 4, 9, 16, 25, ...
Advantages:
- ✓ Reduces primary clustering
- ✓ Better distribution than linear
Disadvantages:
- ✗ Secondary clustering: Keys with same hash follow same probe sequence
- ✗ May not probe every slot unless the table size is chosen carefully (e.g., a prime size with the load factor kept below 0.5)
Secondary clustering example:
Keys k1, k2 with hash(k1) = hash(k2) = 5
Both follow same probe sequence: 5, 6, 9, 4, 1, ...
Double Hashing:
- Formula:
h(k, i) = (h₁(k) + i * h₂(k)) % m
- Two hash functions: h₁ for the initial position, h₂ for the step size
Implementation:
class DoubleHashing(HashTableOpenAddressing):
def __init__(self, size=10):
super().__init__(size)
self.prime = self._prev_prime(size)
def _prev_prime(self, n):
# Find largest prime < n
for p in range(n-1, 1, -1):
if self._is_prime(p):
return p
return 2
def _is_prime(self, n):
if n < 2:
return False
for i in range(2, int(n**0.5) + 1):
if n % i == 0:
return False
return True
def hash1(self, key):
return hash(key) % self.size
def hash2(self, key):
# h2(k) must be non-zero; probing every slot also requires it to be coprime with the table size (automatic when the table size is prime)
return self.prime - (hash(key) % self.prime)
def probe_sequence(self, key, i):
return (self.hash1(key) + i * self.hash2(key)) % self.size
Example:
Table size: 10
h1(k) = 3, h2(k) = 7
Probe sequence: 3, 0, 7, 4, 1, 8, 5, 2, 9, 6
Offsets: 0, 7, 14, 21, 28, ...
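The walkthrough above is easy to reproduce; because gcd(7, 10) = 1, a step size of 7 visits every slot of a size-10 table exactly once:
size, h1, h2 = 10, 3, 7
sequence = [(h1 + i * h2) % size for i in range(size)]
print(sequence)                                # [3, 0, 7, 4, 1, 8, 5, 2, 9, 6]
assert sorted(sequence) == list(range(size))   # every slot probed exactly once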
Advantages:
- ✓ Eliminates both primary and secondary clustering
- ✓ Best distribution among open addressing methods
- ✓ Probes entire table (if h2 and m are coprime)
Disadvantages:
- ✗ More complex to implement
- ✗ Two hash function computations
- ✗ Slightly slower per probe
Comparison:
| Method | Clustering | Complexity | Distribution |
|---|---|---|---|
| Linear | Primary | Simple | Poor |
| Quadratic | Secondary | Moderate | Good |
| Double Hashing | None | Complex | Excellent |
5. Deletion nuances in open addressing
Handling deletions without breaking probe sequences:
The problem:
# Initial state
table = [None, (k1, v1), (k2, v2), None, ...]
# k1 and k2 hash to index 1, k2 probed to index 2
# Delete k1
table[1] = None # WRONG!
# Now search for k2
# Probe: index 1 → None → stop searching
# k2 is lost even though it's at index 2!
Solution: Tombstones (lazy deletion):
class HashTableWithDeletion:
DELETED = object() # Sentinel value
def __init__(self, size=10):
self.size = size
self.table = [None] * size
    def insert(self, key, value):
        first_tombstone = None
        for i in range(self.size):
            index = (hash(key) + i) % self.size
            slot = self.table[index]
            if slot is None:
                # Key is absent: reuse the earliest tombstone seen, otherwise this empty slot
                target = index if first_tombstone is None else first_tombstone
                self.table[target] = (key, value)
                return True
            if slot is self.DELETED:
                # Remember the slot but keep probing: the key may still appear further on
                if first_tombstone is None:
                    first_tombstone = index
            elif slot[0] == key:
                self.table[index] = (key, value)  # Update existing key
                return True
        if first_tombstone is not None:
            self.table[first_tombstone] = (key, value)
            return True
        return False
def search(self, key):
for i in range(self.size):
index = (hash(key) + i) % self.size
if self.table[index] is None:
return None # Not found
# Skip DELETED, continue probing
if self.table[index] is not self.DELETED:
if self.table[index][0] == key:
return self.table[index][1]
return None
def delete(self, key):
for i in range(self.size):
index = (hash(key) + i) % self.size
if self.table[index] is None:
return False # Not found
if self.table[index] is not self.DELETED:
if self.table[index][0] == key:
self.table[index] = self.DELETED
return True
return False
Tombstone issues:
- Space accumulation: DELETED markers take up space
- Performance degradation: Must probe through tombstones
- Load factor: tombstones should count toward the load factor that triggers rehashing, otherwise the table can fill with markers without ever resizing
Solutions:
def needs_rehash(self):
# Rehash if too many tombstones
active = sum(1 for slot in self.table
if slot is not None and slot is not self.DELETED)
deleted = sum(1 for slot in self.table if slot is self.DELETED)
return deleted > self.size * 0.25 or active / self.size > 0.7
def rehash(self):
old_table = self.table
self.table = [None] * self.size
for slot in old_table:
if slot is not None and slot is not self.DELETED:
key, value = slot
self.insert(key, value)
Alternative: Robin Hood hashing:
- Minimize variance in probe lengths
- When inserting, “steal” from rich (short probe) to give to poor (long probe)
- Improves worst-case lookup performance (see the sketch below)
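A minimal sketch of the Robin Hood insertion rule on top of linear probing (the table layout and function name here are illustrative, not a standard API): each stored entry carries its distance from its home slot, and an incoming entry that has travelled further than the resident entry swaps in and pushes the resident onward.
def robin_hood_insert(table, key, value):
    # table: fixed-size list holding (key, value, distance_from_home) tuples or None
    size = len(table)
    if all(slot is not None for slot in table):
        return False                     # full: a real implementation would resize first
    index = hash(key) % size
    entry = (key, value, 0)              # distance 0 at the home slot
    while True:
        slot = table[index]
        if slot is None:
            table[index] = entry
            return True
        if slot[0] == entry[0]:
            table[index] = (entry[0], entry[1], slot[2])   # same key: overwrite the value
            return True
        if slot[2] < entry[2]:
            # Resident is "richer" (closer to home): steal its slot, keep inserting the resident
            table[index], entry = entry, slot
        index = (index + 1) % size
        entry = (entry[0], entry[1], entry[2] + 1)
Lookups get a matching benefit: a probe can stop as soon as it reaches a slot whose stored distance is smaller than the distance searched so far, since the key would already have been placed there.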
6. Applications: caches, dictionaries
Real-world uses of hash tables:
Dictionaries / Maps:
# Python dict (hash table implementation)
phonebook = {
"Alice": "555-1234",
"Bob": "555-5678",
"Charlie": "555-9012"
}
# O(1) average case operations
phonebook["Alice"] # Lookup
phonebook["David"] = "555-3456" # Insert
del phonebook["Bob"] # Delete
"Charlie" in phonebook # Membership test
Caches (LRU Cache):
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.cache = OrderedDict()
self.capacity = capacity
def get(self, key):
if key not in self.cache:
return -1
# Move to end (most recently used)
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key, value):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
# Remove least recently used (first item)
self.cache.popitem(last=False)
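A quick usage sketch:
cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
print(cache.get("a"))   # 1, and "a" becomes most recently used
cache.put("c", 3)       # evicts "b", the least recently used entry
print(cache.get("b"))   # -1 (miss)
print(cache.get("c"))   # 3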
Sets (unique elements):
# Python set (hash table for keys only)
seen = set()
for item in data:
if item not in seen:
seen.add(item)
process(item)
Frequency counting:
from collections import Counter
# Count word frequencies
text = "the quick brown fox jumps over the lazy dog"
word_count = Counter(text.split())
# Counter({'the': 2, 'quick': 1, 'brown': 1, ...})
Database indexing:
# Hash index for fast lookups
class HashIndex:
def __init__(self):
self.index = {}
def insert(self, key, row_id):
if key not in self.index:
self.index[key] = []
self.index[key].append(row_id)
def lookup(self, key):
return self.index.get(key, [])
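Usage sketch (the keys and row IDs here are made up):
idx = HashIndex()
idx.insert("alice@example.com", 101)
idx.insert("alice@example.com", 207)    # non-unique index: one key maps to several rows
idx.insert("bob@example.com", 54)
print(idx.lookup("alice@example.com"))  # [101, 207]
print(idx.lookup("eve@example.com"))    # []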
Deduplication:
def remove_duplicates(arr):
seen = set()
result = []
for item in arr:
if item not in seen:
seen.add(item)
result.append(item)
return result
Two Sum problem:
def two_sum(nums, target):
seen = {}
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
return None
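For example:
print(two_sum([2, 7, 11, 15], 9))   # [0, 1], since nums[0] + nums[1] == 9
print(two_sum([3, 2, 4], 6))        # [1, 2]
print(two_sum([1, 2], 10))          # None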
7. Complexity and pitfalls
Understanding performance and common mistakes:
Time complexity:
| Operation | Average | Worst Case | Notes |
|---|---|---|---|
| Insert | O(1) | O(n) | Worst: all keys collide |
| Search | O(1) | O(n) | Worst: all keys collide |
| Delete | O(1) | O(n) | Worst: all keys collide |
| Resize | O(n) | O(n) | Amortized O(1) per insert |
Space complexity:
- Chaining: O(n + m) where n = elements, m = table size
- Open addressing: O(m), must keep load factor low
Common pitfalls:
1. Using mutable keys:
# BAD: Lists are mutable, can't be hashed
d = {[1, 2]: "value"} # TypeError!
# GOOD: Use immutable types
d = {(1, 2): "value"} # Tuples work
d = {frozenset([1, 2]): "value"} # Frozen sets work
2. Poor hash function:
# BAD: All strings of the same length hash to the same value
def bad_hash(s):
return len(s) % 10
# All 5-letter words collide!
3. Ignoring load factor:
# BAD: Never resize
class BadHashTable:
def insert(self, key, value):
# Just keep inserting...
# Performance degrades to O(n)!
pass
4. Forgetting tombstones in open addressing:
# BAD: Delete by setting to None
def delete(self, key):
index = self.find(key)
self.table[index] = None # Breaks probe sequences!
5. Hash flooding attack:
# Attacker sends keys that all hash to same value
# Causes O(n) operations, DoS attack
# Solution: Use randomized hash functions (Python 3.3+)
Performance tips:
- Choose good initial size: Avoid early resizes
- Use prime table sizes: Better distribution
- Monitor load factor: Resize before performance degrades
- Profile hash function: Ensure uniform distribution
- Consider alternatives: For ordered data, use trees
When NOT to use hash tables:
- ✗ Need ordered iteration (use BST or sorted array)
- ✗ Need range queries (use BST or B-tree)
- ✗ Need minimum/maximum (use heap)
- ✗ Memory constrained (hash tables waste space)
- ✗ Worst-case guarantees needed (use balanced trees)