Introduction to Data Structures
Understand what data structures are and why they matter.
Definition and role of data structures
A data structure is a specialized format for organizing, storing, and managing data in a computer so that it can be accessed and modified efficiently. Data structures provide a means to manage large amounts of data efficiently for uses such as large databases and internet indexing services.
Why do we need data structures?
- Efficiency: Different data structures excel at different operations. Choosing the right one can make the difference between an algorithm that runs in milliseconds versus hours.
- Organization: Data structures provide a systematic way to organize data, making code more readable and maintainable.
- Reusability: Well-designed data structures can be reused across different problems and applications.
- Abstraction: They allow us to think about data at a higher level, hiding implementation details.
Real-world analogy: Think of data structures like furniture in a house:
- A bookshelf organizes books for easy retrieval (like an array)
- A filing cabinet with labeled drawers lets you jump straight to the right file (like a hash table)
- A stack of plates where you take from the top (like a stack)
- A line at a store where first person in line is served first (like a queue)
Example demonstrating the importance:
# Checking if an element exists
# Approach 1: Using a list (array)
my_list = list(range(1, 1_000_001))  # The elements 1 through 1,000,000
if 999999 in my_list:  # O(n) - must scan through the elements
    print("Found!")

# Approach 2: Using a set (hash table)
my_set = set(range(1, 1_000_001))
if 999999 in my_set:  # O(1) average - direct hash lookup
    print("Found!")
For a million elements, the set lookup is nearly instantaneous while the list scan could take significant time.
Time vs space trade-offs
Time complexity measures how the runtime of an algorithm grows with input size. Space complexity measures how much memory is required.
The Trade-off: Often, we can make an algorithm faster by using more memory, or save memory by accepting slower execution. This is called the time-space trade-off.
Key Principle: There’s rarely a solution that is both the fastest AND uses the least memory. Engineers must choose based on the constraints of their problem.
Common Trade-off Examples:
- Caching/Memoization (Trading space for time):
# Without memoization - slower, less memory
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)  # O(2^n) time, O(n) space

# With memoization - faster, more memory
cache = {}

def fibonacci_memo(n):
    if n in cache:
        return cache[n]
    if n <= 1:
        return n
    cache[n] = fibonacci_memo(n - 1) + fibonacci_memo(n - 2)  # O(n) time, O(n) space
    return cache[n]
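Python's standard library can manage the cache for you; a quick sketch using functools.lru_cache, which behaves like the hand-rolled cache above:

from functools import lru_cache

@lru_cache(maxsize=None)  # Unbounded cache: trades memory for speed
def fibonacci_cached(n):
    if n <= 1:
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

print(fibonacci_cached(100))  # Returns instantly; the naive version would never finish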
- Lookup Tables (Trading space for time):
# Computing square roots each time - slow, minimal space
def process_numbers(numbers):
    results = []
    for n in numbers:
        results.append(n ** 0.5)  # Compute each time
    return results

# Pre-computed lookup table - fast, more space
import math
sqrt_table = {i: math.sqrt(i) for i in range(10000)}  # Pre-compute once

def process_numbers_fast(numbers):
    results = []
    for n in numbers:
        results.append(sqrt_table[n])  # O(1) lookup; assumes 0 <= n < 10000
    return results
- In-place vs Out-of-place Algorithms (Trading time for space):
# Out-of-place: creates a new array - O(n) extra space
def reverse_list(arr):
    return arr[::-1]  # Creates a copy

# In-place: modifies the original - O(1) extra space
def reverse_list_inplace(arr):
    left, right = 0, len(arr) - 1
    while left < right:
        arr[left], arr[right] = arr[right], arr[left]
        left += 1
        right -= 1
    return arr
When to choose what?
- Choose time optimization when: dealing with large datasets, user experience matters, memory is abundant
- Choose space optimization when: memory is limited (embedded systems), data is too large to fit in RAM, running on resource-constrained devices
Asymptotic analysis and Big-O
Asymptotic analysis is a method of describing the performance of an algorithm as the input size approaches infinity. It focuses on growth rates rather than exact counts.
Big-O Notation describes an upper bound on an algorithm's growth rate; by convention, it is usually quoted for the worst case.
Why asymptotic?
- Actual runtime depends on hardware, language, compiler optimizations
- We care about scalability: how does it perform with huge inputs?
- Constants and lower-order terms become insignificant with large n
Common Big-O Classes (from fastest to slowest):
| Notation | Name | Example | Growth |
|---|---|---|---|
| O(1) | Constant | Array access, hash lookup | Doesn’t grow |
| O(log n) | Logarithmic | Binary search | Very slow growth |
| O(n) | Linear | Linear search, array scan | Proportional |
| O(n log n) | Linearithmic | Merge sort, quicksort (average) | Moderate |
| O(n²) | Quadratic | Nested loops, bubble sort | Fast growth |
| O(n³) | Cubic | Triple nested loops | Very fast growth |
| O(2ⁿ) | Exponential | Recursive fibonacci | Explosive growth |
| O(n!) | Factorial | Generate all permutations | Extremely explosive |
Visual comparison for n = 100:
- O(1) = 1 operation
- O(log n) ≈ 7 operations
- O(n) = 100 operations
- O(n log n) ≈ 664 operations
- O(n²) = 10,000 operations
- O(2ⁿ) ≈ 1.27 × 10³⁰ operations (universe-ending!)
Big-O Rules:
- Drop constants: O(2n) → O(n), O(100) → O(1)
- Drop lower-order terms: O(n² + n) → O(n²)
- Different variables for different inputs: O(a + b), O(a × b)
- Worst case: unless stated otherwise, Big-O figures are quoted for the worst-case scenario
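A short sketch of the different-variables rule (the function names are illustrative):

def count_sequential(a_items, b_items):
    # Two separate loops: O(a + b), not "O(n)" - the inputs are independent
    total = 0
    for a in a_items:   # O(a)
        total += 1
    for b in b_items:   # O(b)
        total += 1
    return total

def count_nested(a_items, b_items):
    # Nested loops over different inputs: O(a * b)
    pairs = 0
    for a in a_items:
        for b in b_items:
            pairs += 1
    return pairs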
Examples:
# O(1) - Constant time
def get_first_element(arr):
    return arr[0]  # Always 1 operation, regardless of size

# O(n) - Linear time
def find_max(arr):
    max_val = arr[0]
    for num in arr:  # n iterations
        if num > max_val:
            max_val = num
    return max_val

# O(n²) - Quadratic time
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):  # n passes
        for j in range(n - 1 - i):  # last i elements are already in place
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

# O(log n) - Logarithmic time
def binary_search(sorted_arr, target):
    left, right = 0, len(sorted_arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if sorted_arr[mid] == target:
            return mid
        elif sorted_arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1  # Halve the search space each iteration
    return -1

# O(n log n) - Linearithmic time
def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])   # log n levels of division
    right = merge_sort(arr[mid:])
    return merge(left, right)      # n work per level
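merge_sort above calls a merge helper it doesn't define; a minimal version:

def merge(left, right):
    # Interleave two already-sorted lists in O(n) time
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # At most one of these is non-empty
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 4, 1]))  # [1, 2, 4, 5]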
Other notations (less commonly used):
- Big-Omega (Ω): Lower bound on growth (often applied to the best case)
- Big-Theta (Θ): Tight bound (growth matched from above and below)
- Little-o (o): Strict, non-tight upper bound
Big-O of common operations
Understanding the complexity of common operations helps you choose the right data structure for your problem.
Arrays (Lists in Python):
| Operation | Time Complexity | Notes |
|---|---|---|
| Access by index | O(1) | Direct memory access |
| Search (unsorted) | O(n) | Must check each element |
| Search (sorted) | O(log n) | Can use binary search |
| Insert at end | O(1) amortized | May need to resize |
| Insert at beginning | O(n) | Must shift all elements |
| Delete at end | O(1) | Simple removal |
| Delete at beginning | O(n) | Must shift all elements |
arr = [1, 2, 3, 4, 5]
x = arr[2] # O(1) - direct access
arr.append(6) # O(1) amortized
arr.insert(0, 0) # O(n) - shifts all elements
Linked Lists:
| Operation | Time Complexity | Notes |
|---|---|---|
| Access by index | O(n) | Must traverse from head |
| Search | O(n) | Must traverse sequentially |
| Insert at beginning | O(1) | Just update head pointer |
| Insert at end | O(n) or O(1) | O(1) if tail pointer exists |
| Delete at beginning | O(1) | Update head pointer |
| Delete at end | O(n) | Need to find second-to-last |
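A minimal sketch of why head insertion is O(1) (this Node class is illustrative, not a library type):

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

head = Node(2, Node(3))  # List: 2 -> 3
head = Node(1, head)     # O(1): prepend by rewiring one pointer; list: 1 -> 2 -> 3

# Access by index is O(n): must walk from the head
node = head
for _ in range(2):
    node = node.next
print(node.value)  # 3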
Hash Tables (Dictionaries in Python):
| Operation | Average | Worst Case | Notes |
|---|---|---|---|
| Search | O(1) | O(n) | Worst case: all collisions |
| Insert | O(1) | O(n) | Worst case: rehashing |
| Delete | O(1) | O(n) | Worst case: many collisions |
hash_map = {}
hash_map['key'] = 'value' # O(1) insert
value = hash_map['key'] # O(1) lookup
del hash_map['key'] # O(1) delete
Binary Search Trees:
| Operation | Balanced | Worst Case (Unbalanced) |
|---|---|---|
| Search | O(log n) | O(n) |
| Insert | O(log n) | O(n) |
| Delete | O(log n) | O(n) |
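A minimal, unbalanced BST sketch (production trees rebalance themselves to keep operations at O(log n)):

class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # Duplicate keys are ignored

def search(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [5, 3, 8, 1]:
    root = insert(root, k)
print(search(root, 8), search(root, 7))  # True False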
Stacks and Queues:
| Operation | Time Complexity |
|---|---|
| Push/Enqueue | O(1) |
| Pop/Dequeue | O(1) |
| Peek/Front | O(1) |
| Search | O(n) |
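In Python, a plain list works as a stack and collections.deque as a queue; a quick sketch:

from collections import deque

stack = []
stack.append("a")        # Push - O(1)
top = stack.pop()        # Pop - O(1), LIFO

queue = deque()
queue.append("a")        # Enqueue - O(1)
queue.append("b")
first = queue.popleft()  # Dequeue - O(1); a list's pop(0) would be O(n)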
Heaps (Priority Queues):
| Operation | Time Complexity |
|---|---|
| Find min/max | O(1) |
| Insert | O(log n) |
| Delete min/max | O(log n) |
| Build heap | O(n) |
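Python's heapq module implements a binary min-heap on top of a plain list; a quick sketch:

import heapq

data = [7, 2, 9, 4]
heapq.heapify(data)         # Build heap - O(n)
heapq.heappush(data, 1)     # Insert - O(log n)
print(data[0])              # Find min - O(1) -> 1
print(heapq.heappop(data))  # Delete min - O(log n) -> 1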
Sorting Algorithms:
| Algorithm | Average | Worst | Space | Stable? |
|---|---|---|---|---|
| Quick Sort | O(n log n) | O(n²) | O(log n) | No |
| Merge Sort | O(n log n) | O(n log n) | O(n) | Yes |
| Heap Sort | O(n log n) | O(n log n) | O(1) | No |
| Bubble Sort | O(n²) | O(n²) | O(1) | Yes |
| Insertion Sort | O(n²) | O(n²) | O(1) | Yes |
Quick Reference for choosing operations:
- Need fast random access? → Use arrays
- Need fast insertion/deletion at beginning? → Use linked lists
- Need fast key-value lookup? → Use hash tables
- Need sorted data with fast operations? → Use balanced BSTs
- Need LIFO access? → Use stacks
- Need FIFO access? → Use queues
- Need min/max quickly? → Use heaps
When to choose a structure
Choosing the right data structure is crucial for efficient algorithm design. Here’s a decision guide:
Ask these questions:
- What operations will be most frequent?
  - Lots of lookups? → Hash table or BST
  - Lots of insertions/deletions? → Linked list or balanced tree
  - Processing in order? → Queue
  - Need to undo operations? → Stack
- Do you need ordered data?
  - Yes, sorted order → BST, sorted array
  - Yes, insertion order → Array, linked list
  - No → Hash table (fastest)
- Do you know the size in advance?
  - Yes, fixed size → Array
  - No, dynamic → Linked list, dynamic array, hash table
- What are your constraints?
  - Limited memory → Choose space-efficient structures
  - Need speed → Choose time-efficient structures
  - Need both → Make trade-offs based on primary use case
Decision Tree Examples:
Scenario 1: Building a phone book
- Need: Fast lookup by name
- Need: Alphabetical order useful but not required
- Operations: Search (frequent), Insert (rare), Delete (rare)
- Choice: Hash table (O(1) lookup) or BST (O(log n) with ordering)
# Using hash table
phone_book = {
    "Alice": "555-1234",
    "Bob": "555-5678",
}
print(phone_book["Alice"]) # O(1) lookup
Scenario 2: Browser history
- Need: Recent pages accessed frequently
- Need: Add new page, go back/forward
- Operations: Push (frequent), Pop (frequent), Random access (rare)
- Choice: Stack for back button, Stack for forward button
current_page = "home"  # Track the page currently displayed
back_stack = []
forward_stack = []

def visit_page(url):
    global current_page
    back_stack.append(current_page)
    forward_stack.clear()  # Can't go forward after a new visit
    current_page = url

def go_back():
    global current_page
    if back_stack:
        forward_stack.append(current_page)
        current_page = back_stack.pop()
    return current_page
Scenario 3: Print queue
- Need: First document submitted prints first
- Operations: Add to queue (frequent), Remove from queue (frequent)
- Choice: Queue (FIFO behavior)
from collections import deque
print_queue = deque()
print_queue.append("document1.pdf") # Enqueue
print_queue.append("document2.pdf")
current_job = print_queue.popleft() # Dequeue - first in, first out
Scenario 4: Autocomplete suggestions
- Need: Find all words with common prefix
- Operations: Prefix search (frequent), Insert new words (occasional)
- Choice: Trie (prefix tree)
# Conceptual trie structure
# "cat", "car", "dog" stored as:
#        root
#       /    \
#      c      d
#      |      |
#      a      o
#     / \     |
#    t   r    g
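A minimal dict-based trie sketch (the '$' end-of-word marker is an arbitrary convention):

trie = {}

def insert(word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})  # Descend, creating nodes as needed
    node['$'] = True  # Mark the end of a complete word

def starts_with(prefix):
    node = trie
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True

for w in ["cat", "car", "dog"]:
    insert(w)
print(starts_with("ca"), starts_with("do"), starts_with("cow"))  # True True False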
Scenario 5: Finding shortest path in maps
- Need: Graph structure with weighted edges
- Operations: Find neighbors, find shortest path
- Choice: Adjacency list + Priority queue (for Dijkstra’s)
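A minimal sketch of Dijkstra's algorithm over an adjacency list, using heapq as the priority queue (the graph literal is illustrative):

import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, edge_weight), ...]}
    dist = {source: 0}
    pq = [(0, source)]  # (distance so far, node)
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float('inf')):
            continue  # Stale entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            new_dist = d + weight
            if new_dist < dist.get(neighbor, float('inf')):
                dist[neighbor] = new_dist
                heapq.heappush(pq, (new_dist, neighbor))
    return dist

roads = {"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": []}
print(dijkstra(roads, "A"))  # {'A': 0, 'B': 3, 'C': 1}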
Scenario 6: LRU Cache
- Need: Fast access, track recent usage, evict least recent
- Operations: Get (O(1)), Put (O(1)), Track order
- Choice: Hash table + Doubly linked list
# Conceptual structure
cache = {} # Hash table for O(1) access
dll = DoublyLinkedList() # Track usage order
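In Python, collections.OrderedDict can play both roles at once (hash access plus recency order); a minimal sketch, with an arbitrary capacity of 2:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # Mark as most recently used - O(1)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # Evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now the most recent
cache.put("c", 3)      # Evicts "b"
print(cache.get("b"))  # None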
Common Pitfalls:
- ❌ Using lists for frequent lookups (O(n) instead of O(1))
- ❌ Using arrays when size is unknown (costly resizing)
- ❌ Using BST when order doesn’t matter (hash table is faster)
- ❌ Using recursive structures without considering stack overflow
General Guidelines:
- Default to hash tables for key-value storage (most versatile)
- Use arrays when you need index-based access and size is manageable
- Use linked lists when you have many insertions/deletions
- Use trees when you need both ordering and fast operations
- Use specialized structures (stacks, queues, heaps) when they match your access pattern
Abstraction and ADTs
Abstract Data Type (ADT) is a theoretical concept that defines a data type by its behavior (operations) rather than its implementation. It specifies what operations can be performed, not how they are implemented.
Key Principle: Separate the interface (what operations are available) from the implementation (how they work internally).
Benefits of Abstraction:
- Flexibility: Can change implementation without affecting users
- Simplicity: Users don’t need to understand internal complexity
- Reusability: Same interface can work with different implementations
- Maintainability: Easier to update and fix bugs
Common ADTs:
1. Stack ADT
- Interface: push(), pop(), peek(), is_empty()
- Implementations: Array-based, Linked-list-based
- Behavior: LIFO (Last In, First Out)
# ADT Definition (Interface)
class StackADT:
    def push(self, item): pass
    def pop(self): pass
    def peek(self): pass
    def is_empty(self): pass

# Implementation 1: Array-based
class ArrayStack(StackADT):
    def __init__(self):
        self._data = []

    def push(self, item):
        self._data.append(item)

    def pop(self):
        return self._data.pop()

# Implementation 2: Linked-list-based
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class LinkedStack(StackADT):
    def __init__(self):
        self._head = None
        self._size = 0

    def push(self, item):
        new_node = Node(item, self._head)
        self._head = new_node
        self._size += 1

    def pop(self):
        if self._head is None:
            raise IndexError("pop from empty stack")
        value = self._head.value
        self._head = self._head.next
        self._size -= 1
        return value
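Because both implementations satisfy the same interface, client code works with either one unchanged:

for stack in (ArrayStack(), LinkedStack()):
    stack.push(1)
    stack.push(2)
    assert stack.pop() == 2  # Same observable behavior, different internals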
2. Queue ADT
- Interface: enqueue(), dequeue(), front(), is_empty()
- Behavior: FIFO (First In, First Out)
3. List ADT
- Interface: get(index), set(index, value), insert(index, value), delete(index), size()
- Implementations: Array-based (Python list), Linked-list-based
4. Map/Dictionary ADT
- Interface: get(key), put(key, value), remove(key), contains(key)
- Implementations: Hash table, BST, Skip list
5. Priority Queue ADT
- Interface: insert(item), remove_min(), min(), is_empty()
- Implementations: Heap, Unsorted array, Sorted array
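As a concrete instance of the last ADT, a minimal heap-backed priority queue sketch (the class name is illustrative):

import heapq

class HeapPriorityQueue:
    def __init__(self):
        self._heap = []

    def insert(self, item):
        heapq.heappush(self._heap, item)  # O(log n)

    def remove_min(self):
        return heapq.heappop(self._heap)  # O(log n)

    def min(self):
        return self._heap[0]              # O(1)

    def is_empty(self):
        return len(self._heap) == 0

pq = HeapPriorityQueue()
for x in (5, 1, 3):
    pq.insert(x)
print(pq.min(), pq.remove_min())  # 1 1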
Example: Why Abstraction Matters
Suppose you’re building a task scheduler. You start with a simple list:
# Version 1: Direct implementation (no abstraction)
tasks = []

def add_task(task):
    tasks.append(task)

def get_next_task():
    return tasks.pop(0)  # O(n) - shifts every remaining element!
Later, you realize this is slow. With abstraction, you can swap implementations:
# Version 2: Using ADT abstraction
from collections import deque

class TaskQueue:
    def __init__(self):
        self._tasks = deque()  # Changed implementation

    def add_task(self, task):
        self._tasks.append(task)

    def get_next_task(self):
        return self._tasks.popleft()  # O(1) - much better!

# User code stays the same!
queue = TaskQueue()
queue.add_task("Task 1")
next_task = queue.get_next_task()
Encapsulation in ADTs:
Good ADTs hide implementation details:
# Good: Implementation hidden
class Stack:
    def __init__(self):
        self._data = []  # Private by convention (leading underscore)

    def push(self, item):
        self._data.append(item)

# By convention, users shouldn't touch self._data directly

# Bad: Implementation exposed
class BadStack:
    def __init__(self):
        self.data = []  # Public - users can mess with it!

    def push(self, item):
        self.data.append(item)

# Problem with the bad design:
bad_stack = BadStack()
bad_stack.data = "oops!"  # Breaks the stack!
ADT vs Data Structure vs Implementation:
- ADT (Abstract): The concept/interface (e.g., “Stack”)
- Data Structure (Logical): How data is organized (e.g., “Array” or “Linked List”)
- Implementation (Physical): Actual code in a programming language
Example:
ADT: List
├─ Data Structure: Array
│   └─ Implementation: Python's list class
└─ Data Structure: Linked List
    └─ Implementation: Custom LinkedList class
Real-world analogy:
- ADT = “Car” (steering wheel, gas pedal, brake)
- Implementation = Electric car vs Gas car (different engines, same interface)
- You can drive both because the interface is the same, even though the internal mechanisms differ
Key Takeaway: When designing systems, think in terms of ADTs first. Define the operations you need, then choose or create an implementation. This makes your code more flexible, maintainable, and easier to optimize later.