Grokking Data Structures - Book Review
Contents
This blog post is about the book “Grokking Data Structures,” written by Marcello La Rocca.
Introducing data structures
- Python is a multi-paradigm language. Although it is fundamentally imperative, it supports several programming styles:
- Procedural/imperative: via statements, loops, and assignments.
- Object-oriented: via classes and objects.
- Functional: via map, filter, lambda, and higher-order functions.
- Declarative-like: through frameworks and libraries such as SQLAlchemy or Pandas expressions that emphasize what over how.
| Paradigm | Example Languages | Core Idea | Python Fit |
|---|---|---|---|
| Imperative | C, Python, Java | Specify how to perform tasks step by step | Strong |
| Declarative | SQL, HTML, Prolog | Specify what result you want | Partial |
| Functional | Haskell, Lisp | Avoid state, use pure functions | Partial |
| Object-Oriented | Java, C++, Python | Model data as interacting objects | Strong |

- Data structures are a way of organizing and storing information in a computer or a program. They help to efficiently manage and manipulate data
- Why are data structures and algorithms mentioned or studied together? Think of data structures as nouns and algorithms as verbs. A data structure implicitly defines algorithms for operations on it, such as adding, retrieving, and removing its elements
- Some data structures are specifically designed to allow the efficient execution of certain algorithms, such as hash tables for key-based search
- Why care about data structures?
- New trends take advantage of data structures that have emerged
- Makes you a better developer
- You will have more tools to play with and attack a problem
- Mental model of data structures
- Understand the problem you are solving
- Sketch out a possible solution
- Identify the data structures you need
- Implement a solution
- Check whether the solution works; if not, iterate
- Check whether the solution is good enough; if not, iterate
- Nice simple example of trying to register pets at a doctor’s clinic to illustrate stacks, queues and priority queues
- Algorithms and data structures complement each other the way nouns and verbs complement each other in a sentence
- Data structures are about how you store data. Algorithms are about how you use that data. Together, they determine how fast and how efficiently your program runs.
Static Arrays
- Think of memory as a modular shelf holding removable drawers
- Statically vs. dynamically sized arrays
- The array data structure has the following properties
- It stores a collection of data
- Its elements can be accessed by index
- Elements don’t have to be accessed sequentially
- Python offers lists that are closer to dynamic arrays
- The term array usually serves as a synonym for a statically sized array, a collection of elements accessed by an index, where the number of elements is fixed for the entire lifetime of the collection
- Many programming languages, such as C or Java, offer static arrays as a built-in feature
- Arrays can be initialized at compile time. If a language allows you to skip initialization, then the initial value of the array’s elements depends on the language
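Python has no built-in statically sized array, but the standard library's `array` module comes close: a compact, homogeneously typed collection. A small illustration (my own, not from the book):

```python
import array

# Five signed ints, preallocated; elements are machine-level ints,
# not full Python objects.
a = array.array('i', [0] * 5)

a[2] = 42          # O(1) random access by index
print(a[2])        # -> 42
print(len(a))      # -> 5
print(a.itemsize)  # bytes per element (typically 4 for 'i')
```

Note that `array.array` can still grow via `append`, so the fixed size here is only by convention.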
Sorted Arrays
- A sorted array is an array whose elements are kept in order as the array changes
- To maintain the elements of an array in order, we need a different approach when inserting and deleting elements. These methods must preserve the order and therefore require more effort than their counterparts for unsorted arrays
- On an already sorted array, we can run binary search, a search algorithm that can find a match by looking at fewer elements than linear search
- With sorted arrays, you have faster search, but you also have an extra cost to keep them sorted. Therefore, they should be preferred when there is a high read-to-write ratio
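A minimal binary search sketch (my own, not the book's listing), assuming the input list is already sorted:

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent."""
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            left = mid + 1    # target can only be in the right half
        else:
            right = mid - 1   # target can only be in the left half
    return -1

print(binary_search([2, 3, 5, 8, 13, 21], 8))  # -> 3
print(binary_search([2, 3, 5, 8, 13, 21], 7))  # -> -1
```

Each iteration halves the range still under consideration, so at most O(logn) elements are examined, versus O(n) for linear search.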
Big-O notation
- Understanding this helps in choosing the right data structure for the problem
- One can use profiling to check the performance of an implementation, but there are issues with relying on it
- Since we profile a specific implementation of an algorithm, the results are heavily influenced by the programming language we choose
- We can’t assume that the profiling tests will hold good for all implementations
- All profiling results are based on finite input
- The advantage of asymptotic analysis is that it is independent of any implementation
- Profiling is useful in the implementation phase, whereas asymptotic analysis is useful in the design phase
- The RAM (random access machine) is a good simplified model for understanding asymptotic analysis
- There are three types of analysis that can be done on the performance of an algorithm
- worst-case analysis
- average-case analysis
- amortized analysis
- The RAM model is a simplified computational model of a generic computer that provides only a limited set of basic instructions
- Big-O notation is used to classify functions based on their asymptotic growth. We use these classes of functions to express how fast the running time or memory used by an algorithm grows as the input becomes larger
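- A worked example: a running time of f(n) = 3n^2 + 5n + 2 is O(n^2), because f(n) <= 4n^2 for every n >= 6; the constants and the lower-order terms are dropped, and only the dominant growth rate is kept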
Dynamic arrays
- The main issue with dynamic arrays is that the underlying array must occasionally be resized and its data moved. Both are expensive operations, and there is no way to avoid them
- The key to dynamic arrays is the strategy one adopts when shrinking or growing an array
- A good strategy for growing an array is to double its capacity when it becomes full
- A good strategy for shrinking an array is to halve its capacity when the number of elements falls below a certain threshold
- Dynamic arrays are a way to get the best of arrays and add some flexibility. They are not a different type of data structure. They use fixed-size arrays but add a strategy to grow and shrink them as needed, reallocating the underlying static array each time it needs to be resized
- The best strategy for dynamic arrays is to double the size of the underlying static array when we try to insert a new element into a full array, and to halve the size of the array when, after removing an element, three-quarters of the array is empty
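A toy sketch of this doubling/halving strategy (my own illustration, not the book's code), using a plain Python list as a stand-in for the underlying static array:

```python
class DynamicArray:
    """Doubles when full; halves when three-quarters empty."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._data = [None] * self._capacity  # simulated static array

    def _resize(self, new_capacity):
        new_data = [None] * new_capacity      # allocate a new static array
        for i in range(self._size):           # O(n) copy on every resize
            new_data[i] = self._data[i]
        self._data = new_data
        self._capacity = new_capacity

    def append(self, value):
        if self._size == self._capacity:      # full: double the capacity
            self._resize(2 * self._capacity)
        self._data[self._size] = value
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty array")
        self._size -= 1
        value = self._data[self._size]
        self._data[self._size] = None
        if 0 < self._size <= self._capacity // 4:  # 3/4 empty: halve
            self._resize(self._capacity // 2)
        return value
```

The asymmetry between the grow threshold (full) and the shrink threshold (one-quarter full) is what prevents a pathological sequence of alternating appends and pops from triggering a resize on every operation.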
Shrink Operation:
Worst-case performance:
- The worst case for the shrink operation occurs when the number of elements in the array reaches the threshold for shrinking (e.g., 1/4 of the capacity) after a series of deletions.
- In this case, the shrink operation needs to create a new array with half the size and copy all the remaining elements to the new array.
- The time complexity of the shrink operation in the worst case is O(n), where n is the number of elements in the array before shrinking.
Average-case performance:
- The average case for the shrink operation depends on the distribution of the number of elements in the array and the frequency of deletions.
- On average, the shrink operation will be triggered less frequently than the worst case since deletions may not always reach the threshold for shrinking.
- The average-case time complexity of the shrink operation can be considered as O(n), but with a lower constant factor compared to the worst case.
Amortized-case performance:
- The amortized-case performance of the shrink operation takes into account the overall cost of multiple shrink operations over a sequence of deletions.
- Although each individual shrink operation has a worst-case time complexity of O(n), the amortized cost of shrinking is O(1).
- This is because the cost of copying elements during shrinking is offset by the larger number of deletions that can be performed before the next shrinking operation is triggered.
- The amortized analysis guarantees that the average cost per deletion, including the cost of occasional shrinking, is constant.
Delete Operation:
Worst-case performance:
- The worst case for the delete operation occurs when the element to be deleted is at the beginning of the array.
- In this case, all the elements after the deleted element need to be shifted one position to the left to fill the gap.
- The time complexity of the delete operation in the worst case is O(n), where n is the number of elements in the array.
Average-case performance:
- The average case for the delete operation depends on the position of the element to be deleted.
- On average, the delete operation requires shifting half of the elements in the array.
- The average-case time complexity of the delete operation is O(n/2), which simplifies to O(n).
Amortized-case performance:
- For deletions from the end of the array (a pop), the amortized cost is O(1): the only non-constant cost is the occasional shrink, whose O(n) copy is paid for by the many cheap deletions that must precede it.
- Deleting from an arbitrary position remains O(n) even in the amortized sense, because every such deletion pays the shifting cost, not just the occasional one.
It’s important to note that the amortized-case performance provides a more comprehensive analysis of the overall efficiency of the dynamic array operations, considering the long-term behavior and the cost of resizing operations.
In practice, the amortized-case performance is often more relevant than the worst-case performance because it gives a better indication of the average-case behavior and the efficiency of the data structure over a sequence of operations.
Linked Lists
- Dynamic arrays give us an illusion of flexibility, but they are not a different data structure. They are just a strategy to resize static arrays as efficiently as possible
- The main difference between an array and a linked list is that in the latter the data is not stored in contiguous blocks of memory. This means knowing the position of the first element of the linked list does not automatically determine the position of the rest of the elements of the list
- Since the location of each data element is not automatically known, each element needs to store the memory location of the next one
- The first element of a linked list is called its head, and the last element of the list is called its tail. The characteristic of a tail node is that its next pointer doesn’t point to another node
- Linked lists can be seen as two-tiered data structures: the internal nodes that store the data and the links, and an external shell or wrapper that implements the list itself and provides the API through which clients perform the usual operations
- Insertion at the end of a singly linked list takes linear time, while insertion at the front takes constant time. The asymmetry comes from how the data is stored: in a singly linked list, we keep a pointer to the head but not to the tail
- We cannot simply reuse the `_search` method when deleting a node, because deletion also needs a reference to the node's predecessor
- In a sorted linked list, it is difficult to improve the performance of searching beyond linear time
- Computer game example to highlight the importance of doubly linked list
- Circular linked lists can be built on top of either singly or doubly linked lists
- Circular linked lists are lists where the successor of the tail of the list is the head of the list. All nodes have a successor and all nodes have a predecessor
- Circular linked lists are used whenever we need to traverse a list repeatedly
- Operations such as insert, delete, and search are, for the most part, as fast in SLLs as in DLLs. Doubly linked lists are faster when we need to insert or delete an element at the end of the list
- DLLs require more memory than SLLs
- In a doubly linked list, each node also stores a pointer to its predecessor. Therefore, doubly linked lists can also be traversed from tail to head, making it easy to read the list in both directions.
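A minimal singly linked list sketch illustrating the two-tier design and the head/tail asymmetry (names are my own, not necessarily the book's):

```python
class Node:
    """Internal tier: holds one value and a link to the next node."""
    def __init__(self, data):
        self.data = data
        self.next = None

class SinglyLinkedList:
    """External tier: the wrapper that owns the head and exposes the API."""

    def __init__(self):
        self._head = None

    def insert_in_front(self, data):   # O(1): we always know the head
        node = Node(data)
        node.next = self._head
        self._head = node

    def insert_to_back(self, data):    # O(n): must walk to the tail first
        node = Node(data)
        if self._head is None:
            self._head = node
            return
        current = self._head
        while current.next is not None:
            current = current.next
        current.next = node

    def traverse(self):
        current = self._head
        while current is not None:
            yield current.data
            current = current.next
```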
Bags
Hierarchy
- An abstract data structure focuses on what operations you can perform on the data, without specifying how those operations are implemented
- A data structure specifies more concretely how data is represented, as well as algorithms for performing operations on the data
- There are three levels of hierarchy in data structures
- ADT is a theoretical concept that describes at a high level how data can be organized and the operations that can be performed on the data. It provides little or no detail about the internal representation of the data, how the data is stored, or how physical memory is used
- DS is the refinement of the spec provided by ADT where the computational complexity of its operations are normally discussed
- Implementation: we have to write code for the data structure, so we choose a programming language and translate the general instructions given by the data structure into code
- An ADT can be specified by many DSes. A DS can be implemented in many ways
- Arrays can fit into multiple levels of the hierarchy
- Linked lists and arrays are two refinements, two data structures, stemming from the same abstract data type
- Another example that can be used to distinguish between the three levels of the hierarchy is a light switch. It can be thought of as an ADT, as a data structure, and at the implementation level
Real-world analogies illustrating ADT, DS, and implementation
- Book Library System
- ADT: Library — defines operations such as add, remove, search, and borrow books. Abstract idea of managing a collection of books.
- DS: Could be implemented using a list, tree, or hash map to organize and retrieve books.
- Implementation: Actual database code or system (e.g., MySQL tables, JSON files, or Python dictionaries) that carries out these operations.
- Music Playlist
- ADT: Playlist — defines operations such as add song, remove song, play next, and shuffle.
- DS: Could use a linked list (for sequential play), array (for random access), or queue.
- Implementation: Spotify or Apple Music feature coded in Java/Kotlin/Python managing songs in memory or through an API.
- Restaurant Queue
- ADT: Queue — defines enqueue (customer joins line) and dequeue (customer is served).
- DS: Could be a linked list, array, or circular buffer.
- Implementation: POS system code that manages waiting customers written in Python, C#, or JavaScript.
- Calendar or To-Do List
- ADT: Scheduler — defines add event, delete event, and find next event operations.
- DS: Could use a priority queue or balanced binary search tree to manage events by time.
- Implementation: Calendar or task management app written in Swift or JavaScript using SQLite or heap structures.
- Light Switch
- ADT: Switch — has two states (on, off) and operations (turn on, turn off).
- DS: Could be modeled as a boolean variable or simple state machine.
- Implementation: Physical circuit or GUI button controlling an electrical or software process.
Containers
A container is an abstract concept, a definition of a large group of data structures with common characteristics.
- Containers are collections of elements. They hold multiple elements, which can be of the same or different types, and which can be stored in a particular order or without any order
- Containers typically provide the same set of basic operations to insert, delete, access, modify and search elements in the collection
- Containers can be traversed
- Containers can maintain the elements they store in a certain order
- Containers are designed to provide efficient access to their elements.
A bag can be considered an ADT, and it can be backed by any of the following data structures:
- Static arrays
- Dynamic arrays
- Linked lists
A bag can be implemented on top of basic data structure such as arrays and linked lists. The singly linked list implementation guarantees the best running time for both operations defined on a bag.
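A bag needs only two operations, insert and iterate, which a singly linked list serves in O(1) and O(n) respectively. A minimal sketch (my own illustration):

```python
class Bag:
    """Bag ADT on a singly linked list: insert and iterate, nothing else."""

    class _Node:
        def __init__(self, value, next_node=None):
            self.value = value
            self.next = next_node

    def __init__(self):
        self._head = None

    def insert(self, value):     # O(1): prepend at the head
        self._head = Bag._Node(value, self._head)

    def __iter__(self):          # O(n): visit every element once
        node = self._head
        while node is not None:
            yield node.value
            node = node.next
```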
Stacks
- Stacks work on the LIFO (last in, first out) principle
- If one uses a dynamic array as the data structure for implementing a stack, there are the following disadvantages
- Occasional costly resizing: while the amortized cost is O(1), occasional resizes can cause latency spikes
- If the stack fluctuates a lot in size, you will keep holding onto that large underlying array until explicitly trimmed
- If the stack grows large, finding a contiguous block big enough might be an issue in low-memory environments
- Most implementations don’t automatically reduce capacity when you pop elements
- When resizing occurs, all existing pointers/references/iterators to stack elements become invalid because the memory location changes
| Aspect | Dynamic Array Stack | Comment |
|---|---|---|
| Push (avg) | O(1) | Amortized constant time |
| Push (worst) | O(n) | When resizing occurs |
| Pop | O(1) | Always |
| Memory efficiency | Medium | May over-allocate |
| Reference stability | Poor | Resizing invalidates references |
| Memory layout | Contiguous | May cause fragmentation issues |
| Ease of implementation | Easy | Simple and cache-friendly |
- A linked list is one of the best ways to implement a stack
- However, the author shows that if one compares a dynamic array implementation of a stack with the linked list implementation, the former turns out to be much faster
- While the worst-case running time for push and pop is linear, their amortized running time is as good as linked lists.
- Python provides an optimized, extremely efficient implementation for `list`. This code is written in C and compiled, to make sure it's as efficient as possible
- Applications of Stack
- Call stack
- Evaluating expressions in postfix notation
- Undo/redo functionality in an IDE
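As a small illustration of the dynamic array approach and of the undo application (my own sketch, not the book's code):

```python
class Stack:
    """LIFO stack backed by a Python list (a dynamic array under the hood)."""

    def __init__(self):
        self._data = []

    def push(self, value):
        self._data.append(value)   # amortized O(1)

    def pop(self):
        if not self._data:
            raise ValueError("pop from an empty stack")
        return self._data.pop()    # O(1): removes from the end

    def peek(self):
        if not self._data:
            raise ValueError("peek on an empty stack")
        return self._data[-1]

# Undo functionality is just a stack of actions:
undo = Stack()
undo.push("type 'hello'")
undo.push("delete word")
print(undo.pop())  # -> 'delete word' (most recent action undone first)
```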
Quick Primer on cProfile and pstats in Python
What is cProfile
- A deterministic profiler built into Python’s standard library.
- It measures how much time your program spends in each function and how many times each function is called.
- Useful for identifying performance bottlenecks and slow functions.
Key Columns in Profiling Output
| Column | Meaning |
|---|---|
| ncalls | Number of calls made to the function |
| tottime | Total time spent in the function itself (excluding subcalls) |
| cumtime | Cumulative time spent in the function and all subcalls |
| percall | Average time per call (`tottime / ncalls` or derived from cumtime) |
How to Use
- Basic usage:
```python
import cProfile

prof = cProfile.Profile()
prof.enable()
# Run your code
prof.disable()
prof.print_stats()
```

- You can also wrap calls directly using:
```python
prof.runcall(function, *args, **kwargs)
```
What is pstats
- A companion module for formatting and sorting the results of cProfile.
- Lets you sort by various metrics (time, filename, number of calls, etc.).
- Example usage:
```python
import pstats
from pstats import SortKey

st = pstats.Stats(prof)
st.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats("pattern")
```
Common Methods in pstats
| Method | Purpose |
|---|---|
| strip_dirs() | Removes long directory paths for readability |
| sort_stats() | Sorts results by criteria such as `SortKey.TIME`, `SortKey.CUMULATIVE`, `SortKey.FILENAME` |
| print_stats("pattern") | Prints only matching functions (e.g., containing "stack") |
Example Workflow
- Create a cProfile.Profile() object.
- Run code inside prof.runcall() or between prof.enable()/disable().
- Pass the profile object to pstats.Stats().
- Sort and print results with desired filters.
Typical Analysis Goal
- Identify which functions dominate execution time.
- Compare different implementations (e.g., linked list vs dynamic array stack).
- Optimize functions that have high cumulative time.
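Putting the workflow together in one runnable snippet (the profiled function is a made-up stand-in; profile your own stack or list code instead):

```python
import cProfile
import pstats
from pstats import SortKey

def build_squares(n):
    # stand-in workload for illustration
    return [i * i for i in range(n)]

prof = cProfile.Profile()
prof.runcall(build_squares, 1_000_000)

stats = pstats.Stats(prof)
stats.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(10)
```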
Queues
- Queues are based on the FIFO (first in, first out) principle
- The method to insert an element into the queue is called `enqueue`
- The method to remove an element from the queue is called `dequeue`
- Queues are widely used in computer science and programming, including messaging systems, networking, web servers and operating systems
- A queue can be implemented using either arrays or linked lists to store its elements
- Using linked lists, enqueue and dequeue take O(1)
- Using static arrays, we can implement a linear queue, which can only support a fixed number of enqueue operations. Alternatively, we can implement a circular queue, where the array is imagined as a circular container.
- Advantages of choosing array over linked list for queue
- memory efficiency: an array will require less memory than a linked list to store the same elements. It has only a constant space overhead regardless of the number of elements in the array
- memory locality - single chunk of memory where all elements are stored side by side
- performance: operations on arrays are faster than on linked lists
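A sketch of the circular queue idea (my own illustration, not the book's listing): the front index and the computed rear index wrap around the fixed-size array via the modulo operator:

```python
class CircularQueue:
    """FIFO queue on a fixed-size array; indices wrap around."""

    def __init__(self, capacity):
        self._data = [None] * capacity
        self._capacity = capacity
        self._front = 0
        self._size = 0

    def enqueue(self, value):
        if self._size == self._capacity:
            raise ValueError("queue is full")
        rear = (self._front + self._size) % self._capacity  # wrap around
        self._data[rear] = value
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise ValueError("queue is empty")
        value = self._data[self._front]
        self._data[self._front] = None
        self._front = (self._front + 1) % self._capacity    # wrap around
        self._size -= 1
        return value
```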
Priority Queues and Heaps
- There are two important methods that we need to include in the interface of a priority queue - one to add a new element to the queue and one to get the element with the highest priority
- Priority queues can be implemented using various data structures, but the maximally efficient implementation is achieved using heaps
- One can use sorted or unsorted arrays, or sorted or unsorted linked lists, but with each of these you end up with O(n) for either insertion or top
| Data structure | insert | top |
|---|---|---|
| Sorted static array | O(n) | O(1) |
| Unsorted static array | O(1) | O(n) |
| Sorted linked list | O(n) | O(1) |
| Unsorted linked list | O(1) | O(n) |
| Heap | O(logn) | O(logn) |
- A heap is a data structure that strikes a balance between these extremes: it gives O(logn) for both insert and top
- A heap is a special kind of tree with three properties
- Binary heap is a heap where each node of the tree can have at most two children
- The heap tree is almost complete, which means that every level of the tree, except possibly the last level, is complete; furthermore, the nodes on the last level are as far left as possible
- Each node holds the highest priority element in the subtree rooted at that node.
- Heaps are best implemented on top of an array. This is possible because a heap is an almost complete tree
- With the array implementation of a heap, we can build a priority queue where insert and top take logarithmic time
- It is possible to transform an array of n elements into a heap in linear time, using the `heapify` method
- The explanation in this chapter is so elaborate and elegant that any newbie can follow it. It took me 30 minutes to read through the logic and understand the code behind heaps. I don't remember when I last looked at heaps and actually thought about their inner workings. The author uses a ton of visuals to illustrate the pseudocode behind the workings of a heap
- The example used in the book to illustrate the use case for heaps is finding the k largest entries in an array. The naive way would be to sort the array in descending order and take the first k elements, but that would be overkill for large k and large n: it would be O(nlogn). Using a heap instead makes it O(nlogk), which is much better (see the heapq sketch at the end of this section)
- Relearned the concepts of min-heap and max-heap. Maybe it was way back in 2012 or 2014 that I last had a chance to read through this stuff; I am revisiting it after almost a decade
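The promised k-largest sketch, using Python's standard `heapq` module (a min-heap; the book builds its own heap, so treat this as the stdlib shortcut rather than the book's code):

```python
import heapq

def k_largest(values, k):
    """Keep a min-heap of size k; its root is the smallest of the k largest."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)        # O(logk)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)     # pop the smallest, push v: O(logk)
    return sorted(heap, reverse=True)

print(k_largest([7, 2, 9, 4, 11, 5, 3], 3))  # -> [11, 9, 7]
```

Every element is processed with at most one O(logk) heap operation, giving the O(nlogk) total mentioned above.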
Binary search trees
- Trees can be used to implement several abstract data types
- A generic tree is a composite data structure that consists of nodes connected by links. Each node contains a value and a variable number of links to other nodes.
- terminology: root, leaf, internal node, descendant, children, subtree, height of the tree
- A node N is an ancestor of a node M if N is in the path between the root and M.
- A descendant of a node N is either one of the children of N or the descendant of one of the children of N
- All children of the same node are siblings. There is no direct link between siblings, so the only way to get from one sibling to another is through their common parent
- A tree can represent a partial relation, where some elements are comparable and ordered while others are not comparable or have no relationship between them
- Binary trees are defined by restricting each node to a maximum of two children, conventionally distinguished as the left and the right child
- Trees are a very large class of extremely versatile data structures
- Trees can also be used as containers. BST is one container that is a tree, it’s binary and so each node has a left and right child, and it’s used for searching
- BSTs have one advantage over sorted arrays: insertion and deletion can be faster on a BST
- To go from a binary tree to a BST, one needs to add a constraint on the data stored in the nodes
- For any node N that stores a value v, all nodes in the left subtree of N will have values less than or equal to v, and all nodes in the right subtree of N will have values greater than v
- Out of the operations defined on a BST, delete and traversal are the most complex
- For a BST, there are three ways to traverse it
- Pre-order, where we visit each node before its subtrees
- Post-order, where we visit the subtrees of a node before visiting it
- In-Order, where, given a node N, we first visit its left subtree, then N, then its right subtree
- Finding a predecessor and successor are trivial operations in linked lists and arrays. However these operations are complicated in a BST
- Deletions can make a BST skewed, which affects the running time of its operations: it can get close to O(n)
- There is a need for tree balancing because if we perform a lot of deletions on the tree, we get a skewed tree
- A BST is said to be height-balanced if for any node, the difference in height between its left and right subtrees is at most 1, and both of its subtrees are also balanced
- The most used balanced search trees are red-black trees and 2-3 trees.
- With a balanced binary search tree (BBST) operations such as search, insertion and deletion can be performed on the tree in O(log(n)) time.
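A minimal BST sketch (my own, not the book's listing) showing the invariant described above; note how the in-order traversal yields the values in sorted order:

```python
class BSTNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Keep the invariant: left subtree <= node < right subtree."""
    if root is None:
        return BSTNode(value)
    if value <= root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def search(root, value):
    """O(h), where h is the height: logn if balanced, up to n if skewed."""
    while root is not None and root.value != value:
        root = root.left if value < root.value else root.right
    return root

def in_order(root):
    if root is not None:
        yield from in_order(root.left)
        yield root.value
        yield from in_order(root.right)

root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)
print(list(in_order(root)))  # -> [1, 3, 6, 8, 10]
```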
Dictionaries and Hash Tables
- Common use case for dictionaries is to remove duplicates from a collection.
- Sorting an array and then removing duplicates from it is O(nlog(n))
- The ADT for dictionaries defines three operations:
- To insert a new value
- To retrieve the value associated with a key
- To delete a value
- Whether we use an array or linked list, sorted array or sorted linked list, one usually comes across O(n) either in insertion or deletion or search
- If one uses a BST, the cost is O(log(n))
- Hash tables are a way to leverage the main advantage of array-based storage: constant-time access by index
- One can simply take a base 256 representation of the key and then store it in the index associated with the base 256 representation. This leads to storing data in sparse arrays and becomes terribly inefficient in many practical examples. What is needed is a way to create an index of a reasonably small sized array, as compared to the values keys can take
- Using direct access table for storing associative data structures, leads to massive waste of memory
- Array used to store elements indexed by hashing is called a hash table. Hashing can describe the entire process of getting from an object to its index.
- Hash function can take the entire object as input and return a valid index
- Requirements of a hash function
- The domain of a hash function must be the set of all possible keys.
- A hash function must return a valid index
- Division and Multiplication methods can be used to generate indices based on keys.
- There are two types of conflict resolution mechanisms that are popular
- Chaining: Add the elements as a linked list to a specific index
- Open addressing: probe the cells until we either find what we are looking for or have probed all cells
- If the hash function used is deterministic or easily guessed by an attacker, it is possible to design a sequence of keys that will cause the hash table to perform very poorly. This has been the origin of vulnerabilities in servers written in several programming languages
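A toy hash table with chaining (my own sketch; Python's built-in `hash()` stands in for the division/multiplication methods mentioned above):

```python
class ChainedHashTable:
    """Resolves collisions by chaining: each bucket is a list of pairs."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        return hash(key) % len(self._buckets)  # map any key to a valid index

    def insert(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        bucket = self._buckets[self._index(key)]
        for k, v in bucket:             # scan the chain for the key
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.insert("apple", 3)
table.insert("pear", 5)
print(table.get("apple"))  # -> 3
```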
Graphs
- A graph is a pair of sets:
- a set of vertices: entities that are independent of each other and unique
- a set of edges: an edge is identified by a pair of vertices. The first one is called the source vertex, and the second one is called the destination vertex
- An edge whose source and destination are the same is called a loop
- Simple graphs are graphs without loops, with at most one edge between any two vertices
- Multigraphs can have any number of edges between two given vertices
- An edge can have a numerical value associated with it. Such a value is called a weight and the edge is called a weighted edge
- A graph is sparse if the number of edges is relatively small (on the order of the number of vertices)
- A graph is dense if the number of edges is close to the maximum possible (on the order of the number of vertices squared)
- In a directed graph, an edge has a direction. In an undirected graph, an edge can be traversed in both directions
- A connected component is a maximal subgraph in which any two vertices are connected by a path
- A directed graph is weakly connected if the undirected graph obtained by replacing directed edges with undirected ones is connected
- Two vertices are strongly connected if there is at least one path in the graph that goes from source to destination and one that goes from destination to the source
- Graphs are implemented as an adjacency list or adjacency matrix
- The breadth-first search algorithm explores the graph starting from a start vertex `s`, expanding a frontier of vertices connected to `s` until it reaches the target vertex. It can be implemented with a queue, and it is about finding the shortest path between a source node and a destination node in a graph
- If the edges are weighted, one uses a refined version of the algorithm called Dijkstra's algorithm
- Depth-first search is another kind of algorithm popular in the graph algorithms space, used where there is a need to find connected and strongly connected components, to check whether a directed graph is acyclic, and to find a topological sort for DAGs
- DFS can traverse all edges and visit all vertices of a graph, so its running time is O(n+m)
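A BFS shortest-path sketch on an adjacency list (my own illustration, not the book's listing):

```python
from collections import deque

def bfs_shortest_path(adj, start, target):
    """Return the path with the fewest edges from start to target, or None."""
    queue = deque([start])
    parent = {start: None}             # doubles as the visited set
    while queue:
        vertex = queue.popleft()
        if vertex == target:           # walk the parent links back to start
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = parent[vertex]
            return path[::-1]
        for neighbor in adj.get(vertex, []):
            if neighbor not in parent:
                parent[neighbor] = vertex
                queue.append(neighbor)
    return None                        # target not reachable

adj = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['E']}
print(bfs_shortest_path(adj, 'A', 'E'))  # -> ['A', 'B', 'D', 'E']
```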
Takeaway
The “Grokking” book series are fun books to read, as the authors make significant efforts to distill complex subjects into visuals and text that can be understood by all. For anyone looking for a high-level overview of the most popular data structures used in the computing world, this book doesn't disappoint. Most of the data structures are explained using Python, which makes the book accessible to a larger audience. The author has also written an advanced book on data structures for those who want to dig deeper.