Deriving Data Structures

Data Structures are the basic building blocks of software. In this article, one eBay engineer gives his perspective on how they might have evolved over time, enabling us to build complex things out of simpler parts.

When I started learning about data structures, I wondered how some pretty cool ones might have evolved over time and how things might have been invented on top of other ideas. It’s like someone inventing the wheel with no knowledge that one day it might be used to build a car. But as with everything else, concepts add up over time, and we are able to build complex things on top of simpler ones.

It all started with arrays, one of the oldest data structures. A pretty neat idea: a collection of identical things, but, more importantly, stored in contiguous memory locations. Any particular index can be accessed in O(1) time, which gives us fast random access. Why contiguous? We want fast random access, and the fastest way to achieve it is for the compiler to:

  1. Get the base address of the array.

  2. Calculate an offset by multiplying the array index by the size of each element.

  3. Add the base address to the offset to form the address of the desired array element.

  4. Load or store to the desired element using the calculated address.

This allows us to get the exact memory location needed for random access. In fact, with index registers, the content of the register can be added to an immediate address (one that is part of the instruction itself) to form the “effective” address of the actual data. So it makes sense to assign contiguous memory to arrays. Adding more elements than we estimated at the initial size means we need to find another free region of memory large enough for the whole array and copy all of the elements over. The first digital computers used machine-language programming to set up and access array structures. John von Neumann wrote the first array-sorting program (merge sort) in 1945, during the building of the first stored-program computer.
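
To make those steps concrete, here is a minimal Python sketch of that address arithmetic; the base address and element size are made-up values purely for illustration:

```python
# A minimal sketch of the address arithmetic behind O(1) array indexing.
# The base address and element size below are illustrative values only.
def element_address(base_address: int, index: int, element_size: int) -> int:
    """Compute the memory address of arr[index] for a contiguous array."""
    offset = index * element_size          # step 2: index * size of each element
    return base_address + offset           # step 3: base address + offset

# Example: a hypothetical array of 8-byte integers starting at address 0x1000.
print(hex(element_address(0x1000, 5, 8)))  # 0x1028 -- address of the element at index 5
```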

Linked lists are a little different. They just allow you to put data in and link it together like a chain. There is no random access, since you have to traverse from the start node. Again, you might ask: “Why would I need this? I have the array data structure to give me random access, so why should I live with this limitation?” The reason is that linked lists need not be contiguous in memory. You can allocate nodes anywhere and just connect them via pointers. Pretty cool! There is no need for them to sit next to each other in memory. This gives you flexibility when you don’t need random access and have no idea how much data you have to store. Conserving memory would have been super important on older computers, so the linked list lets us allocate just the right amount of memory for any dynamic, arbitrarily sized data set. In fact, linked lists were developed in 1955–1956 by Allen Newell, Cliff Shaw, and Herbert A. Simon at RAND Corporation as the primary data structure for their Information Processing Language. A few pages into their paper, they mention “flexibility of memory assignment.” The authors used the language and structure they proposed to develop several early AI programs, including a chess program.1
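
A minimal sketch of that chain of nodes in Python (the Node class and traverse helper are illustrative names, not taken from the original language):

```python
# A minimal linked-list sketch: nodes can live anywhere in memory and are
# connected by references, so no contiguous allocation is required.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

def traverse(head):
    """No random access: we must walk from the head, one node at a time."""
    node = head
    while node is not None:
        yield node.value
        node = node.next

# Build 1 -> 2 -> 3 by linking nodes as they are allocated.
head = Node(1, Node(2, Node(3)))
print(list(traverse(head)))  # [1, 2, 3]
```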

Then came the idea of hashing. What would be the fastest way to find a number in a list of numbers? The obvious answer is a binary search after sorting the numbers, which gives O(log n) lookups. The genius is to realize there might be a way better than O(log n): apply a function to the input number itself that tells us which index of the array it belongs to. Also, instead of N entries, let’s make an array with M entries, where M can be independent of N. Even if there is a collision, we can insert the entry into a linked list of colliding entries and then search that list sequentially. This would still be faster than searching all N entries. Notice the brilliance of combining the array and linked list data structures with a mathematical function to come up with the super cool hash table. So we solved the problem of efficient lookups in O(1) time via the miracle of the hash function pointing us to the array index. Hashing was invented in 1953 by Hans Peter Luhn.2
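
A minimal sketch of such a hash table with separate chaining, using Python lists to stand in for the array of M buckets and for the collision chains (the class and method names are made up for illustration):

```python
# A minimal hash table with separate chaining: an array of M buckets,
# where each bucket holds the chain of entries that collided there.
class ChainedHashTable:
    def __init__(self, m=16):
        self.buckets = [[] for _ in range(m)]    # array of M buckets

    def _index(self, key):
        return hash(key) % len(self.buckets)     # the hash function picks the array index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                         # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))              # new key (or collision): chain it

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alice", 42)
print(table.get("alice"))  # 42
```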

Now, let’s extend that thought even further. We can store millions of numbers in a hash table and quickly find out whether a number is present. What happens if we want to store a billion or a trillion numbers, or, instead of numbers, a larger data set, for instance, user addresses? In theory, we can still work with what was already invented (the hash table), but the problem is the memory needed to fit everything in. Storing such a big list is too memory intensive and almost impractical.

What if we store only a bit of information instead of the actual data? Since a hash table is backed by an array that holds the data, what if we have an array of bits? If we just want to find out whether something exists or not, we don't need to store the actual data. We can store a bit instead, like a bitmap. But now there is a problem. Remember that when we had collisions in hashing, we stored the colliding entries in a linked list that we could search. What happens in a bitmap? We can no longer know for sure whether the data exists, because the bit may have been set by a colliding entry that was never in the original set. It seems we are stuck.
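
A tiny sketch of that problem, assuming Python's built-in hash() and a made-up bitmap size: with a single hash over a bit array, any collision produces a false positive we cannot detect.

```python
# A single-hash bitmap: we keep only one bit per slot instead of the data itself.
M = 16
bits = [0] * M

def add(x):
    bits[hash(x) % M] = 1

def might_contain(x):
    return bits[hash(x) % M] == 1

add("alice")
print(might_contain("alice"))  # True
# But any value whose hash lands on the same slot also reports True -- the collision problem.
```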

But here comes another cool idea. What if, instead of using one hash function, we use two? How does that help? Here’s how: one function maps the data to an array index, say index x. Another function maps it to a second index, say y. We apply both hashes to the data and set the bits of our bitmap at both locations, x and y. What’s the benefit? If there is a collision for some data under one hash function, the chances of a collision under both hash functions are remote. If both indices (x and y) have their bits set, there is a high probability that the data is present. We can increase our array size and use 3, 4, 5, or more hash functions for each input we set in the bitmap. This way we can be almost sure that the data exists — not 100%, but almost. We can tell whether a particular piece of data is present in a large dataset with minimal memory overhead and with near-certain probability. This is nothing but a Bloom filter, which was originally conceived by Burton Howard Bloom in 1970.3
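
A minimal Bloom filter sketch along those lines; the k hash functions are simulated here by salting Python's built-in hash(), whereas a real implementation would use independent hash functions, and the sizes are arbitrary:

```python
# A minimal Bloom filter: k hash functions setting bits in one shared bit array.
class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Salted built-in hash stands in for k independent hash functions.
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True means "probably present" (false positives possible);
        # False means "definitely not present".
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
bf.add("user@example.com")
print(bf.might_contain("user@example.com"))  # True
print(bf.might_contain("someone-else"))      # almost certainly False
```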

The idea of combining multiple hashes into one data structure is ingenious. But what if we want to find not only whether something exists, but also how many times it exists? Let’s say I have a list of a million tweets, and I want to know how many came from one user. The idea is again to use multiple (N) hash functions, similar to a Bloom filter, but instead of backing them with one array, as for a Bloom filter, we back them with N arrays. In other words, if we have 3 hashes, we use 3 arrays. Think of it as a table with 3 columns (one for each hash) and rows corresponding to the entries of each array, where each cell holds a counter instead of a bit. For a given input, we hash it against each function and increment the counter at the resulting index in each array; in this case we increment 3 counters for 3 hashes in 3 arrays. When a new tweet with the same user ID arrives, we do the same, so a cell that held 1 now holds 2. Every time we hash, we increment the indexed value by one across all hash functions. How does that help? To find out how many times something appeared, we look at the indexes given by applying our hash functions to each of the arrays and then take the minimum of those values. Why the minimum? There is a possibility that a counter was also incremented because of a collision, so we can never be sure, but the minimum is the best guess we can make: the more hash functions we use, the lower the probability that every one of them collided, so hopefully the minimum value had zero collisions and is the true count of how many times the data appeared! This cool idea is called a count-min sketch. The count-min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described in a 2005 paper.4
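
A minimal count-min sketch along those lines, again simulating the independent hash functions by salting Python's built-in hash() (the width and depth values are arbitrary illustrations):

```python
# A minimal count-min sketch: one counter array per hash function.
# Estimates can only over-count (because of collisions), never under-count.
class CountMinSketch:
    def __init__(self, width=1024, depth=3):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]  # one array per hash

    def _index(self, row, item):
        return hash((row, item)) % self.width  # salted hash stands in for independent hashes

    def add(self, item):
        for row in range(self.depth):
            self.tables[row][self._index(row, item)] += 1

    def estimate(self, item):
        # Take the minimum across arrays: the counter least inflated by collisions.
        return min(self.tables[row][self._index(row, item)] for row in range(self.depth))

cms = CountMinSketch()
for _ in range(5):
    cms.add("user_42")
print(cms.estimate("user_42"))  # 5 (or slightly more if every array had a collision)
```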

So, we solved the problem of finding whether x exists and of counting how many times x exists. Now let’s move to a different one: how to find differences between data blocks. Imagine one computer sends lots of data to another. How can the receiving computer verify that all the data is accurate and has not been manipulated? We can use the same magic of hashing, but now we take the hashes of all the records and form a tree out of them, as shown in the following figure:

[Figure: hash tree]

Now if we send A, B, C, and D, and the top root hash is stored somewhere it can be verified, say by a trusted party, any computer can check whether the data is sane or manipulated by comparing the top hash. If the root hashes do not match, go down one level to find the first mismatch. Keep going down, and you can find the culprit data block that was manipulated. This concept is known as a Merkle tree, which was invented and patented by Ralph Merkle in 1979.5
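
A minimal sketch of computing a Merkle root over data blocks, assuming SHA-256 as the hash and duplicating the last hash when a level has an odd number of nodes (one common convention, not the only one):

```python
# A minimal Merkle tree sketch: hash the blocks, then repeatedly hash pairs
# of child hashes until a single root hash remains.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    level = [sha256(b) for b in blocks]           # leaf hashes
    while len(level) > 1:
        if len(level) % 2:                        # duplicate the last hash if the level is odd
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])  # parent = hash(left || right)
                 for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"A", b"B", b"C", b"D"]
root = merkle_root(blocks)
# Tampering with any block changes the root, so a trusted root hash lets the
# receiver detect (and, with intermediate hashes, locate) the bad block.
print(merkle_root([b"A", b"B", b"X", b"D"]) == root)  # False
```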

Notice how we started from the array and the linked list and invented the hash table; using the hash table, we invented Bloom filters and then count-min sketches; and then we combined hashes with trees to invent Merkle trees. Pretty cool, isn't it?

Essentially, there are three different classes of data structures, based on how they might have evolved:

  1. Core Data Structures — These are the classic structures NOT dependent on anything else. Arrays and linked lists fit this category. They are as basic as can be; we all know their properties, their benefits and limitations, and when to use which one. Even stacks, queues, trees, and graphs can fit in here.

  2. Partially Derived Data Structures — These are things like HashMap or HashSet. They introduced the concept of hashing, but they cannot exist on their own, since they need to be backed by a core data structure, which in this case is the array. People might have thought of them as an extension of arrays.

  3. Fully Derived Data Structures — These are things like Bloom filters, count-min sketches, Merkle trees, etc., which are fully derived from core and partially derived data structures.

It’s interesting to think how these things might have evolved and what problems would have led to their discovery. Some of these structures are nothing but pure magic, and understanding them gives an “aha!” moment that reveals the genius behind them.

What is your favorite Data Structure? Please share in the comments section below.

References:

  1. Programming the Logic Theory Machine

  2. Hans Peter Luhn and the Birth of the Hashing Algorithm

  3. Space/Time Trade-offs in Hash Coding with Allowable Errors

  4. An Improved Data Stream Summary: The Count-Min Sketch and its Applications

  5. A Digital Signature Based on a Conventional Encryption Function