Hi folks 👋
Continuing my systems design series this week with two new articles (one of which I forgot to post last week). If you’re doing a systems design interview soon or you just want to know more about web-scale applications, continue to follow along!
Data models and query languages
This chapter focuses on how you can organize and retrieve data from a database, covering the three primary data modeling paradigms:
Relational
Document
Graph
Relational model
This is the incumbent. In the relational model, the relation is the table, each individual tuple within it is a row, and the properties of that data live on the columns. You know, like any other table you would see in the real world.
I grew up with these database systems. They are what I studied in college. My master’s thesis was on RDBMS. They have been the dominant force in data management for most of the last 40 years.
The most cited open-source examples of relational databases are MySQL and Postgres. They’re still useful and are the first choice for many frameworks like Ruby on Rails or Django. It’s just that they suffer from a few problems:
They cannot scale without considerable effort. They are much harder to replicate or partition than they should be.
Limited expressiveness. Querying and organizing relational schemas fits a specific mold. You have to be okay with that or else you’ll have to seek alternatives.
So what are the alternatives?
Document model and the rise of NoSQL
NoSQL (which people would now call “Not Only SQL”) describes any kind of database that doesn’t follow the relational model. One thing I want to highlight is the term polyglot persistence. This refers to the idea that you can have both relational and nonrelational databases working together. It’s not like you have to choose one paradigm over the other. In fact, it might be advantageous to use both for your application to serve different needs.
I think this is important to call out because NoSQL implies that you’re specifically not choosing SQL. It’s as though they can’t work together in an application but in fact, they can and do.
Something like a resume could be a series of joins between a user and their job experiences, or you could nest those joins into one cohesive JSON model. Nesting gives you:
Just one thing to fetch.
A shape that actually looks like the thing you’re representing (a resume is a document). The tradeoff is that many-to-one and many-to-many relationships don’t work so well.
Speed: no joins required since the document is self-contained (which also means poor support for combining disjointed data).
High flexibility: no enforced schema (which also means no referential integrity).
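To make the resume example concrete, here’s a minimal sketch of the same data in both shapes (all field names here are invented for illustration):

```python
# Document model: one self-contained resume, fetched in a single read.
resume = {
    "user": {"name": "Ada Lovelace"},
    "positions": [
        {"title": "Analyst", "company": "Babbage & Co", "years": 2},
        {"title": "Engineer", "company": "Acme", "years": 3},
    ],
}

# Relational model: the same data normalized into tables joined on an id.
users = [{"id": 1, "name": "Ada Lovelace"}]
positions = [
    {"user_id": 1, "title": "Analyst", "company": "Babbage & Co", "years": 2},
    {"user_id": 1, "title": "Engineer", "company": "Acme", "years": 3},
]

# Reassembling the document requires a join-like pass over the tables.
joined = [p for p in positions if p["user_id"] == 1]
assert len(joined) == len(resume["positions"])
```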
How do you know the document model is right for you?
Are you working with documents like profiles, resumes, HTML pages? It’s great. If you can populate that whole document it’s very efficient. If you have a lot of nested relationships or links to other documents it starts to become cumbersome.
A list of jobs might be two separate tables in a relational model. In the document model, the jobs are all listed inside the single document. The exception is if you have created a separate document for every job type; at that point, you’re heading in the wrong direction.
One-to-many, tree-like data structure? Good. Many-to-many, graph-like data structure? Bad - stick with a relational or graph database.
So when does something look like a document in the real world? The two often-cited examples I see are:
CMS. The primary object of a blog engine is a blog post which is a document. The important thing is that it is unstructured. The content of the blog doesn’t need to fit a particular format.
E-commerce product catalog. Imagine a service like Amazon. They have tons of products. Other than the SKU there is really no mandatory format for structuring how you display a product. This falls into the large volumes of unstructured data category which could be a good fit for a document model database.
Mongo is to Python as Postgres is to Java
This analogy just kept swimming in my brain. One of the key benefits (and drawbacks) of a document database is its loose constraints on schemas. The book makes an interesting analogy to dynamic (runtime) type checking: no schema is validated on write on the way into the database, but you can validate a schema on read. This means you can only be assured of the structure of the document when it is being read from the database.
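A tiny sketch of schema-on-read, with a hypothetical read_profile helper standing in for application code (the field names are made up):

```python
import json

def read_profile(raw: str) -> dict:
    """Schema-on-read: the structure is checked only when the document
    leaves the database, not when it was written."""
    doc = json.loads(raw)
    # Fill in a default for documents written before this field existed,
    # the way application code absorbs a schema change in a document DB.
    doc.setdefault("tv_shows", [])
    return doc

# The database happily stored this write without validating it:
stored = '{"name": "Sam", "bands": ["Rush"]}'
profile = read_profile(stored)
assert profile["tv_shows"] == []  # shape is guaranteed only at read time
```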
SQL is to CSS as JSON is to the DOM
In a similar but not nearly as clean analogy, we can further think of the paradigms of database programming languages. Relational databases use declarative SQL like CSS. They optimize around patterns rather than algorithms because you just declare how you want to filter out your data and you let the language choose the algorithm. This allows for terse, parallelizable code where the language can handle performance optimization and you focus on just getting your data.
Since NoSQL databases don’t necessarily use SQL their method of storage and retrieval revolves around their documents, commonly stored in formats like XML and JSON. In theory, you can use any old imperative programming language like JavaScript to interact with the database. This is nice in theory since you are programming UIs and business logic in the same programming paradigm. The downside is that your code is likely more verbose, uglier, and requires a specific algorithmic sequence to execute. Think of it like styling a page with JavaScript; you could - but why would you do that to yourself?
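A quick illustration of the difference, with invented data. The declarative query states what you want; the imperative version spells out the loop and the filter yourself:

```python
# Declarative: you describe the result; the engine picks the algorithm.
query = "SELECT name FROM animals WHERE family = 'Sharks'"  # illustrative only

# Imperative: you choose how to walk the data, step by step.
animals = [
    {"name": "Great white", "family": "Sharks"},
    {"name": "Dolphin", "family": "Mammals"},
    {"name": "Hammerhead", "family": "Sharks"},
]

def sharks(rows):
    out = []
    for row in rows:                    # you chose the iteration order
        if row["family"] == "Sharks":   # you chose where the filter runs
            out.append(row["name"])
    return out

assert sharks(animals) == ["Great white", "Hammerhead"]
```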
Writing is not my strength…
Reads are the name of the game here. Small, unmodified documents are best; anything else and you risk heavy performance penalties. Documents get rewritten as a whole, so minor updates to big documents are a huge drain on performance. The benefit of having all of the data shoved into one document is removing the need for complicated things like joins. The drawback is that without joins you have to colocate data, which can become expensive as the document grows large.
…and it may not matter
Postgres supports JSON columns. Mongo supports foreign references in a weird sort of join operation. The truth is, the document and relational models may not be all that different soon so you aren’t locked into one idiom between the major database providers anymore. The book goes into an example of this with the MapReduce extension on MongoDB. I didn’t think the example was worth stating here but the last sentence on this topic is a great summary of what I took away from this section:
The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise
Graph models
Remember how earlier I said one-to-many relationships were a good fit for document databases and many-to-many were good for graph or relational databases? Now let’s pick that apart further.
If your many-to-many relationship is simple, you can probably get away with a relational database. For complex many-to-many relationships, a graph database is a good choice.
What’s an example of a complex many-to-many relationship? You already know canonical examples:
Social networks. People know lots of other people. Just ask Kevin Bacon.
The web. Web crawlers are designed to demystify the tangled mess of connecting and indexing every website.
Transportation. Roads, railroads, and airports all connect points across the globe. This is a hard problem and one I’ve explored before.
To be honest, I found this section to be a bit dense on specific kinds of graph databases that may not be relevant today so I’ll gloss over each section with the critical parts I took away.
Property graphs and Cypher
Neo4j is the example I see all over the web on graph databases. I think if you just knew this one and Amazon Neptune (which seems to support all kinds of graph database models) you’d be fine.
This is how I would imagine you would design a graph data structure at scale. There are two main objects in a property graph DB: the vertices and the edges. Like rows in any database, each of those objects has a unique ID. They also carry information about where they connect and what connects to them. And as the name “property graph” suggests, each of these objects also has metadata in the form of key-value pairs.
The power here is in the flexibility to build graphs and the tagging you can store to find your vertices and edges. Property graphs don’t restrict how you construct your graph. Graphs don’t require a schema other than the basic vertex and edge. So you’re free to design it however you like. The power of the model is in how you relate things.
These key-value tags make it easy to find information in the graph without ruining your data model. It also means you don’t have to traverse the graph to find information. Tags and IDs allow you to quickly look up vertices or edges without having to go through all of the graph connections. They’re like shortcuts for traveling across the network.
If you need to evolve your features it doesn’t ruin the data. The key-value pairs allow you to create new associations between your core objects without your graph fundamentally having to change. Facebook sounds like a good example of this: people are your vertices and their connections are the edges. In the old days of Facebook, you could list out your favorite bands or movies. A property graph database would allow you to add something like TV shows without having to totally screw up your existing graph or its associations.
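A toy sketch of that structure in plain Python dicts. The vertex/edge shapes and the outgoing helper are my own invention for illustration, not Neo4j’s actual model or API:

```python
# Vertices and edges each get an id plus free-form key-value properties.
vertices = {
    "v1": {"label": "person", "properties": {"name": "Alice"}},
    "v2": {"label": "band", "properties": {"name": "Rush"}},
    "v3": {"label": "tv_show", "properties": {"name": "The Wire"}},
}
edges = {
    "e1": {"tail": "v1", "head": "v2", "label": "likes_band", "properties": {}},
    # Adding a new kind of association is just a new edge label;
    # nothing about the existing graph has to change.
    "e2": {"tail": "v1", "head": "v3", "label": "likes_show", "properties": {}},
}

def outgoing(vertex_id, label=None):
    """Follow edges out of a vertex, optionally filtered by edge label."""
    return [
        vertices[e["head"]]["properties"]["name"]
        for e in edges.values()
        if e["tail"] == vertex_id and (label is None or e["label"] == label)
    ]

assert outgoing("v1", "likes_band") == ["Rush"]
```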
Cypher is to property graphs as SQL is to relations. That’s all you need to know here. It’s a declarative programming language for Neo4j in the same way SQL is a programming language for databases like MySQL, Postgres, MariaDB, and SQLite.
Triple-store graphs and SPARQL
This is another graph database variant. Instead of vertices and edges as the primary objects, a triple-store graph uses, you guessed it, three objects: subjects, predicates, and objects. This one intuitively makes less sense to me than a property graph because I don’t think of a graph the way I think of an English sentence.
Here is how you make sense of these three things:
Subjects are always vertices. They are nouns: things like people and fruit.
Objects and predicates take one of two forms:
The object is another vertex. Then the predicate is the edge connecting the subject and object together.
The object is a primitive value (a string or number). Then the predicate is a property of the subject, and the object is its value.
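The same idea as plain tuples (a toy sketch, not any real triple-store’s API):

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("alice", "knows", "bob"),   # object is another vertex -> edge
    ("alice", "age", 34),        # object is a primitive -> property of alice
    ("bob", "knows", "carol"),
]

def objects_of(subject, predicate):
    """All objects attached to a subject via a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

assert objects_of("alice", "knows") == ["bob"]
assert objects_of("alice", "age") == [34]
```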
The book lists Datomic as an example of a triple-store graph, though Datomic doesn’t market itself as a graph database so much as an amalgamation of many flexible data models. For the practical purposes of this audience, you’re unlikely to run into a scenario like this.
SPARQL is to triple-store graphs as Cypher is to property graphs. SPARQL is the programming language for triple-store graph databases using the RDF data model for the semantic web. Given that the Semantic Web never really took off, it’s more of a reason why you don’t need to invest more than a few paragraphs into understanding these things.
Further reading and study
Since this was a much meatier chapter, I found I had some additional notes from the previous walk-through that were relevant here. There is another video in that same series I used for practice this week.
I found myself referencing a few articles which may be helpful in understanding NoSQL databases:
https://shopify.engineering/five-common-data-stores-usage - Shopify did a nice job of simply explaining the use cases for five different kinds of data stores. The segmentation is a bit odd here and it doesn’t align with what I’ve read in the book but it’s still factual information even if the categorization is faulty.
https://www.prisma.io/dataguide/intro/comparing-database-types#relational-databases-working-with-tables-as-a-standard-solution-to-organize-well-structured-data - This Prisma article was great because it expanded my knowledge even further beyond NoSQL databases to NewSQL, multi-model, time series, and more.
https://blog.nahurst.com/visual-guide-to-nosql-systems - I like that this post has a diagram that revolves around the CAP theorem. It helped me place where NewSQL fits and how to actually tell the difference between HBase and Cassandra.
https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - Another mega list of NoSQL solutions which categorizes based on popularity rather than storage type. I actually really liked this because it simplified my thought process to a pretty rock solid list:
Want relational? Use Postgres.
Want an in-memory key-value store? Use Redis.
Want relational without the schema? Use MongoDB.
Want web analytics? Use Cassandra.
Want a full-text search? Use ElasticSearch.
Working with documents, specifically a CMS or CRM? Use CouchDB.
Building a search engine or log analysis? Use HBase.
Working with graphs? Use Neo4J or OrientDB.
System design tradeoffs between SQL and NoSQL solutions
A few questions you have to ask yourself when comparing SQL and NoSQL solutions as they relate to scalability:
How do you scale reads? Writes?
For SQL solutions, you’ll want to shard servers when the database runs out of space. Use a cluster proxy to abstract away the shards. Use a configuration service like ZooKeeper to ensure all shards are online and be sure to toggle backups in the event of a downed shard.
For NoSQL solutions, you’ll also want to perform sharding. However, there is no need for a cluster proxy because each shard knows about the others. Thus, no configuration service is needed.
How do you make reads/writes fast?
For SQL solutions, you’ll want to use a shard proxy to cache results, publish metrics, and terminate long-running queries.
For NoSQL solutions, no shard proxies are required because all nodes know about each other. Quorum nodes are used to ensure you don’t need to talk to every node to get a consensus on whether or not the operations succeed.
How do you prevent data loss?
For both SQL and NoSQL solutions, you’ll want to create read replicas of shards in different data centers.
How do you achieve consistency?
For SQL solutions, you’ll want to sync leader data to follower replicas. This leads to consistent data, but it also takes a long time to process.
For NoSQL solutions, data is pushed to the read replicas asynchronously. This provides eventual consistency and is fast.
Other questions worth asking
How do you recover data in case of an outage? Data recovery is much harder for a relational system that sacrifices partition tolerance, so you’ll need to focus more on cold storage backups. NoSQL solutions find it much easier to replicate their replicas across data centers.
How do you ensure the data is secure? I only saw mentions of Accumulo and Riak specializing in security. Otherwise, every database needs to worry about this regardless of storage type.
How extensible is the DB to an evolving data model? Evolving data models are better suited for a graph database than a document database.
How easy is it to run on/off-prem? This question was asked in the systems design video from last week but I don’t think most companies really need to worry about on or off-premises solutions anymore.
How much will it cost? Most solutions these days are open-source. The cost will come down to your strategy for your horizontal scaling and partitioning.
Storage and retrieval
This chapter is all about getting and sending data. There are two main kinds of data stores:
OLTP. This includes your traditional relational and NoSQL databases. They comfortably store terabytes of data, are highly available, and are designed for the end-users of your applications.
OLAP. This includes data warehouses from names like Oracle and IBM Db2. They store into the petabytes of data, are primarily read-only copies of several OLTP databases, and are designed for business analysts.
One thing to note about OLAP data stores is the simplicity of their data models. While the entire last chapter was dedicated to the variety of OLTP data models, there are really only two models in use for OLAP systems:
Star schemas, named after the star-like shape representing the central fact table that points out to supporting dimension tables
Snowflake schemas, which add on sub-dimensional tables to the above star schema with a more complex, snowflake-like pattern
The seemingly obvious comparison is that snowflake schemas are more complex because of their additional normalization with sub-dimensional tables. Conversely, star schemas are simpler to conceptualize with less branching. Either way, both support lots (100+) of columns.
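A minimal star schema sketched with Python’s built-in sqlite3 module. The table and column names are invented for the example; a real warehouse fact table would have far more dimensions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# One central fact table pointing out to dimension tables (the "star").
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER
);
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO dim_date VALUES (10, '2024-01-01');
INSERT INTO fact_sales VALUES (1, 10, 3), (1, 10, 4);
""")

# Analytics queries join the fact table out to its dimensions.
total, = cur.execute("""
    SELECT SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    WHERE p.name = 'Widget'
""").fetchone()
assert total == 7
```

A snowflake schema would go one step further and normalize the dimension tables themselves into sub-dimensions.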
Column-store databases for big data
Wide-column fact tables benefit from column-store databases. As the name suggests, you’re inverting the way data is stored and retrieved. Rather than store data in rows, where each entry corresponds to a single object and various properties of that object in columns, you store all of the values for a single property together.
Since a massively wide fact table is likely only utilizing several columns in a given query, data organized by column lends itself well to these kinds of analytics queries. You can partition and shard data by column which makes searching against particular columns even more efficient.
Since all data in a given column adheres to the same type, you can think of a column as one giant array. This makes column-store databases effective at compressing data with techniques such as bitmap encoding. With more stuff able to fit in memory, column stores can effectively use computer caches to leverage lots of CPU cycles to iterate through data. You can further improve efficiency when grouping or filtering by sorting your columnar data.
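A toy illustration of the two layouts (the data is invented). Notice how the analytics query touches only the columns it needs:

```python
# Row-oriented: each entry keeps all of its properties together.
rows = [
    {"sku": "A1", "region": "US", "units": 5},
    {"sku": "B2", "region": "EU", "units": 7},
    {"sku": "C3", "region": "US", "units": 2},
]

# Column-oriented: every value for one property is stored together,
# so each column is effectively one big homogeneous array.
columns = {
    "sku":    ["A1", "B2", "C3"],
    "region": ["US", "EU", "US"],
    "units":  [5, 7, 2],
}

# A query over two of many columns never has to read the rest.
us_units = sum(
    u for r, u in zip(columns["region"], columns["units"]) if r == "US"
)
assert us_units == 7
```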
Finally, column stores can leverage highly specialized materialized views called data cubes. Data cubes make it easy to reference a lot of aggregated data. This is a way to precompute data and display it for fast retrieval.
All of these make reads incredibly efficient for even the largest data sets. The clear downside is the cost to write data. A table with 500 columns might now be spread out across 500 files, one for each column. That means adding a new entry means writing 500 properties to 500 files. If that entry just happens to max out the file size on disk you’ll have to perform even more writes to partition and segment out your data. In other words, writing is expensive. As we’ll see later on in this post, there are tools like LSM-Trees that can effectively speed up data storage.
Data retrieval techniques
Storing data is straightforward. Write data to an append-only log and save the file to disk.
Retrieving data, on the other hand, that’s where things get interesting. The book presents a progressively complex way of retrieving data efficiently.
Hash index
Indexing is the general strategy for retrieving data quickly. By writing only the data you search against into memory you trade space for speed. The downside, besides the additional space requirements, is the cost to write the data. Data stored in multiple places requires multiple writes to disk and takes more time.
One way to index is with a hash code, just as hash maps/tables relate keys to values. An example would be Bitcask. Bitcask is the storage engine powering Riak.
As we saw from the YouTube example, we need to count video views. The video offers Cassandra as the data storage mechanism of choice; the book might suggest Riak with Bitcask as an ideal alternative. Bitcask keeps an in-memory hash table separate from what is stored on disk, which keeps reads fast even under frequent writes.
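Here’s a rough sketch of the Bitcask idea: an append-only log on disk plus an in-memory dict mapping each key to the byte offset of its latest value. The class and the line-based file format are my own simplification, not Bitcask’s actual implementation:

```python
import os
import tempfile

class HashIndexedLog:
    """Toy hash-indexed storage engine: writes append to a log file;
    reads seek straight to the offset remembered in memory."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest value

    def set(self, key, value):
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()  # remember where this record starts
            f.write(f"{key},{value}\n".encode())

    def get(self, key):
        offset = self.index[key]        # one seek, no scanning
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "log"))
db.set("views:video42", "100")
db.set("views:video42", "101")  # stale entry stays in the log; index moves on
assert db.get("views:video42") == "101"
```

Compaction (dropping the stale `"100"` record) would happen in a background pass over the log.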
Everything about this sounds great, right? We have a method that is:
Scalable: we can shard the data on disk while keeping the indexes in memory
Fault-tolerant: the index can live in a separate data center from the data on disk, so losing one doesn’t take out the other
Eventually consistent: indexes can now be updated independently of the data stored on disk
High throughput: the right hashing strategy limits collisions which allow for a lot of conflict-free storage
There are two limitations with hash indexes:
Slow range queries on indexes. Since keys are hashed, adjacent keys end up scattered across the index, so you gain no speed searching a range of values.
Size limitations. If you run out of memory the whole thing falls apart. Extreme-scale applications that fill a hash table index completely are the one exception to scalability.
SSTable
The next evolution of hash indexes is Sorted String Tables, also called SSTables. SSTables provide some additional benefits:
They only store keys once, saving on storage space.
They sort the keys to further speed lookups.
They use a variant of merge sort to merge and compact segments.
The organization of that data can be implemented as a red-black or AVL tree, allowing you to write in any order and read in sorted order. This makes it easier to store in memory.
I didn’t touch on segments earlier. A segment is just a means of sectioning off a data log. To prevent running out of space you need a means of separating your indexes across files. To reduce the space requirements we compact the segments by removing duplicate keys and only keeping the most recent write to that key.
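A toy compaction pass over two sorted segments, using heapq.merge as the merge-sort step. This is a simplification of what a real storage engine does; segments are listed oldest to newest so the newest write wins:

```python
import heapq

def compact(old, new):
    """Merge two sorted segments; for duplicate keys, keep only the
    value from the newer segment."""
    merged = {}
    # heapq.merge streams both already-sorted segments in key order.
    for key, value, is_new in heapq.merge(
        ((k, v, False) for k, v in sorted(old.items())),
        ((k, v, True) for k, v in sorted(new.items())),
    ):
        if key not in merged or is_new:
            merged[key] = value
    return list(merged.items())  # still sorted by key

old = {"apple": 1, "banana": 2, "cherry": 3}
new = {"banana": 20, "date": 4}
assert compact(old, new) == [
    ("apple", 1), ("banana", 20), ("cherry", 3), ("date", 4)
]
```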
This implementation of an in-memory balanced tree for keeping a cascading tier of SSTables is aptly called a Log-Structured Merge-Tree or LSM-Tree for short. Lucene, the Apache search library that powers ElasticSearch and Solr, uses something like an LSM-Tree to store its dictionary of terms and SSTable-like files to map those terms to postings.
More specifically, if you’re dealing with a dictionary of words you’d want to use a structure that tailors itself to strings instead of numbers. We’ve already evolved past a hash table so that is out. A self-balancing binary tree is out, too. Suffix/prefix trees like tries are specific to strings but suffer from the same problem of taking up lots of space and eating away at RAM. So what’s left? That’s where the mighty Bloom filter comes to the rescue.
A Bloom filter is a probabilistic data structure that is sort of an inversion of other data structures. Rather than using the information to find if data is contained within a structure, a Bloom filter tries to determine if something is definitely not in the set or if it might be in a set. The “might” part here is key because it’s not guaranteeing validity. There is a chance for a false positive.
With a sufficient number of hashes, you can obtain a low false-positive rate to obtain extremely accurate results with a very low amount of indexing space. This solves our biggest problem with indexes when the data is really large and you’re worried you will run out of in-memory storage for your index. A Bloom filter can handle extreme scale and still perform well with the tradeoff that your results aren’t always 100% accurate.
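A minimal Bloom filter sketch. The bit-array size and the hash scheme (salted SHA-256) are chosen arbitrarily for illustration; real implementations tune both to hit a target false-positive rate:

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits per key. Answers either
    "definitely not present" or "possibly present"."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive k positions by salting the key with the hash index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # If any bit is unset, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("banana")
assert bf.might_contain("banana")  # never a false negative
# Absent keys are *usually* reported absent; false positives are possible.
```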
B-tree
We’ve talked about efficiently retrieving data in memory. What if you want to retrieve data on disk? The B-tree is the most widely-used and heavily studied database index lookup tree. I had to build one of these for my relational database class in college. It’s another self-balancing tree structure used in all of your major RDBMSs like Postgres and MySQL.
While both LSM-Trees and B-trees contain key-value pairs sorted by key, the strategy for separating data is completely different. The Log-structured indexes can be of variable size. They can also be quite large, spanning into multiple megabytes. B-tree indexes, by contrast, are always fixed blocks of typically 4 KB or more. Since disk pages are also aligned in fixed blocks, the B-tree is better suited for sorting and retrieving data on disk instead of in memory.
B-trees are structured in layers so that the lower levels represent the range of values in between two values in the parent layer. As we saw earlier, in-memory indexes suffer from poor performance in range queries. B-trees do not.
The branching factor determines how wide each layer of the tree is before the range must be split down to a lower layer. For conventional databases with these 4 KB pages, you could have a branching factor of only 500 and only need to traverse down 4 levels before you have stored 256 terabytes of data. That’s a flat and efficient tree!
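The back-of-the-envelope math behind that claim:

```python
# Branching factor 500, four levels of 4 KB pages.
branching_factor = 500
levels = 4
page_size = 4 * 1024  # bytes

leaf_pages = branching_factor ** levels   # 500^4 = 62.5 billion pages
capacity_bytes = leaf_pages * page_size
print(capacity_bytes / 1e12)              # prints 256.0 (terabytes)
```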
In terms of system design, how does this compare in performance to the in-memory solutions?
It is also scalable because the partitions are small, 4 KB pages.
The data structure inherently supports range queries as a result of the branching factor.
Fault tolerance is achieved through a write-ahead log (WAL), which records data changes before they are written to the B-tree (hence, write-ahead).
Locks on the B-tree called latches can be introduced to preserve consistency in an ACID-compliant manner, achieving a higher level of consistency than with a log-structured index.
When comparing log-structured against B-tree indexes, the main rule of thumb is to consider performance on reads or writes. Though performance testing will confirm this for your implementation, log-structured indexes are thought to be better for writes. Conversely, B-trees are considered better for reads. This is because LSM-Trees frequently write, segment, and compact their SSTables to create efficiency.
On the other hand, B-trees offer a more reliable indexing scheme because the keys are only written once and do not suffer from a high level of writes due to compaction. The pages are separated into small, predictable pages and do not require sizable computation for write compaction and segmentation. This is particularly useful for transactional databases when ACID guarantees are a must. That would explain why B-trees have stood the test of time for so long.
Other indexes
The book offers a few more indexing suggestions, such as secondary indexes for joins, clustered indexes where the value is a heap file rather than the raw row data, multi-column indexes like R-trees, and fuzzy-search indexes like those we talk about with ElasticSearch and Solr.
Then there are indexes colocated with the whole database in memory, as is the case with Memcached or Redis. Now the entire thing, both the index and the raw data, is stored in memory. You only need a hard disk for fault tolerance and durability. Further, in-memory databases allow for more flexibility in the data structures they utilize. It is not uncommon to interface with a priority queue or set with something like Redis.
🙏 That’s all folks 🙏
Aaaand we’re done with issue 134 of the User Interfacing Newsletter.
If you got something out of this newsletter, feel free to forward it to your friends, family, and/or coworkers to help it grow.
Interested in sponsoring the newsletter? Have questions, comments, or corrections? Hit that ↩️ reply button and let me know what you're up to!