How to design a read-heavy distributed key-value store?

Nikasakana
Dec 7, 2022 · 10 min read


I bet you have heard or read about the ever-increasing popularity of NoSQL databases, especially distributed key-value stores such as Cassandra and DynamoDB. Maybe you have seen headlines like:

“Discord migrated to using Cassandra as its primary chat store”,
“Amazon’s internal services mostly use DynamoDB”,
“Facebook is enhancing its usage of Cassandra even more”

But have you ever asked why these big companies are all using the same technologies? Or have you ever wondered how these types of storage systems are designed internally? What are the architectural decisions and mechanisms that drive them? Well, I for sure have… a lot. This article was created as a result of that curiosity. It represents my journey to understanding the core concepts of designing a distributed key-value store.

Note: This article is by no means a thorough system design solution for a distributed key-value store. In the best case (for you), it will help you understand new system design topics or clarify already familiar ones. In the worst case (again, for you), you will notice some inconsistencies or design errors. In the latter case I encourage you to ping me in the comments, tell me how it can be improved or, even better, give me some resources to fill the gaps in my knowledge and turn your worst-case scenario into my best-case one.

About system design strategy

Fig. 1) Strategy diagram for iterative system design

Instead of jumping straight to a “correct” and overly complex system design for the problem, this article focuses on the thought process of arriving at various design solutions and on analysing their pros and cons. We’ll start from the simplest possible solution and gradually improve it into what is hopefully the design of a fully operational, modern distributed key-value store.

From now on I’ll call this kind of design process “iterative system design”. You can see the strategy of iterative design in “Fig. 1” above.

Requirements

Before diving right into the design process, let’s analyse the requirements. For those of you who aren’t familiar with the term “system design requirements”, it is simply the set of technical and non-technical constraints that the system should operate under (e.g. “the system should be up and running 24/7”, “write latency should be <10ms”, etc.).
Strictly following the requirements is one of the most critical parts of designing any kind of system, since missing any one of them might make the system totally unusable.

Functional requirements (as a user, I should be able to):

  • put a key:value pair into the storage.
  • get the value associated with a specified key.

Non-functional requirements (constraints not related to business logic):

  • Able to store huge amounts of data (to avoid committing to concrete numbers, let’s assume that the data won’t fit on a single machine).
  • Highly reliable (there should be no single point of failure in any part of the system, and data should be reliably stored after a successful insertion).
  • Read scalable (able to accommodate high traffic spikes and sudden increases in reads).
  • Highly available (minimal downtime and no single point of failure).
  • Eventually consistent (a read issued right after a write might not return the up-to-date value; this behaviour might be present for a small but non-deterministic amount of time).

We could also dive into the details of low latency and fault tolerance for this system, but that would add a significant amount of cognitive load (by having to sync data between memory and disk efficiently), so let’s postpone the discussion of those topics.

Simple in-memory key-value store

Designing a key-value store that resides on a single server, fully in memory, is straightforward and simple. We could have a single OS process with an in-memory dictionary and store all the data associations there. The main benefits of this kind of system are that: 1) it’s simple to design and 2) it’s fast. As we see, there are some pros to this kind of key-value store, so why can’t we take it as a good-enough solution?
The main reason is that it doesn’t satisfy the non-functional requirements we defined. Specifically, it is unreliable, unscalable and not highly available.
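
To make this concrete, here is a minimal Python sketch of such a single-process, in-memory store (the class and method names are just my illustration, not a reference implementation):

```python
class InMemoryKeyValueStore:
    """A single-process key-value store backed by a plain dictionary."""

    def __init__(self):
        self._data = {}  # all state lives in this process's memory

    def put(self, key, value):
        # Overwrites any previous value stored under the key.
        self._data[key] = value

    def get(self, key):
        # Returns None when the key is absent (a real API might raise instead).
        return self._data.get(key)


store = InMemoryKeyValueStore()
store.put("user:42", "profile-data")
print(store.get("user:42"))  # -> profile-data
```

A dozen lines already cover both functional requirements (put and get), which is exactly why the non-functional requirements are where the real design work starts.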

From the requirements listed above, we see that this solution satisfies all the functional requirements but doesn’t comply with any of the non-functional ones, which means we should improve it. To do so, let’s analyse the concrete bottlenecks and problems of the current solution:

  • Problem #1: Not reliable. If the machine permanently dies, we lose our data.
  • Problem #2: Not highly available. If the machine dies temporarily, the key-value store is not accessible to clients.
  • Problem #3: Not scalable. If traffic suddenly increases, we hit hard limits on the scalability of our application (constrained by the resources of a single OS process and machine).

All of these problems are mainly caused by an intrinsic property of our first, in-memory solution: it has a single point of failure. An intuitive solution is to not have a single point of failure, and instead have at least N operational nodes communicating over a network, which allows us to implement fail-over mechanisms. This exact mechanism, ladies and gentlemen, is what computer science calls a “distributed system”.

CAP theorem (review)

CAP theorem visualisation

Before we dive into the different kinds of improvements we can introduce in a distributed key-value store, let’s first understand one of the most important theoretical aspects of distributed systems, called the CAP theorem.

The CAP theorem revolves around three potential properties of distributed systems:

  • C — Consistency
  • A — Availability
  • P — (Network) Partition Tolerance

The whole idea is that when a network partition occurs, only one of consistency or availability can be preserved. Since distributed systems must, by their nature, tolerate network partitions, for each and every modern system we need to choose whether to design an AP system (a highly available distributed system) or a CP system (a strongly consistent distributed system).

From the non-functional requirements listed above, we see that our system needs to be highly available, so we’ll trade off strong consistency and aim for an eventually consistent system with high availability.

For those of you interested in the CAP theorem in depth, I recommend reading the blog from Hazelcast and a blog post from Julian Browne. These resources will give you more than enough understanding of the CAP theorem’s internals.

Data partitioning

Data partitioning diagram

For huge applications, like large-scale key-value stores, it’s infeasible to store all the data on one server. To understand this more concretely, imagine that our application uses a relational database and that some of our tables grow so large (contain so many rows) that queries eventually become extremely slow or, even worse, we simply run out of disk space to store the ever-increasing amount of data (yes, these kinds of problems exist in huge applications).

The natural and straightforward solution to this problem is to partition the data into smaller chunks and distribute them across many machines. For this to work, we’ll need to create the same schema/structure on many different machines and simply add application-level logic that follows the next two rules:

  • evenly distribute data between the existing machines.
  • in case of machine failures or the introduction of new machine instances into the cluster (during traffic spikes), adapt to the new set of machines and redistribute the data evenly.

Even though this idea sounds straightforward, it’s extremely hard to implement correctly. There are tons of challenges in evenly partitioning different types of data across geographically distributed machines. We won’t dive too deep into the internal details of different data partitioning mechanisms, but one of the most well-known implementations of this idea is called “consistent hashing”. It seems to be one of the most reliable and interesting mechanisms for data partitioning, as described here.
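
To give a rough feel for the idea, here is a toy sketch of consistent hashing (my own illustration, not how Cassandra or DynamoDB actually implement it; real systems add virtual nodes, replication and careful rebalancing). Both nodes and keys are hashed onto a ring, and each key is owned by the first node clockwise from its position:

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    # Any stable hash works; md5 is used here only for the illustration.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes):
        # Sorted list of (position, node) pairs representing the ring.
        self._ring = sorted((ring_position(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's position, wrapping around the ring.
        index = bisect.bisect(self._positions, ring_position(key)) % len(self._ring)
        return self._ring[index][1]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # which node owns this key
```

The appeal is that adding or removing a node only moves the keys adjacent to it on the ring, instead of reshuffling almost everything the way a naive hash(key) % N scheme would.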

For those interested in exploring this topic further, I’d suggest reading Toptal’s blog about consistent hashing and DigitalOcean’s guide to sharding.

Data replication

Data replication in a consensus/quorum-based system like Raft

To achieve reliability and high availability, we need to get rid of single points of failure (SPOF) in each distinct part of our system. One of the most famous solutions to the SPOF problem is called “data replication”. The high-level idea of replication is to keep N copies of the same data on geographically distributed servers. This helps us avoid losing data in case of permanent machine failures and avoid an unavailable system in case of temporary machine failures, since in both cases we’ll have fail-over nodes up and running.

Diving a bit deeper into the actual mechanism, there are many variations of data replication, such as active-passive, active-active and so on.

Active-active setups are the most complex and carry a huge cognitive load (since they target scalability of writes as well as reads), so for simplicity let’s assume that we have a “read-heavy” system (and we don’t care about scaling writes). Active-passive replication, where a single leader node handles all the writes and N follower nodes handle reads, is totally OK for us.

One of the most popular implementations of active-passive replication is the Raft consensus protocol. The high-level idea of Raft is to have one leader node and N follower nodes.

  • All writes are handled by the leader node only. Each write command is considered successful if and only if the leader gets acknowledgements that a majority of the cluster has safely replicated the new value.
  • Reads can be handled by any of the N follower nodes, meaning we can run a dynamic number of followers depending on the load on the system. One quite interesting side effect of replication is that we can place replicas geographically closer to the clients that issue lots of read requests, thus decreasing their actual latency.

The one tradeoff of this kind of replication system is that you need a majority of your nodes to be up and running for the whole system to be considered up and running. So if you have a total of N nodes, at least N/2+1 of them should be up. The reason, as we already mentioned, is that for each write to be considered successful, a majority of nodes need to replicate it. According to this rule, if no majority of nodes is present, all writes fail automatically.
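
Here is a heavily simplified sketch of that quorum rule (my own illustration of the counting, not the actual Raft protocol, which also involves logs, terms and leader election). A write succeeds only when a majority of the cluster, leader included, has stored the value:

```python
class Follower:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def replicate(self, key, value):
        # A real follower would append to a replicated log and ack over the network.
        if not self.healthy:
            return False
        self.data[key] = value
        return True


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def put(self, key, value):
        cluster_size = len(self.followers) + 1  # followers plus the leader itself
        majority = cluster_size // 2 + 1        # the N/2 + 1 rule
        self.data[key] = value                  # the leader stores the value first
        acks = 1 + sum(f.replicate(key, value) for f in self.followers)
        return "OK" if acks >= majority else "FAILED"


followers = [Follower("f1"), Follower("f2", healthy=False), Follower("f3")]
leader = Leader(followers)
print(leader.put("user:42", "profile-data"))  # "OK": 3 of 4 nodes stored the value
```

With three of the four nodes healthy, the write above is acknowledged; take one more follower down and the same put would return "FAILED".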

For those of you interested in diving into much more detail on the Raft consensus algorithm, or on data replication in general, I totally recommend the GitHub doc for the Raft protocol and Hussein Nasser’s video about replication.

Consistency guarantees

Before we dive into the topic of consistency for a key-value store that uses the replication mechanism described above, I’d recommend you watch this excellent YouTube video. It will give you a good understanding of the idea of consistency in modern IT systems (and the opening joke is pure gold).

Data replication, described in the previous section, gives us high availability, reliability and other huge benefits, but it introduces potential inconsistencies between replicas. In down-to-earth terms, inconsistency is the property of a system whereby a read query issued after preceding write queries is not guaranteed to return an up-to-date view of the state.

To better understand this problem, let’s go through a concrete example of an interaction with an inconsistent, distributed key-value store that uses the replication protocol:
*Commands are ordered chronologically from top to bottom
*See the diagram below for a better visualisation of the flow

  • (time frame 1) The client issues a write(K:V).
  • (time frame 1) The leader of the cluster gets the request, persists it locally and asks the followers to replicate it.
  • (time frame 1) A majority of the cluster’s followers successfully replicate the data, which lets the leader return an “OK” status code to the client, saying that the write was successful.
  • (time frame 2) The client issues a read request read(K), which lands on a follower that hasn’t yet been synced with the rest of the cluster.
  • (time frame 2) The client gets stale data (or no data at all) as a result of read(K).

Diagram visualising the eventual consistency described above

To sum it up: the user wrote a K:V pair into the system (at timestamp 1), then tried to read the value V for the key K (at timestamp 2), and the value read at timestamp 2 wasn’t equal to the value written at timestamp 1. From the client’s perspective this looks like an inconsistency.
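
The same timeline can be replayed with a deliberately tiny simulation (plain dictionaries standing in for replicas; purely illustrative, with no real networking or replication):

```python
# The leader and one follower apply the write immediately,
# while a second follower lags behind.
leader = {}
fast_follower = {}
slow_follower = {}  # hasn't applied the latest entries yet

# time frame 1: write(K:V) is acknowledged by a majority (leader + fast follower)
leader["K"] = "V"
fast_follower["K"] = "V"
print("write(K:V): OK")

# time frame 2: the read lands on the lagging follower and sees stale data
print("read(K) from the lagging follower:", slow_follower.get("K"))  # -> None

# ...a moment later the follower catches up and all replicas converge
slow_follower.update(leader)
print("read(K) after catch-up:", slow_follower.get("K"))  # -> V
```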

“That is a huge problem!” you might say.
“However sad that might sound, for systems that need high availability this property is a must; that is exactly what the CAP theorem tells us,” I’ll reply.

But don’t get me wrong: the inconsistency mentioned above is not permanent, only temporary. Eventually the state will settle down and all the replicas will hold the up-to-date value, but before that there is a short window during which these kinds of temporary inconsistencies can be observed.

This, my friends, is the one and only “eventual consistency”.

For a further deep dive into this topic, I’d suggest the following resources: Martin Kleppmann’s video explanation of eventual consistency and Google Cloud’s blog about consistency in their storage systems.

Thanks for reading along ;)
