LSM Tree Architecture

Problem

Over the past decade the world has become increasingly data driven. The amount of data generated and consumed has exploded to the point that traditional databases are no longer the silver bullet for every use case they were once thought to be.

Modern systems need storage solutions that can absorb large data-ingestion workloads while still maintaining reasonable read performance across varied read loads.

Limitations in traditional storage engines

Storage engines in traditional database solutions are largely architected around B+ Trees. These provide balanced read and write performance, but as data grows they start encountering performance challenges, especially under high write workloads.

Traditional B+ Tree based storage engines operate with the following limitations:

- Updates are applied in place, so heavy write workloads translate into a large amount of random disk I/O.
- Inserts trigger page splits and tree rebalancing, causing write amplification.
- Write throughput degrades as the index grows and fewer pages fit in memory.

Solution: LSM (Log-Structured Merge) Tree Architecture

With advancements in storage technologies, storage devices have become far more efficient and performant than the earlier spinning-disk designs. With the development of SSDs and, more recently, NVDIMMs, we even have byte-level addressing capabilities. However, this alone is still not enough to handle the kind of write throughput we require.

One viable solution is to write to memory first and persist the data to durable storage later. This is where LSM Tree based architectures come into the picture.

Who uses it?

LSM Tree based database storage engines have proven themselves under tremendous workloads, and we see many database solutions built on this architecture. Some of the more prominent ones are:

- Apache Cassandra
- Apache HBase
- Google Bigtable
- LevelDB
- RocksDB
- ScyllaDB

How does it work?

Storage engines built on LSM Trees primarily follow two basic design ideas:

- Writes are buffered in memory and flushed to disk in large, sequential, append-only batches instead of being updated in place.
- On-disk data is kept in sorted, immutable files that are periodically merged (compacted) in the background.

An LSM Tree is not a single data structure; instead it combines multiple data structures residing in different storage tiers of a machine to efficiently handle read and write workloads.

Dissecting the term Log-Structured Merge Tree: log-structured indicates that data is laid out one record after another, similar to an append-only log. Merge indicates the techniques used to combine multiple data structures to better manage the data (more on this later). Lastly, tree indicates the storage hierarchy of RAM and persistent storage used in the architecture.

To better understand how a database storage engine built using LSM Trees works, we first need to understand the key components of the architecture.

Components

Memtable

In an LSM Tree architecture, writes are first buffered in an in-memory structure called a Memtable, which provides fast random reads and sorted iteration over the data. The actual data structure used varies across database implementations: it could be something as simple as a key-value map, or a more sophisticated structure like a balanced binary tree (an AVL Tree or a Red-Black Tree) or a Skip List.

Data in the Memtable is flushed to persistent storage when the size of the Memtable reaches a certain threshold or when a certain amount of time has passed since the last flush. When flushing, the entire contents of the Memtable are written to SSTable files on persistent storage and the Memtable is emptied to accommodate further writes. The Memtable effectively operates as a write-back cache.
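
To make this concrete, below is a minimal Memtable sketch in Python. It is purely illustrative: the class name, the `bisect`-maintained key list and the flush threshold are assumptions for the example, not any particular engine's design.

```python
import bisect

class Memtable:
    """Minimal in-memory write buffer with sorted iteration (illustrative only)."""

    def __init__(self, flush_threshold=1024):
        self.keys = []       # keys kept in sorted order
        self.values = {}     # key -> latest value
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)  # maintain sorted order on insert
        self.values[key] = value           # updates simply overwrite in memory

    def get(self, key):
        return self.values.get(key)

    def is_full(self):
        return len(self.keys) >= self.flush_threshold

    def sorted_items(self):
        # Sorted iteration is what makes flushing to an SSTable a sequential write.
        return [(k, self.values[k]) for k in self.keys]
```

A real engine would typically use a Skip List or a balanced tree here so that concurrent readers and writers can operate on the structure efficiently.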

Write-Ahead Logs (WALs)

Since data is kept in memory in Memtables before it is flushed to persistent storage, any hardware failure would result in the loss of that data. To tolerate such faults, every write is also simultaneously appended to a persistent log file. After a failure, the database can replay the write instructions from the WAL files and re-create the Memtable.

A write is only considered successful after it has been persisted in the WAL file. Since WAL files exist only in persistent storage and do not maintain any in-memory buffer, writing to them is slower than writing to Memtables, effectively making WAL writes the actual bottleneck for writes in LSM Tree based architectures.

To keep WAL writes fast, records are only ever appended to the file at the last written byte offset, avoiding random seeks.
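
As a rough sketch of this idea, a WAL can be as simple as an append-and-replay file. The tab-separated text format below is an assumption made for readability; real engines use binary records with checksums and batched fsyncs.

```python
import os

class WriteAheadLog:
    """Append-only durability log (simplified sketch, no checksums or batching)."""

    def __init__(self, path):
        self.file = open(path, "a+", encoding="utf-8")

    def append(self, key, value):
        self.file.write(f"{key}\t{value}\n")  # strictly append at the end
        self.file.flush()
        os.fsync(self.file.fileno())          # durable before the write is acknowledged

    def replay(self):
        # After a crash, re-apply every logged write to rebuild the Memtable.
        self.file.seek(0)
        for line in self.file:
            key, value = line.rstrip("\n").split("\t", 1)
            yield key, value
```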

SSTable

An SSTable (Sorted String Table) is a file format which stores the key-value entries from a Memtable in sorted order, allowing quick random reads and scans in logarithmic time.

In LSM Tree based architectures, there can be multiple levels of SSTables, L0, L1, …, LN, where L0 is the first level to which Memtables are flushed. Each level can have multiple SSTable files partitioned by the keys they store. In some implementations, the SSTables at each level are also augmented with a separate sparse index to assist in faster scans.
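
A minimal sketch of an SSTable writer and point lookup is shown below. The tab-separated layout and the `index_every` parameter are illustrative assumptions; production SSTables use binary blocks, compression and checksums.

```python
import bisect

def write_sstable(path, sorted_items, index_every=16):
    """Write sorted (key, value) pairs to a file, keeping a sparse index
    entry (key, byte offset) for every `index_every`-th record."""
    sparse_index = []
    with open(path, "w", encoding="utf-8") as f:
        for i, (key, value) in enumerate(sorted_items):
            if i % index_every == 0:
                sparse_index.append((key, f.tell()))
            f.write(f"{key}\t{value}\n")
    return sparse_index

def sstable_get(path, sparse_index, key):
    """Binary-search the sparse index, seek to the nearest offset, scan forward."""
    pos = bisect.bisect_right([k for k, _ in sparse_index], key) - 1
    if pos < 0:
        return None  # key sorts before the smallest key in the file
    with open(path, "r", encoding="utf-8") as f:
        f.seek(sparse_index[pos][1])
        for line in f:
            k, v = line.rstrip("\n").split("\t", 1)
            if k == key:
                return v
            if k > key:   # keys are sorted, so we can stop early
                return None
    return None
```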

A background compactor job moves records in SSTables from newer levels to fresh SSTables in older levels (more on this later). SSTables are immutable once created.

Since the LSM Tree architecture does not update records in place and instead operates in an append-only fashion, an update to a record may still sit in a newer (lower) level SSTable while its original write or a previous update has since been moved to an older (higher) level. Hence, a key can exist in SSTables at multiple levels.

Bloom Filter

Many implementations of the LSM Tree architecture also maintain a Bloom Filter for every SSTable to determine if a key does not exist in the SSTable.

A Bloom Filter is a probabilistic data structure which, when queried to check whether a certain value has been stored in it, can answer in constant time that the value either definitely does not exist or may exist; i.e. it can give false positives but never false negatives.

The probability of false positives can be controlled by configuring the size of the filter and the number and quality of the hash functions used.

Another big advantage of using a Bloom Filter is that its size does not increase as more values are stored in it (i.e. it maintains constant space complexity) although the probability of false positives increases with more values.
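
Here is a minimal Bloom Filter sketch. The filter size and the salted SHA-256 hashing are illustrative choices; real implementations favor fast non-cryptographic hashes such as MurmurHash or xxHash.

```python
import hashlib

class BloomFilter:
    """Answers 'definitely absent' or 'maybe present' in constant time."""

    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely not stored; True means it *may* have been stored.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```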

Operations

We are now ready to look at the basic operations of an LSM Tree based database and see how the above components come together.

Writes (Inserts, Updates and Deletes)

Writes to the database are simultaneously written to both a WAL file and the Memtable. In case of a hardware failure, the writes in WAL are replayed from the last flushed write’s offset in the file to re-create the Memtable.

Writes in the Memtable are flushed to an L0 SSTable file once the Memtable becomes large enough or when a certain time duration has passed since the last flush.

Deletes are also processed in an append-only manner: instead of deleting a record in place, a tombstone entry for the key is written to the Memtable. A tombstone entry is a special record declaring that the key has been deleted.
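
Putting the pieces together, the write path looks roughly like the sketch below. The `db` object, the `flush_to_l0` helper and the `TOMBSTONE` sentinel are hypothetical names used only for illustration.

```python
TOMBSTONE = "<tombstone>"  # special value marking a key as deleted

def put(db, key, value):
    db.wal.append(key, value)    # durability first: the write survives a crash
    db.memtable.put(key, value)
    if db.memtable.is_full():
        flush_to_l0(db)          # hypothetical: writes sorted_items() to an L0 SSTable

def delete(db, key):
    # A delete is just a write of a tombstone record for the key.
    put(db, key, TOMBSTONE)
```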

Compaction

A background compactor job runs to perform compaction across SSTables in different levels. When SSTables in two neighboring levels are being compacted, the following actions are performed:

- Records from the SSTables in both levels are merged in sorted order.
- When the same key appears in both levels, only the newer version is kept.
- Tombstone entries, along with the records they shadow, are discarded once they reach the last level.
- The merged records are written to fresh SSTables in the older level, and the source SSTables are deleted.

Most LSM Tree implementations use the k-way merge algorithm to merge SSTables, as sketched below.
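
This sketch merges two sorted runs with Python's `heapq.merge`, keeping the newer version of duplicated keys. It assumes unique keys within each run, reuses the hypothetical `TOMBSTONE` sentinel from the write-path sketch, and assumes the output level is the last one, where tombstones can finally be dropped.

```python
import heapq

def compact(newer_items, older_items):
    """Merge two sorted runs of (key, value) pairs into one compacted run."""
    # Tag each record with its level so that, for equal keys, the entry
    # from the newer level (tag 0) sorts first and wins.
    streams = [
        ((k, 0, v) for k, v in newer_items),
        ((k, 1, v) for k, v in older_items),
    ]
    merged, last_key = [], None
    for key, _, value in heapq.merge(*streams):
        if key == last_key:
            continue                 # older duplicate of a key we already emitted
        last_key = key
        if value != TOMBSTONE:       # deleted keys are dropped at the last level
            merged.append((key, value))
    return merged
```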

Reads

When reading a key, the Memtable is searched first. If the key is not found there, the L0 SSTables are consulted next. Since most LSM Tree implementations pair each SSTable with a Bloom Filter, an SSTable is only scanned for the key if its Bloom Filter does not rule out the key's existence. And since SSTables are sorted, the key can be located using binary search.

If the Bloom Filter says the key is not in the SSTable, or the key is not found after searching the SSTable, we move on to the older (higher) levels until we either find the key (as a record or as a tombstone entry) or conclude that the key is not in the database because it exists at no level.
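
Tying the components together, a point lookup can be sketched as follows. The `db.levels`, `tables_covering`, `table.bloom` and `table.get` names are hypothetical glue for illustration.

```python
def get(db, key):
    """Read path: Memtable first, then SSTable levels from newest to oldest."""
    value = db.memtable.get(key)
    if value is not None:
        return None if value == TOMBSTONE else value
    for level in db.levels:                        # L0, L1, ..., LN
        for table in level.tables_covering(key):   # tables whose key range covers the key
            if not table.bloom.might_contain(key):
                continue                           # definitely not in this file, skip it
            value = table.get(key)                 # sparse-index / binary search
            if value is not None:
                return None if value == TOMBSTONE else value
    return None                                    # key exists at no level
```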

Since the architecture requires searching across multiple data structures and files, it leads to Read Amplification and makes LSM Tree architectures less performant for reads than B+ Tree based database solutions.

Conclusion

LSM Tree architectures are superior at handling high write workloads compared to B+ Trees, but suffer from relatively slower reads.

The LSM Tree architecture is also less space efficient than B+ Tree architectures, since multiple versions of a record can exist across levels and files. Additionally, the compaction process is very compute, I/O and space intensive.

In short, B+ Tree based solutions suffer from Write Amplification, while LSM Tree based solutions suffer from Read and Space Amplification.
