Key Database Compaction Strategies Used in Distributed Systems

Balachandar Paulraj
Sep 4, 2023
DATA COMPACTION STRATEGIES

In distributed database systems, compaction plays a central role in managing data storage effectively. As storage engines have evolved, a number of newer compaction strategies have emerged, each tailored to a specific database technology and its demands. Recent additions include Incremental compaction in ScyllaDB, file compaction (OPTIMIZE) in Delta Lake, and TimeWindow compaction in Cassandra.

However, this post focuses on the two compaction strategies that have stood the test of time: size-tiered and leveled compaction. Both have been central to storage optimization in log-structured databases for many years. The following sections describe how each one works and compare them on data distribution, compaction frequency, and space usage.

Let’s explore how these long-established compaction strategies continue to shape the landscape of distributed databases, ensuring data remains organized, accessible, and efficiently stored.

LEVELED COMPACTION:

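Leveled compaction organizes data files into multiple levels, typically four to ten, with each level sized larger than the one before it (commonly by a factor of about ten). When a level exceeds its target size, compaction merges data from that level into the next one, which can in turn overflow and trigger further compactions, creating a cascade effect.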
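
The cascade described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: the class name `LeveledStore` and the parameters `base_capacity`, `fanout`, and `max_levels` are hypothetical, each level is held as a single in-memory dict, and a full level is merged wholesale into the next one, whereas real engines typically move one file at a time.

```python
from typing import Dict, List, Optional


class LeveledStore:
    """Toy leveled store: each level is one run, modeled as a dict."""

    def __init__(self, base_capacity: int = 2, fanout: int = 10,
                 max_levels: int = 4) -> None:
        self.base_capacity = base_capacity  # max entries in level 0 (assumption)
        self.fanout = fanout                # size ratio between adjacent levels
        self.levels: List[Dict[str, str]] = [{} for _ in range(max_levels)]
        self.compactions = 0                # number of level-to-level merges

    def capacity(self, level: int) -> int:
        """Target size of a level: grows geometrically with depth."""
        return self.base_capacity * self.fanout ** level

    def put(self, key: str, value: str) -> None:
        """Write into level 0, then compact if that level overflowed."""
        self.levels[0][key] = value
        self._compact_from(0)

    def _compact_from(self, level: int) -> None:
        """Merge overflowing levels downward; one merge may trigger the next."""
        last = len(self.levels) - 1
        while level < last and len(self.levels[level]) > self.capacity(level):
            # Entries from the upper level are newer, so they overwrite
            # older entries with the same key in the lower level.
            self.levels[level + 1].update(self.levels[level])
            self.levels[level] = {}
            self.compactions += 1
            level += 1

    def get(self, key: str) -> Optional[str]:
        """Search from the newest level to the oldest."""
        for run in self.levels:
            if key in run:
                return run[key]
        return None
```

With `base_capacity=2` and `fanout=2`, writing six distinct keys overflows level 0 twice, and the second overflow also pushes level 1 past its target, so the data cascades into level 2.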

  1. How is data distributed?: Within each level (beyond the first), files cover non-overlapping key ranges, so data is spread evenly across files and a read needs to check at most one file per level, which improves read efficiency.
  2. How often does compaction run?: To keep read performance consistent, smaller and more frequent compactions are triggered at each level. Because each compaction rewrites the overlapping data already stored in the next level, leveled compaction typically results in higher write amplification than size-tiered compaction.
  3. How much space does it use?: It is generally more space efficient than size-tiered compaction, since obsolete versions of a key are merged away sooner and each compaction touches only a small set of files, so little temporary space is needed.
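
To see why leveled layouts help reads, note that engines using this strategy generally keep the files within a level (other than the first) non-overlapping in key range, so a lookup needs to consider at most one file per level. The sketch below illustrates that idea; the `KeyRange` tuple and both function names are hypothetical and not taken from any particular database.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

KeyRange = Tuple[str, str]  # (min_key, max_key), both inclusive


def is_non_overlapping(level: List[KeyRange]) -> bool:
    """True if no two files in the level share any part of the key space."""
    ordered = sorted(level)
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))


def candidate_file(level: List[KeyRange], key: str) -> Optional[KeyRange]:
    """Return the single file that may contain `key`, or None.

    Assumes the level is non-overlapping, so a binary search on the
    files' minimum keys is enough to locate the only possible match.
    """
    ordered = sorted(level)
    starts = [f[0] for f in ordered]
    idx = bisect_right(starts, key) - 1  # last file starting at or before key
    if idx >= 0 and ordered[idx][0] <= key <= ordered[idx][1]:
        return ordered[idx]
    return None
```

For a level holding the ranges `("a", "f")`, `("g", "m")`, and `("n", "z")`, a lookup for `"h"` only has to open the `("g", "m")` file.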

SIZE-TIERED COMPACTION:

Size-tiered compaction groups data files into “tiers” based on their size. When a tier accumulates enough similarly sized files, compaction is triggered to merge those files into a single larger file, which then belongs to the next tier.

  1. How is data distributed?: It can leave a few data files significantly larger than the rest, and files may overlap in key range, so a read may have to consult several files, which can hurt read performance.
  2. How often does compaction run?: Larger, less frequent compactions are triggered when a tier collects enough files. Because data is rewritten only when its tier is merged, size-tiered compaction typically has lower write amplification than leveled compaction, although each individual compaction is heavier since it merges multiple files at once.
  3. How much space does it use?: It is less space efficient than leveled compaction, because overlapping files can hold several obsolete versions of the same data until they are merged. In the worst case it needs about twice the space of the data, since the merged output must be written to temporary space before the input files can be deleted.
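
The tiering logic can be sketched as follows. This is a simplified illustration loosely modeled on the size-bucketing idea behind Cassandra's size-tiered strategy; the function names are hypothetical, and the default values for `bucket_low`, `bucket_high`, and `min_threshold` are used here as illustrative assumptions rather than a faithful reimplementation.

```python
from typing import List


def bucket_by_size(sizes: List[int], bucket_low: float = 0.5,
                   bucket_high: float = 1.5) -> List[List[int]]:
    """Group file sizes into tiers of similarly sized files.

    A file joins the first tier whose average size it is close to,
    otherwise it starts a new tier.
    """
    buckets: List[List[int]] = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def pick_compaction(sizes: List[int], min_threshold: int = 4) -> List[int]:
    """Return the first tier with enough files to merge, or an empty list."""
    for bucket in bucket_by_size(sizes):
        if len(bucket) >= min_threshold:
            return bucket
    return []


def temp_space_needed(bucket: List[int]) -> int:
    """Worst-case extra space for a merge: the output may be as large as
    all inputs combined, and inputs are deleted only after it is written."""
    return sum(bucket)
```

Given file sizes of 10, 11, 9, 12, 100, 105, and 400 (in any unit), the four small files form one tier and are selected for compaction, which needs up to 42 units of temporary space, while the larger files wait until their own tiers fill up.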

CONCLUSION:

To recap, leveled compaction is designed for situations where consistent, predictable read performance is essential. It suits read-heavy workloads, including those with frequent updates, provided the system can absorb the additional write amplification.

Conversely, size-tiered compaction excels in write-heavy scenarios where sustaining high write throughput matters more than read latency or storage overhead. However, it may introduce fluctuations in read performance because of overlapping and very large data files, and it requires extra free disk space during compaction. The choice between the two approaches depends on the specific requirements and characteristics of the database workload.
