Do these simple compression techniques play a major role in columnar databases?

To store data efficiently, columnar databases apply many compression techniques. Compression helps to 1) reduce storage requirements, 2) reduce disk I/O, and 3) improve query performance. Let us take a look at the details of the encoding schemes along with examples.

  1. Run-length Encoding : This encoding works best on a column that is sorted by value, so that equal values sit in consecutive rows. For example, a daily ETL job inserts thousands of records with the same date value (like 2021-Jan-01). Instead of storing the value for every record, run-length encoding stores mapping metadata such as “2021-Jan-01”; “200 to 760” (the value plus its starting and ending row numbers).
  2. LZO (Lempel–Ziv–Oberhumer) : A lossless data compression algorithm that works especially well for CHAR and VARCHAR columns that store very long character strings.
  3. Delta Encoding : Delta encoding stores, as the column’s value, either 1) the offset from the average value or 2) the difference from the previous value. This saves a lot of space when the column chosen for compression is a bigint, decimal or double. For example, imagine a table whose column values range from 1,000,000 (1 million) to 10,000,000 (10 million). Instead of storing each value as is, delta encoding stores its difference from the average value, which is usually a much smaller number.
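To make the run-length idea from item 1 concrete, here is a minimal Python sketch. It assumes the column is already sorted and held in memory as a list; the function names `rle_encode` and `rle_decode` and the sample dates are made up for illustration and do not reflect how any particular database implements the encoding.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, first_row, last_row) runs."""
    runs = []
    row = 0
    for value, group in groupby(column):
        length = sum(1 for _ in group)  # number of rows in this run
        runs.append((value, row, row + length - 1))
        row += length
    return runs


def rle_decode(runs):
    """Rebuild the original column from the (value, first_row, last_row) runs."""
    column = []
    for value, first_row, last_row in runs:
        column.extend([value] * (last_row - first_row + 1))
    return column


# Hypothetical sorted date column: 10 rows, but only 3 distinct runs.
dates = ["2020-Dec-31"] * 2 + ["2021-Jan-01"] * 5 + ["2021-Jan-02"] * 3
runs = rle_encode(dates)
print(runs)
# [('2020-Dec-31', 0, 1), ('2021-Jan-01', 2, 6), ('2021-Jan-02', 7, 9)]
```

Ten stored values shrink to three runs here. On an unsorted column the same values would be scattered, producing many short runs and little or no saving, which is why sorting matters for this encoding.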

For delta encoding, let us assume we have numbers like 98, 100, 102, etc. The average of these numbers is 100, so they are stored as -2, 0, +2 instead of the original values.
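Both delta variants can be sketched in a few lines of Python. This is only an illustration under simple assumptions: integer values held in a list, and hypothetical helper names (`delta_from_reference`, `delta_from_previous`, `restore_from_previous`). Note that the reference value (or the first value) has to be stored alongside the deltas so the original numbers can be rebuilt.

```python
from itertools import accumulate


def delta_from_reference(values, reference):
    """Variant 1: store each value as its offset from a reference such as the average."""
    return [v - reference for v in values]


def delta_from_previous(values):
    """Variant 2: keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def restore_from_previous(deltas):
    """Undo variant 2 with a running sum."""
    return list(accumulate(deltas))


values = [98, 100, 102]
reference = sum(values) // len(values)  # integer average: 100

offsets = delta_from_reference(values, reference)
print(offsets)                        # [-2, 0, 2]
print([reference + o for o in offsets])  # [98, 100, 102]

steps = delta_from_previous(values)
print(steps)                          # [98, 2, 2]
print(restore_from_previous(steps))   # [98, 100, 102]
```

The saving comes from the deltas fitting in fewer bytes than the original values: a column of numbers in the millions that differ by only a few units can be stored as small offsets plus one full-width reference.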

Though there are multiple other encoding schemes such as Zstandard, MostlyN and the AWS-specific AZ64, the aforementioned encoding schemes are the ones used in most cases.


