Does this simple compression techniques plays a major role in columnar databases?
In order to efficiently store data, columnar databases apply a lot of compression techniques. Also, compression techniques helps to 1) reduce storage requirements, 2) reduce disk I/O, 3) improves query performance. Let us take a look at the details of encoding schemes along with examples.
- Dictionary Encoding : This is applicable for most of the cases where a column is having same values repeated in most occasions. For example, a gender column can have only two values. Instead of having values as “male” (or) “female” for all the records in the table, dictionary encoding replaces male and female with 0 and 1 respectively. This in turn reduces the overall space required by the column by mapping larger column values to smaller ones.
- Run-length Encoding : This encoding is applicable for a column that is sorted based on its values. For example, a daily ETL job inserts thousands of records with same date value (like 2021-Jan-01). Run-length encoding stores the values as “2021-Jan-01”; “200 to 760” (starting to ending row number) a mapping metadata instead of storing the value for every record.
- LZO (Lempel–Ziv–Oberhumer) : Lossless data compression algorithm that works especially well for CHAR and VARCHAR columns that store very long character strings.
- Delta Encoding : Delta encoding generates the 1)offset from the average values or 2) difference from the previous value and store it as a column’s value. This reduces huge space if the column chosen for compression is a bigint, decimal or double. For example, imagine a table has column value ranges from 1,000,000 (1 million) to (10 million). Instead of storing the value as is, the difference between the average values will be stored as part of delta encoding.
Let us assume we have numbers like 98, 100, 102..etc. The average for the numbers is 100. So, above numbers are stored as -2, 0, +2 instead of original values.
Though there are multiple other encoding schemes like Zstandard, MostlyN and AWS specific AZ64, aforementioned encoding schemes are the ones used in most of the cases.