Do these simple compression techniques play a major role in columnar databases?

  1. Dictionary Encoding : This applies when a column repeats the same few values across many rows. For example, a gender column can hold only two values. Instead of storing “male” or “female” for every record in the table, dictionary encoding replaces male and female with 0 and 1 respectively. This reduces the overall space required by the column by mapping larger column values to smaller codes.
  2. Run-length Encoding : This encoding works best on a column that is sorted by its values. For example, a daily ETL job inserts thousands of records with the same date value (like 2021-Jan-01). Instead of storing the value for every record, run-length encoding stores mapping metadata such as “2021-Jan-01”: rows 200 to 760 (starting to ending row number).
  3. LZO (Lempel–Ziv–Oberhumer) : A lossless data compression algorithm that works especially well for CHAR and VARCHAR columns that store very long character strings.
  4. Delta Encoding : Delta encoding stores either 1) the offset from the average value or 2) the difference from the previous value as the column’s value. This saves significant space when the compressed column is a BIGINT, DECIMAL, or DOUBLE. For example, imagine a column whose values range from 1,000,000 (1 million) to 10,000,000 (10 million). Instead of storing each value as is, delta encoding stores the difference from the average (or the previous) value, which fits in far fewer bytes.
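
Dictionary encoding (item 1) can be sketched in a few lines of Python; the function names here are illustrative, not from any particular database engine:

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary = {}  # value -> code
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Invert the dictionary to recover the original column."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[c] for c in codes]

column = ["male", "female", "female", "male", "male"]
dictionary, codes = dictionary_encode(column)
# dictionary == {"male": 0, "female": 1}; codes == [0, 1, 1, 0, 0]
assert dictionary_decode(dictionary, codes) == column
```

The repeated strings are stored once in the dictionary; the column itself becomes a list of small integers.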
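Run-length encoding (item 2) works the same way as the date example above: store one (value, start row, end row) entry per run instead of one value per record. A minimal sketch:

```python
def run_length_encode(sorted_column):
    """Store (value, start_row, end_row) runs instead of every value."""
    runs = []
    for row, value in enumerate(sorted_column):
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1], row)  # extend the current run
        else:
            runs.append((value, row, row))        # start a new run
    return runs

def run_length_decode(runs):
    """Expand each run back into individual row values."""
    out = []
    for value, start, end in runs:
        out.extend([value] * (end - start + 1))
    return out

column = ["2021-Jan-01"] * 4 + ["2021-Jan-02"] * 3
runs = run_length_encode(column)
# runs == [("2021-Jan-01", 0, 3), ("2021-Jan-02", 4, 6)]
assert run_length_decode(runs) == column
```

Note that the compression ratio collapses if the column is not sorted, because every value change starts a new run.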
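LZO itself (item 3) is not in the Python standard library (it needs the third-party python-lzo package), but the standard-library zlib codec illustrates the same lossless round trip on the kind of long, repetitive string a VARCHAR column might hold:

```python
import zlib

# A long, repetitive string of the kind a VARCHAR column might store.
text = ("columnar databases compress long character strings well " * 200).encode("utf-8")

compressed = zlib.compress(text)
assert zlib.decompress(compressed) == text  # lossless: the original is recovered exactly
assert len(compressed) < len(text)          # repetitive text shrinks substantially
```

LZO trades a slightly worse compression ratio than zlib for much faster decompression, which is why column stores favor it for string-heavy columns.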
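Delta encoding (item 4, in its difference-from-previous-value form) can be sketched like this; again the function names are illustrative:

```python
def delta_encode(values):
    """Store the first value, then each value's difference from its predecessor."""
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original values by running-summing the deltas."""
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

column = [1_000_000, 1_000_050, 1_000_125, 1_000_200]
deltas = delta_encode(column)
# deltas == [1000000, 50, 75, 75] — small numbers pack into far fewer bytes
assert delta_decode(deltas) == column
```

After the first entry, only small deltas remain, so a BIGINT column of millions can be stored in one or two bytes per value.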
