Redshift : Enhance performance by choosing right distribution keys
Redshift, being a Massive Parallel Processing (MPP) system that supports data warehousing /OLAP (OnLine Analytics Processing) possess most of the characteristics that are unique to other warehousing systems like Teradata..etc. This article specifically focusses on Distribution key, which is known as primary index in Teradata. Redshift allows more flexibility by providing options to choose distribution style compared to other warehousing systems.
Why Distribution key is important? All MPP systems execute code in parallel across multiple chunks of data to achieve better performance. Improper ways of choosing distribution key leads to increase in query execution time.
For example, consider this following table that contains two columns name and gender with 3 rows having same value for gender column.
In order to understand the performance impact of distribution keys, let us consider a cluster having one node. In this case, same node needs to act as both leader and compute node in Redshift.
Distributing the records based on gender column might results in all 3 records to fall within same node slice (partition of a node and depends on chosen Redshift instance type). This degrades performance, since parallel operation couldn’t be performed.
To improve the performance in this case, records should get distributed on more meaningful column that has unique value distribution i.e name column for this table. Distributing the records based on this columns leads to have same number of records across all node slices. It has been explained in the below diagram where each record goes into separate node slice.
Distribution Styles Supported in Redshift:
- ALL : Selecting ALL distribution keeps the entire copy of tables in all node slices. This option can be chosen only if the table size is small, because it keeps redundant copies of data. If ALL distribution is chosen for above table, then it distributes rows like FIG 1.3. Note: For Apache Spark users, consider this as Broadcast operation that keeps entire dataframe in memory.
2. KEY : Key distribution works the same way as explained in FIG 1.1 and FIG 1.2. This distribution is beneficial for columns used mostly in join operations.
3. EVEN : This distribution would be helpful for tables in denormalized form (couldn’t form primary key) or the table that doesn’t involve joins. It can be preferred for table that doesn’t fall under ALL or KEY distribution.
4. AUTO : If AUTO is selected as distribution style, Redshift automatically choose and change the distribution style based on table size, values of column..etc.