Delta Lake on EMR
Apologies if the heading is misleading: Databricks and AWS compete in the data engineering and machine learning space, so you might be wondering how a Databricks product can be installed on AWS EMR. To clear the air, this article explains how to get the full benefit of the open source version of Delta Lake on EMR.
Delta Lake is an open source storage layer that brings reliability to data lakes by offering core functionality such as 1) ACID transactions, 2) versioning and time travel, 3) updates and deletes, and 4) schema validation. It is a robust storage solution designed specifically to work with Apache Spark.
Below are the activities that need to be performed to start working with Delta Lake on EMR.
- Installation of the delta-core jar: As with other jars, delta-core_<version>.jar needs to be placed inside the jars folder under the Spark home path on the master node. On an EMR cluster, that folder is /usr/lib/spark/jars. This step can also be automated by putting the required commands in a script file and running it as an EMR step.
- Required configurations: The configurations can be set either in the Spark configuration file on the master node or as additional configuration in the Zeppelin interpreter (if a Zeppelin notebook is used as the development environment). The two mandatory configurations required to access Delta Lake functionality are:
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
- Optional configurations: In addition to the mandatory configurations above, one of the following configurations needs to be set, depending on the filesystem.
For S3, spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
For HDFS, spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore
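The steps above can be sketched as a small Python helper that a setup script (for example one run as an EMR step) might use. This is only an illustration under stated assumptions: the function names are hypothetical, the jar is assumed to have been uploaded to an S3 location of your choosing, and the copy command is only built, not executed. The configuration keys, class names, and the /usr/lib/spark/jars path are the ones quoted in this article.

```python
"""Sketch: render the Delta Lake settings from this article in usable forms."""

# Mandatory settings, as listed in the article.
REQUIRED_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}

# Filesystem-specific LogStore classes, as listed in the article.
LOG_STORE_CLASS = {
    "s3": "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore",
    "hdfs": "org.apache.spark.sql.delta.storage.HDFSLogStore",
}

SPARK_JARS_DIR = "/usr/lib/spark/jars"  # EMR jars folder named in the article


def delta_conf(filesystem=None):
    """Return the settings as a dict; filesystem is 's3', 'hdfs', or None."""
    conf = dict(REQUIRED_CONF)
    if filesystem is not None:
        conf["spark.delta.logStore.class"] = LOG_STORE_CLASS[filesystem.lower()]
    return conf


def as_spark_defaults(conf):
    """Format settings as 'key value' lines, the layout spark-defaults.conf uses."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


def as_submit_args(conf):
    """Format settings as repeated '--conf key=value' arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def jar_copy_command(jar_uri):
    """Build (not run) an AWS CLI command copying the jar into Spark's jars folder.

    Assumption: jar_uri is an S3 location where you uploaded the delta-core jar.
    """
    return ["sudo", "aws", "s3", "cp", jar_uri, SPARK_JARS_DIR + "/"]


if __name__ == "__main__":
    # Print the lines one would append to the Spark configuration file for S3.
    print(as_spark_defaults(delta_conf("s3")))
```

Keeping the settings in one dictionary means the same values can feed the Spark configuration file, a spark-submit command line, or a Zeppelin interpreter form without being retyped.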
After following the steps above, we should be ready to explore the capabilities of Delta Lake.
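A quick way to check the setup is a small smoke test run from a PySpark shell or a Zeppelin paragraph. The sketch below is a minimal example, not an official procedure: the function name and the scratch path are hypothetical, and it assumes `spark` is an existing SparkSession started with the jar and configurations described above, and that `path` does not already contain a table.

```python
def delta_smoke_test(spark, path):
    """Write two versions of a tiny Delta table, then read both back.

    spark: an existing SparkSession (e.g. the one the pyspark shell provides).
    path:  an empty scratch location you choose (hypothetical in this sketch).
    """
    # First commit (version 0): five rows with ids 0..4, stored in Delta format.
    spark.range(5).write.format("delta").mode("overwrite").save(path)
    # Second commit (version 1): overwrite the table with ten rows.
    spark.range(10).write.format("delta").mode("overwrite").save(path)

    # A plain read returns the latest version of the table.
    latest = spark.read.format("delta").load(path)
    # Time travel: read the table as it was at the first commit.
    first = spark.read.format("delta").option("versionAsOf", 0).load(path)

    return first.count(), latest.count()
```

If the installation and configuration are in place, a call such as `delta_smoke_test(spark, "s3://<your-bucket>/tmp/delta-smoke")` on a fresh path should return different row counts for the two versions, which exercises both transactional writes and time travel.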