Delta Lake on EMR

Apologies if the heading seems misleading: we all know that Databricks and AWS compete in the data engineering and machine learning space, so you might be wondering how a Databricks-created product can be installed on AWS EMR. To clear the air, this article explains how to leverage the full benefits of the open source version of Delta Lake on EMR.

Delta Lake is an open source storage layer that brings reliability to data lakes by offering core functionality such as 1) ACID transactions, 2) versioning and time travel, 3) updates and deletes, and 4) schema validation. It was designed specifically to work with Apache Spark.

Below are the steps that need to be performed to start working with Delta Lake on EMR.

1. Installation of the Delta core jar: As with other jars, delta-core_<version>.jar needs to be placed in the jars folder under the Spark home path on the master node; on an EMR cluster, this is /usr/lib/spark/jars. The installation can also be automated by placing the required commands in a script file and executing it as an EMR step.
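The installation step above can be scripted and run as an EMR step or bootstrap action. Here is a minimal sketch; the Scala version (2.12) and Delta version (0.8.0) are assumptions, so substitute the release that matches the Spark version on your cluster:

```shell
#!/bin/bash
# Sketch of an EMR step script that installs the Delta core jar on the
# master node. SCALA_VERSION and DELTA_VERSION are assumptions -- use the
# release matching your cluster's Spark version.
set -euo pipefail

SCALA_VERSION="2.12"
DELTA_VERSION="0.8.0"
SPARK_JARS_DIR="/usr/lib/spark/jars"

# Download delta-core from Maven Central into Spark's jars directory
sudo curl -fsSL \
  -o "${SPARK_JARS_DIR}/delta-core_${SCALA_VERSION}-${DELTA_VERSION}.jar" \
  "https://repo1.maven.org/maven2/io/delta/delta-core_${SCALA_VERSION}/${DELTA_VERSION}/delta-core_${SCALA_VERSION}-${DELTA_VERSION}.jar"
```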

2. Mandatory configurations: The following Spark properties must be set (for example, in spark-defaults.conf) to enable Delta Lake.

spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension

spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

3. Optional configurations: In addition to the mandatory configurations above, depending on the filesystem, the following property needs to be set.

For S3: spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore

For HDFS: spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore
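Instead of editing spark-defaults.conf, the mandatory configurations and the filesystem-specific log-store setting can also be supplied at submit time. A minimal sketch, assuming an S3-backed table; my_job.py is a placeholder for your own Spark application:

```shell
# Sketch: passing the Delta Lake configurations to spark-submit on the
# EMR master node. my_job.py is a placeholder for your own application.
spark-submit \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore" \
  my_job.py
```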

After following the steps above, we should be good to explore the capabilities of Delta Lake.
