Delta Lake Clones: Systematic Approach for Testing, Sharing data

Balachandar Paulraj
4 min readApr 16, 2022

--

Let’s begin with some issues faced in data engineering projects, followed by usage of Delta Lake clones and let’s take a final step by resolving issues.

What’s clone in Delta Lake? It’s just a replica of a source table at a given point in time. In other database terminology, we call it as Snapshot. In that case, what would be the characteristics and benefits of cloned table compared to source table? Do we really need cloned table and where can it be useful? We are right in asking the questions. Let’s understand its characteristics followed by benefits section.

Characteristics of Cloned Table compared to Source Table:

Benefits of using Delta Lake clone:

Though clones can be used on numerous occasions for data engineering projects, below are some places where Delta lake clones simplifies the overall process in minimal amount of time.

  1. Creation of Testing Environment: Creating a Test Database in Staging Environment for development and testing is always a hectic process. It requires huge amount of time and if the process dies in between, it should be started from the scratch again. Also, there is a possibility that views, triggers, constraints and table schema might not get copied in few cases. Creating a clone will be very beneficial here by reducing the overall time and efforts required for completing this process.
  2. Backup Database for Disaster Recovery: For DR, data governance and auditing, instead of creating a separate instance of database, production system can be cloned easily in few lines of code.
  3. Sharing/Archiving Data: Production data can be shared easily to to other teams without worrying about any changes in it.

Demo:

Delta Lake clone can be either Shallow or Deep. Deep Clone copies the complete and independent copies of source table along with data. However, Shallow clone only makes copy of source table’s definition. It refers or points to source table data.

For this demo, let us consider a delta table available in S3 path dbx_clone_medium/delta_testing which consists of recipe dataset. Current data has been formed as part of initial create table (CTAS) followed by 3 append operations. In this case, let us assume that data as of version 2 is required for clone operations. For the given use case, below example code covers more about Shallow and Deep clone.

Shallow clone:

As mentioned earlier, creating a shallow clone only makes a copy of source table definition without copying actual data. The same has been shown here after creating the shallow clone. Initial SQL is to create shallow clone from base delta table of version 2. Output of the SQL itself shows that no files are copied (num_copied_files). The same can be verified by checking the list of contents in S3 folder. S3 path of shallow clone table dbx_clone_medim/delta_testing has nothing other than _delta_log folder to keep track of metadata and base table.

Shallow Clone creation
S3 contents of Shallow Cloned table

Deep Clone:

Deep clone copies the contents of base table along with source table definition. SQL Syntax to create Deep clone table is same as Shallow Clone. After creation of Deep clone table, original data from base table has been copied and it has been reflected in both num_copied_files and S3 path.

Conclusion:

Clones are not Views in database. It has it’s own pros and cons depending on your use case. Understand the differences between Shallow clone, Deep Clone, Views and use it wisely for your requirement.

References:

  1. https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-clone.html
  2. https://databricks.com/blog/2020/09/15/easily-clone-your-delta-lake-for-testing-sharing-and-ml-reproducibility.html#:~:text=Shallow%20clones%20are%20great%20for,of%20the%20table%20being%20cloned

--

--