Simplify ETL Pipelines using Delta Live Tables

Characteristics of Live Tables:

  1. Delta Live Tables, introduced by Databricks, make it easier to create and manage data pipelines that deliver curated, high-quality data on Delta Lake.
  2. They simplify pipeline creation through a declarative pipeline model, automatic data testing, less verbose code, and better insight into job monitoring and recovery.
  3. By specifying only the data source, the transformation logic, and the target system, robust end-to-end data pipelines can be built easily. This reduces the manual effort and time required to design and assemble data processing pipelines.
  4. They protect pipelines against corrupted files and bad data through data quality (DQ) validation and integrity checks.
  5. Batch and streaming workloads can be handled with the same code.
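
To make point 5 concrete, here is a minimal sketch of a batch and a streaming ingest that share the same SELECT shape. It assumes the classic Delta Live Tables SQL syntax (`STREAMING LIVE TABLE` with the Auto Loader function `cloud_files`). The table names `songs_batch` and `songs_stream` and the `s3://example-bucket/...` path are hypothetical placeholders, not part of the demo.

```sql
-- Batch: the table is recomputed from the source on each pipeline update.
-- The S3 path is a hypothetical placeholder.
CREATE OR REFRESH LIVE TABLE songs_batch
COMMENT "Batch ingest of song files"
AS SELECT * FROM parquet.`s3://example-bucket/songs_list`;

-- Streaming: same SELECT shape, but Auto Loader (cloud_files) processes
-- only the files that have arrived since the previous update.
CREATE OR REFRESH STREAMING LIVE TABLE songs_stream
COMMENT "Incremental ingest of song files"
AS SELECT * FROM cloud_files("s3://example-bucket/songs_list", "parquet");
```

Only the table type and the source function differ between the two, so downstream transformation logic can usually be written once.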

Demo:

  1. Raw Live Table: In most data engineering use cases, raw data is already available in a database or filesystem. Although most production workloads depend on curated, cleansed, or aggregated data, it is always good to keep the raw data available in some form. A live table can be created from data available in S3, and cleansed and aggregated tables can then be derived from it, as below:
-- Raw table: loads the Parquet files from S3 as they are.
CREATE LIVE TABLE live.songs_list_raw_tbl
COMMENT "Raw table created to experiment with Delta Live Tables. Holds details on songs and their corresponding artists"
AS SELECT * FROM parquet.`s3://delta-live-tables-testing/songs_list`;

-- Cleansed table: projects the needed columns and applies expectations.
-- not_null_artist only records violations; latest_decade fails the update.
CREATE LIVE TABLE live.clnsd_prjcd_songs(
  CONSTRAINT not_null_artist EXPECT (artist_name IS NOT NULL),
  CONSTRAINT latest_decade EXPECT (year > 1999) ON VIOLATION FAIL UPDATE
)
COMMENT "Projected, cleansed version of the raw live table"
AS SELECT
  artist_id,
  artist_name,
  duration,
  release,
  song_id,
  title,
  year
FROM live.songs_list_raw_tbl;

-- Aggregated table: artists with at least 25 songs released before 2010.
CREATE LIVE TABLE live.aggregated_play_list
COMMENT "Aggregated data from clnsd_prjcd_songs"
AS SELECT
  artist_id,
  count(song_id) AS num_albums
FROM live.clnsd_prjcd_songs
WHERE year < 2010
GROUP BY artist_id
HAVING count(song_id) >= 25;
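
The cleansed table above uses two of the three behaviours an expectation can have when a row violates it. The sketch below puts all three side by side. The table name `clnsd_songs_strict` and the `positive_duration` constraint are hypothetical and added only for illustration; the other two constraints are taken from the demo.

```sql
CREATE OR REFRESH LIVE TABLE clnsd_songs_strict(
  -- No ON VIOLATION clause: violating rows are kept in the table and
  -- the violation is recorded in the pipeline's data quality metrics.
  CONSTRAINT not_null_artist EXPECT (artist_name IS NOT NULL),
  -- DROP ROW: violating rows are discarded before the table is written.
  -- (Hypothetical constraint, not part of the original demo.)
  CONSTRAINT positive_duration EXPECT (duration > 0) ON VIOLATION DROP ROW,
  -- FAIL UPDATE: a single violating row aborts the pipeline update.
  CONSTRAINT latest_decade EXPECT (year > 1999) ON VIOLATION FAIL UPDATE
)
COMMENT "Illustrates the three expectation behaviours"
AS SELECT
  artist_id,
  artist_name,
  duration,
  song_id,
  title,
  year
FROM live.songs_list_raw_tbl;
```

Which behaviour fits depends on the cost of bad data: recording violations suits exploratory checks, `DROP ROW` suits filtering known-bad records, and `FAIL UPDATE` suits rules that downstream consumers rely on strictly.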

Conclusion:

  1. Why Lazy Evaluation enhances performance and makes Apache Spark distinct?
  2. Is Hadoop complex than other database?
  3. Move S3 Objects faster without any hurdles
