Can Snowpark supersede Databricks and AWS EMR?

Photo by Aaron Burden on Unsplash

Introduction

Though Snowflake keeps everything in a single system and doesn’t make developers worry much about storage (Snowflake uses cloud storage under the hood), its query editor environment lacks support for building ML jobs or ETL pipelines. Snowflake addresses this gap with the introduction of Snowpark.

Snowpark

  1. The objective of Snowpark is to let developers work right where the data lives, instead of in a separate environment (such as a Databricks notebook or EMR’s Zeppelin notebook).
  2. Provides an intuitive API for querying data and executing complex ETL/ELT pipelines directly on Snowflake.
  3. Provides a scalable, secure compute engine for processing large volumes of data, with support for elastic scaling.
  4. Enables compute directly in Snowflake through a DataFrame API, supported in Java, Scala and Python (a minimal sketch follows this list).
  5. Offers features such as intelligent code completion and type checking.
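To make the DataFrame API concrete, here is a minimal sketch in Snowpark for Python. The connection parameters are placeholders for your own account, and the ORDERS table with its ORDER_ID and AMOUNT columns is a hypothetical example, not something from Snowflake itself.

```python
# Minimal Snowpark for Python sketch (connection values are placeholders).
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<your_account>",
    "user": "<your_user>",
    "password": "<your_password>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}

# The Session is the entry point; all compute runs on Snowflake's engine.
session = Session.builder.configs(connection_parameters).create()

# A lazily evaluated DataFrame over an existing table (no rows are pulled yet).
orders = session.table("ORDERS")  # hypothetical table name

# Type-checked, autocomplete-friendly transformations.
large_orders = orders.filter(col("AMOUNT") > 1000).select("ORDER_ID", "AMOUNT")
large_orders.show()  # triggers execution inside Snowflake
```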

Snowpark compared with Databricks and AWS EMR

Though Snowpark has its own distinctive features, the table below summarizes the characteristics it shares with, and those that differ from, Spark running in a Databricks environment or on AWS EMR.

Snowpark compared to Spark

To summarize the table above, Snowpark retains the advantages of Spark (listed under shared characteristics) while adding benefits related to infrastructure, performance and data transfer (the differing characteristics).

To clarify further, the compute and storage layers of Spark running in its own environment versus Snowpark running on Snowflake are presented in the following diagrams. The data movement between the storage and compute layers in Spark adds latency and hurts performance.

Compute and Storage for Processing through Spark

The Snowpark API, however, sends code to Snowflake and executes it there directly. This drastically reduces the overall time that ETL jobs and machine learning pipelines spend on data transformation and processing. Snowpark keeps the data in Snowflake to avoid needless data movement, executes the required transformations on Snowflake’s engine, and scales automatically. It does all of this without the need to create separate infrastructure.
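As a hedged sketch of this pushdown behavior, the pipeline below compiles to SQL and runs entirely inside Snowflake; no rows are transferred to the client until results are requested. The table and column names are hypothetical, continuing the earlier example.

```python
# The whole pipeline compiles to SQL and executes inside Snowflake.
from snowflake.snowpark.functions import avg, col, count

summary = (
    session.table("ORDERS")                      # hypothetical table
    .filter(col("STATUS") == "SHIPPED")
    .group_by("CUSTOMER_ID")
    .agg(
        count("ORDER_ID").alias("N_ORDERS"),
        avg("AMOUNT").alias("AVG_AMOUNT"),
    )
)

# Inspect the generated SQL that Snowpark pushes down for execution.
print(summary.queries)

# Persist the result as a new table, again without leaving Snowflake.
summary.write.save_as_table("ORDER_SUMMARY", mode="overwrite")
```

Because the transformations are expressed lazily and translated to SQL, the Snowflake warehouse does all the work, which is exactly how Snowpark avoids the storage-to-compute data movement shown in the Spark diagram above.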

Compute and Storage for Processing through Snowpark

Conclusion

Snowpark brings Spark-style DataFrame development to where the data already lives, removing the data movement and separate infrastructure that Spark on Databricks or EMR requires. For teams whose data is already in Snowflake, it is a compelling alternative. If you’re interested, please check my other posts on Data Engineering as well.
