Can Snowpark supersede Databricks and AWS EMR?
Ever since the introduction of Hadoop, compute and storage have been treated as two separate layers (unlike OLTP databases, where the two are tightly coupled). For example, MapReduce, Spark, Flink, and Storm are frameworks for processing big data in a distributed environment, whereas HDFS and cloud object stores (S3, Azure Blob Storage, Google Cloud Storage) are used to store data. Although separating compute from storage offers multiple benefits, one concern is the data movement between the storage and compute systems.
Snowflake, by contrast, presents everything as a single system, so developers do not have to worry much about storage (Snowflake uses cloud storage under the hood). However, its query editor environment is not suited to building ML jobs or ETL pipelines. Snowflake addresses this gap with the introduction of Snowpark.
Before comparing Snowpark with Spark, let us look at the characteristics of Snowpark:
- The objective of Snowpark is to let developers work right where the data lives, instead of in a separate development environment (such as a Databricks notebook or a Zeppelin notebook on EMR).
- It provides an intuitive API for querying data and running complex ETL/ELT pipelines directly on Snowflake.
- It provides a scalable, secure compute engine for processing large volumes of data, with support for elastic scaling.
- It enables compute directly in Snowflake through a DataFrame API, supported in Java, Scala, and Python.
- It supports features such as intelligent code completion and type checking.
Snowpark compared with Databricks and AWS EMR
By enabling compute directly in Snowflake through its own DataFrame API, available in multiple programming languages, Snowpark represents a paradigm shift in the current data stack for data stored in the cloud.
Although Snowpark has distinctive features, the table below summarizes how it is similar to and different from Spark running in a Databricks environment or on AWS EMR.
To summarize the table, Snowpark has the advantages of Spark (listed under the similar characteristics) along with additional benefits related to infrastructure, performance, and data transfer (listed under the different characteristics).
To clarify further, the following diagrams show the compute and storage layers of Spark running in its own environment versus Snowpark running on Snowflake. With Spark, data has to move between the storage and compute layers, which adds latency and affects performance.
The Snowpark API, however, can send code to Snowflake and execute it there directly. This helps ETL jobs and machine learning pipelines reduce the overall time required for data transformation and processing. The Snowpark API keeps the data in Snowflake to avoid needless data movement, executes the required transformations using Snowflake's compute, and scales automatically. It does all of this without the need to set up separate infrastructure.
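To make the idea of executing transformations where the data lives more concrete, here is a minimal toy sketch in Python. It is not the Snowpark API: the `LazyFrame` class, its methods, and the `orders` table are hypothetical names invented for illustration. It only shows the general pushdown pattern, in which DataFrame-style calls are recorded lazily and compiled into a single SQL statement that the warehouse can run, so no rows need to be pulled into the client.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Toy stand-in for a pushdown-style DataFrame (hypothetical, not Snowpark)."""

    table: str
    filters: Tuple[str, ...] = ()
    columns: Tuple[str, ...] = ("*",)

    def filter(self, condition: str) -> "LazyFrame":
        # Record the predicate; nothing is executed yet.
        return LazyFrame(self.table, self.filters + (condition,), self.columns)

    def select(self, *columns: str) -> "LazyFrame":
        # Record the projection; still no data is touched.
        return LazyFrame(self.table, self.filters, tuple(columns))

    def to_sql(self) -> str:
        # Compile the recorded steps into one SQL statement that
        # could be sent to the warehouse for execution.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        return sql


# "orders" is a hypothetical table name used only for this example.
orders = LazyFrame("orders")
query = orders.filter("amount > 100").select("customer_id", "amount").to_sql()
print(query)
```

In this sketch only the generated SQL text would travel to the warehouse. A real client library handles far more (sessions, types, joins, UDFs), but the division of labor is similar in spirit: the client describes the work, and the engine next to the data performs it.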
Although Snowpark is still evolving, it offers a glimpse of a new approach to compute in the modern data stack. It is also good to see enhancements in the data space that bring healthy competition among the players.