Why is Apache Arrow faster with PySpark?

Apache Arrow defines a language-independent columnar memory format for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.
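To make "columnar memory format" a bit more concrete, here is a minimal sketch in plain Python. It is not Arrow's actual layout (real Arrow buffers also carry validity bitmaps, offsets and metadata); the four records and the names `rows`, `ids` and `amounts` are made up for illustration. It only shows the idea that each column lives in one contiguous, typed buffer that can be scanned or handed over as raw bytes.

```python
from array import array

# Four hypothetical records stored row by row (one Python tuple per record).
rows = [(1, 10.5), (2, 20.0), (3, 7.25), (4, 3.0)]

# The same data stored column by column: each column is one contiguous typed buffer.
ids = array("q", (r[0] for r in rows))      # signed 64-bit integers
amounts = array("d", (r[1] for r in rows))  # 64-bit floats

# A column scan reads a single buffer instead of visiting every row object.
total = sum(amounts)

# A whole column can be moved as raw bytes, with no per-value conversion.
payload = amounts.tobytes()
restored = array("d")
restored.frombytes(payload)

print(total, len(payload), restored == amounts)
```

The row layout has to be walked object by object, while the columnar layout moves as a few large buffers. That difference is what the rest of this post relies on.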

Before looking at the performance improvements Apache Arrow provides, let’s quickly go through what usually happens between PySpark and Pandas without the Arrow format. To make this concrete, consider a scenario where we need to convert a Spark DataFrame (with 4 rows) to a Pandas DataFrame.

SPARK DATAFRAME TO PANDAS WITHOUT ARROW:

The diagram above outlines the process required to convert a Spark DataFrame to a Pandas DataFrame. Let’s go through that process in detail.

Cons:

SPARK DATAFRAME TO PANDAS THROUGH ARROW:

Let’s see how Apache Arrow format helps to expedite the processing for the aforementioned scenario.
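As a rough sketch of how this looks in code, the helper below switches Arrow on or off before calling `toPandas()`. The configuration key `spark.sql.execution.arrow.pyspark.enabled` is the one used by Spark 3.0 and later (older releases used `spark.sql.execution.arrow.enabled`). The function names `to_pandas` and `demo`, the column names, and the 4-row sample data are my own assumptions for illustration.

```python
# Spark 3.0+ key; older releases used "spark.sql.execution.arrow.enabled".
ARROW_FLAG = "spark.sql.execution.arrow.pyspark.enabled"


def to_pandas(spark, sdf, use_arrow=True):
    """Convert a Spark DataFrame to Pandas with Arrow switched on or off."""
    spark.conf.set(ARROW_FLAG, "true" if use_arrow else "false")
    return sdf.toPandas()


def demo():
    # Imported here so the helper above can be read without Spark installed.
    # Assumes pyspark, pandas and pyarrow are available in the environment.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("arrow-demo").getOrCreate()
    )
    # Hypothetical 4-row DataFrame matching the scenario in this post.
    sdf = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, "c"), (4, "d")], ["id", "label"]
    )
    print(to_pandas(spark, sdf, use_arrow=False))  # row-based path
    print(to_pandas(spark, sdf, use_arrow=True))   # Arrow path
    spark.stop()
```

Calling `demo()` on a machine with Spark installed should print the same four rows twice. On a DataFrame this small the timing difference is negligible, so try a much larger one if you want to measure the gain.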

This post covers the benefits of Arrow for only one use case: conversion from a Spark DataFrame to Pandas and vice versa. Arrow has other benefits too, and I’m planning to cover them in upcoming posts. Thank you for your time and for reading this post.
