Why is Apache Arrow faster with PySpark?
Apache Arrow defines a language-independent columnar memory format for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.
Before going through the performance improvements provided by Apache Arrow, let’s quickly look at the usual process in PySpark and Pandas without the Arrow format. To evaluate this, consider a scenario where we need to convert a Spark DataFrame (with 4 rows) to a Pandas DataFrame.
SPARK DATAFRAME TO PANDAS WITHOUT ARROW:
The diagram above summarizes the process of converting a Spark DataFrame to a Pandas DataFrame. Let’s walk through the steps in detail.
- Before the conversion, Spark pulls all the records to the driver node.
- Each row is serialized into Python’s pickle format.
- The pickled data is sent to the Python worker process.
- The worker process deserializes (unpickles) each record into a list of tuples.
- A Pandas DataFrame is created from that list using the from_records() method.
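The per-row path above can be sketched in plain Python. This is a simplified simulation: in reality the pickled data crosses the JVM-to-Python boundary, and the column names here are purely illustrative.

```python
import pickle

import pandas as pd

# Rows as they might arrive on the driver (4 rows, matching the scenario above).
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

# Each row is pickled individually (standing in for the JVM -> Python transfer)...
pickled = [pickle.dumps(row) for row in rows]

# ...then unpickled one by one back into a list of tuples...
records = [pickle.loads(p) for p in pickled]

# ...and finally converted to a Pandas DataFrame record by record.
pdf = pd.DataFrame.from_records(records, columns=["id", "value"])
print(pdf.shape)  # (4, 2)
```

Every record passes through the serialize/deserialize cycle individually, which is exactly the overhead the next section describes.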
- Since all the records are moved to the driver, in most cases the driver memory cannot hold a huge DataFrame and the job may fail. This can be fixed by increasing the driver memory, but that is an additional configuration step.
- Python serialization/deserialization is slow, because each record is pickled and unpickled individually.
- Building a pandas DataFrame with from_records iterates slowly over each record to convert it to the Pandas format.
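As noted above, increasing the driver memory is an extra configuration step. A typical way to set it is shown below; the 8g value and the application file name are illustrative.

```shell
# Set driver memory at submit time (value is illustrative):
spark-submit --conf spark.driver.memory=8g your_app.py

# Or when building the session in code:
# SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()
```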
SPARK DATAFRAME TO PANDAS THROUGH ARROW:
Let’s see how Apache Arrow format helps to expedite the processing for the aforementioned scenario.
- Since Arrow uses a language-independent columnar memory format, per-row pickling is no longer needed: Arrow data can be sent directly to the Python process.
- PyArrow can use zero-copy methods to create a pandas.DataFrame from entire chunks of data at once, instead of processing individual records as from_records does.
This post only covers the benefits of Arrow for one use case: converting a Spark DataFrame to Pandas and vice versa. Arrow brings other benefits as well, which I’m planning to cover in upcoming blogs. Thank you for taking the time to read this post.