Why is Apache Arrow faster with PySpark?

Without Arrow, calling toPandas() on a Spark DataFrame goes through these steps:

  1. Before conversion, Spark pulls all the records to the driver node.
  2. Each row is serialized into Python’s pickle format.
  3. The pickled data is sent to the Python process.
  4. The Python process deserializes (unpickles) each record into a list of tuples.
  5. A pandas DataFrame is created from that list with the from_records() method (a minimal sketch of steps 2–5 follows this list).
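
To make the per-record cost concrete, here is a minimal, self-contained sketch of steps 2–5 in plain Python. The rows and column names are made up, and plain pickle stands in for the Pyrolite-based wire format Spark actually uses:

```python
import pickle

import pandas as pd

# Made-up rows standing in for records collected from Spark.
rows = [(1, "alice", 3.14), (2, "bob", 2.72)]

# Steps 2-3: each row is pickled and shipped as bytes.
wire = [pickle.dumps(row) for row in rows]

# Step 4: the Python process unpickles every record, one by one.
records = [pickle.loads(blob) for blob in wire]

# Step 5: from_records() iterates over the list to build the DataFrame.
df = pd.DataFrame.from_records(records, columns=["id", "name", "score"])
print(df)
```

Every record passes through dumps(), loads(), and the from_records() loop individually, which is exactly where the time goes.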
This approach has three problems:

  1. Because every record is moved to the driver, a large DataFrame often cannot fit in driver memory and the job fails. Increasing the driver memory works around this, but it is an additional step (see the sketch after this list).
  2. Python’s pickle serialization and deserialization are slow, and the cost is paid for every single record.
  3. Building the pandas DataFrame with from_records() iterates over each record one at a time to convert it to pandas format, adding another slow pass.
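
For the memory workaround in point 1, the driver memory has to be raised before the driver JVM starts. A minimal sketch, assuming you build the session yourself (the app name and the 8g value are examples, not recommendations):

```python
from pyspark.sql import SparkSession

# spark.driver.memory only takes effect before the driver JVM starts,
# so set it when building the session (or pass --driver-memory 8g to
# spark-submit); setting it on an already-running session does nothing.
spark = (
    SparkSession.builder
    .appName("topandas-demo")             # hypothetical app name
    .config("spark.driver.memory", "8g")  # example value; size for your data
    .getOrCreate()
)
```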
Enabling Arrow removes both serialization bottlenecks:

  1. Because Arrow is a columnar in-memory format that the JVM and Python can both read, there is no need to serialize each record anymore: Arrow data can be sent directly to the Python process.
  2. PyArrow can use zero-copy methods to create a pandas.DataFrame from entire chunks of data at once, instead of processing individual records the way from_records() does, as shown in the sketch below.
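
A minimal sketch of turning Arrow on for toPandas(). The flag name spark.sql.execution.arrow.pyspark.enabled is the Spark 3.x spelling (Spark 2.x used spark.sql.execution.arrow.enabled), and the toy DataFrame is made up:

```python
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Spark 3.x flag; Spark 2.x used "spark.sql.execution.arrow.enabled".
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sdf = spark.range(1_000_000)  # toy DataFrame with a single "id" column

# With the flag on, Spark ships Arrow record batches to the Python
# process, and PyArrow assembles the pandas DataFrame chunk by chunk.
pdf = sdf.toPandas()

# The same chunk-at-a-time conversion, shown with PyArrow directly;
# to_pandas() is zero-copy where the column types allow it.
table = pa.table({"id": list(range(5))})
print(table.to_pandas())
```

Note that Spark quietly falls back to the non-Arrow path for column types Arrow does not support, so the speedup depends on your schema.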
