Expedite Spark Processing using Parquet Bloom Filter

Hashing values stored in Bloom filter datatype

So, what’s Bloom filter?

How Bloom filter enhance Apache Spark?

The Bloom filter helps Spark to process only selective input files. It operates by either stating that data is definitively not in the file, or that it is probably in the file, with a defined false positive probability (FPP). Through Bloom filter, Spark understands either the records are “possibly in files” or “definitely not in files”.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store