Expedite Spark Processing using Parquet Bloom Filter
So, what's a Bloom filter?
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. A Bloom filter index enables data skipping on chosen columns, and is particularly useful for fields containing arbitrary text.
A Bloom filter can tell you that a key is definitely not in a set, or that it might be in the set, with a configurable false positive probability (FPP). In other words, it can produce false positive results, but it never produces false negatives. That makes it a perfect fit for set-membership problems.
For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered usernames. A false positive means the filter might tell you a given username is already taken when it actually isn't. However, a Bloom filter never produces a false negative, i.e. it will never tell you that a username doesn't exist when it actually does.
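To make the "false positives but never false negatives" property concrete, here is a minimal pure-Python sketch of a Bloom filter. This is not the implementation Spark or Delta uses; the bit-array size, hash count, and SHA-256-based hashing are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests (illustrative choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False -> definitely not in the set; True -> possibly in the set.
        return all(self.bits[pos] for pos in self._positions(item))

usernames = BloomFilter()
for name in ["alice", "bob", "carol"]:
    usernames.add(name)

print(usernames.might_contain("alice"))    # True: an added item is never reported absent
print(usernames.might_contain("mallory"))  # almost always False; a True here would be a false positive
```

Every bit set by `add("alice")` stays set, so a membership check for an added item can never fail; an absent item only tests positive if all of its k positions happen to collide with bits set by other items.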
How does a Bloom filter enhance Apache Spark?
The Bloom filter helps Spark process only the relevant input files. For each file, the filter reports either that the data is definitively not in the file, or that it is probably in the file, with a defined false positive probability (FPP). Spark can therefore classify a lookup value as "possibly in the file" or "definitely not in the file" and skip files in the latter category.
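The file-skipping idea can be sketched as follows. Note this is a toy model, not Delta's actual code path: the file names are made up, and `ToyFilter` uses exact sets, so unlike a real Bloom filter it never produces a spurious "possibly present".

```python
class ToyFilter:
    """Stand-in for a per-file Bloom filter. Backed by an exact set here,
    so no false positives -- a real Bloom filter may also answer True
    for a value that is not actually in the file."""

    def __init__(self, values):
        self._values = set(values)

    def might_contain(self, value):
        return value in self._values

# Hypothetical layout: one membership filter per data file.
file_filters = {
    "part-0000.parquet": ToyFilter(range(0, 5000)),
    "part-0001.parquet": ToyFilter(range(5000, 10000)),
}

def files_to_scan(file_filters, lookup_value):
    """Keep only files whose filter says 'possibly present'."""
    return [name for name, f in file_filters.items()
            if f.might_contain(lookup_value)]  # False means definitely absent

print(files_to_scan(file_filters, 7321))  # ['part-0001.parquet']
```

Only one of the two files survives the pre-check, so the scan touches half the data; with many files and a selective predicate, the savings grow accordingly.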
Pros: It speeds up data lookups when you filter on a column containing highly unsorted, mostly unique values.
Cons: There is a tradeoff between space and speed. The Bloom filter occupies additional space in the filesystem by creating an extra set of metadata files: one index file per data file, named <data_file_name>.index.v1.parquet.
Demo using Databricks Notebook:
Let's start by creating a table with sample columns and a Bloom filter index on top of it. I'm using the Spark SQL interpreter within a Databricks notebook for this demo. Feel free to use the queries below for your initial validation.
CREATE OR REPLACE TABLE bloomFilterTesting
USING DELTA LOCATION 'dbfs:/bala/bloom_test'
AS SELECT id, monotonically_increasing_id() AS id2 FROM range(0, 10000);

CREATE BLOOMFILTER INDEX ON TABLE bloomFilterTesting
FOR COLUMNS (id2 OPTIONS (fpp = 0.1, numItems = 10000));
The first SQL statement creates the table using a CTAS command with two columns: an incremental sequence number (id) and a monotonically increasing id (id2). Once the table is created, the second statement creates a Bloom filter index on id2.
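The fpp and numItems options drive how large the index is. The classic Bloom filter sizing formulas, m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions, give a back-of-the-envelope feel for the space cost; this is textbook sizing, not Delta's exact on-disk layout:

```python
import math

def bloom_sizing(num_items, fpp):
    """Classic Bloom filter sizing: bit count m and hash count k
    for n items at false positive probability p."""
    m = -num_items * math.log(fpp) / (math.log(2) ** 2)  # total bits
    k = (m / num_items) * math.log(2)                    # hash functions
    return math.ceil(m), round(k)

bits, hashes = bloom_sizing(num_items=10_000, fpp=0.1)
print(bits, hashes)   # roughly 48,000 bits (~6 KB) and 3 hash functions
```

At fpp=0.1 that works out to under 5 bits per item; tightening fpp to 0.01 roughly doubles the bits per item, which is the space-versus-accuracy tradeoff mentioned above.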
With the Bloom filter created, it's a good idea to run the OPTIMIZE command (below) to co-locate data based on the column chosen for the Bloom filter.
OPTIMIZE bloomFilterTesting ZORDER BY id2
After running the OPTIMIZE command, you can see a folder named _delta_index within the table directory that contains the index metadata. For my use case, it's under the table location dbfs:/bala/bloom_test. Within that folder, you will find Parquet files with the same names as the original data files, but with index.v1.parquet added as a suffix. I have attached a snapshot of the metadata files available under the _delta_index folder. This metadata summarizes the actual content and is what Spark uses to read only the relevant files.
Now, with the setup in place, we can jump straight into a performance comparison between the Bloom filter column and a regular column.
In my experiments, a select query filtering on the Bloom-filter-enabled column (select * from bloomFilterTesting where id2=<id2_value>) performs significantly better than the same query on a non-indexed column (select * from bloomFilterTesting where id=<id_value>).
I have consistently achieved around 2X better performance on the Bloom filter column. If you are interested, please try this yourself and share your results in the comments.
For more stats on the performance comparison with and without Bloom filters, please refer to https://docs.databricks.com/delta/optimizations/bloom-filters.html
Thank you, Happy Learning!!!