Not many people have heard of Apache Paimon yet. It is currently undergoing incubation at the Apache Software Foundation (ASF), sponsored by the Apache Incubator. With key features built around data lake storage with ACID characteristics and support for DML operations, Paimon joins a space that already includes Delta Lake, Apache Hudi, and Apache Iceberg. This post gives an overview of Paimon, its capabilities, and its distinguishing features.
So, What’s Paimon?
Paimon is a unified storage layer for building dynamic tables with ACID characteristics over cloud object stores. It supports high-speed data ingestion, change-data-capture tracking, efficient real-time analytics, and timely data queries.
Under the hood, Paimon stores columnar files on the filesystem or object store and uses an LSM (log-structured merge) tree structure to support large volumes of data updates and high-performance queries.
Paimon has the generic features its competitors offer, including support for updates and deletes, time travel, consistent view guarantees, schema evolution, append-only tables, and scalable metadata. This post focuses on one standout feature: the ability to choose how the merge operation behaves, which Paimon calls the merge engine.
Different modes of Merge Engine
Deduplicate is the default merge engine: among records with the same primary key, only the latest record is kept and the others are discarded. Since this is the usual way of performing a merge, it is not explained in detail here.
There are cases where producers cannot supply values for every column. This mostly happens when the source is a columnar database and, for performance or other reasons, data for all columns is not sent to downstream systems. The "partial-update" merge engine addresses this situation.
With merge-engine=partial-update, the columns of a record can be populated through multiple updates until the record is complete. For records with the same primary key, 1) value fields with non-null values are updated to the latest data, and 2) null values in an incoming record do not overwrite the values already stored.
The snapshot below shows the difference between the default merge engine and the partial-update merge engine when processing a batch of records.
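The same contrast can be sketched in Flink SQL. This is a minimal sketch, not a definitive setup: it assumes a Paimon catalog has already been created and selected in the Flink SQL client, and the table and column names (orders_dedup, orders_partial, order_id, customer, status) are hypothetical.

```sql
-- Assumptions: batch mode, and each INSERT finishes before the next one starts.
SET 'execution.runtime-mode' = 'batch';
SET 'table.dml-sync' = 'true';

-- Default engine: the latest record for a key replaces the earlier one entirely.
CREATE TABLE orders_dedup (
    order_id BIGINT,
    customer STRING,
    status   STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'deduplicate'
);

-- Partial update: null fields in a later record leave stored values untouched.
CREATE TABLE orders_partial (
    order_id BIGINT,
    customer STRING,
    status   STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'partial-update'
);

-- Two producers each know only part of the record for order 1.
INSERT INTO orders_dedup   VALUES (1, 'alice', CAST(NULL AS STRING));
INSERT INTO orders_dedup   VALUES (1, CAST(NULL AS STRING), 'SHIPPED');

INSERT INTO orders_partial VALUES (1, 'alice', CAST(NULL AS STRING));
INSERT INTO orders_partial VALUES (1, CAST(NULL AS STRING), 'SHIPPED');

-- Following the rules described above:
SELECT * FROM orders_dedup;    -- only the latest record survives: (1, NULL, 'SHIPPED')
SELECT * FROM orders_partial;  -- the two records combine: (1, 'alice', 'SHIPPED')
```

The two tables receive identical writes; only the 'merge-engine' option differs, which is what makes the comparison easy to read.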
Sometimes consumers need only aggregated results, and the original raw data (before aggregation) is not needed for any further processing. Keeping raw data in a separate table in this scenario is superfluous: it consumes additional space without any practical use.
With merge-engine=aggregation and an aggregate function chosen for each column, every value field is aggregated with the latest incoming value (using the chosen function), one record at a time, for the same primary key. Supported aggregate functions include sum, min, max, last_value, last_non_null_value, listagg, bool_and, and bool_or.
For example, the merge engine and the required aggregate functions for a table are set up as follows:
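CREATE TABLE AGG_TABLE (
    -- Column types are illustrative assumptions.
    `date` STRING,   -- backticks because DATE is a reserved keyword in Flink SQL
    price DOUBLE,
    sales BIGINT,
    PRIMARY KEY (`date`) NOT ENFORCED
) WITH (
    'merge-engine' = 'aggregation',
    'fields.price.aggregate-function' = 'max',  -- keep the highest price per date
    'fields.sales.aggregate-function' = 'sum'   -- add up sales per date
);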
The following snapshot contrasts the behavior with and without the aggregation merge engine.
The diagram speaks for itself: with the aggregation merge engine, the space and time needed to ingest data into an additional table can be avoided entirely.
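To make the behavior concrete, here is a minimal Flink SQL sketch of two writes landing on the same primary key. It assumes a Paimon catalog is already in use; the table name daily_sales_agg, the column sale_date, and the column types are hypothetical choices for illustration.

```sql
-- Assumptions: batch mode, and each INSERT finishes before the next one starts.
SET 'execution.runtime-mode' = 'batch';
SET 'table.dml-sync' = 'true';

CREATE TABLE daily_sales_agg (
    sale_date STRING,
    price     DOUBLE,
    sales     BIGINT,
    PRIMARY KEY (sale_date) NOT ENFORCED
) WITH (
    'merge-engine' = 'aggregation',
    'fields.price.aggregate-function' = 'max',
    'fields.sales.aggregate-function' = 'sum'
);

-- Two raw records for the same day.
INSERT INTO daily_sales_agg VALUES ('2023-05-01', 10.0, 5);
INSERT INTO daily_sales_agg VALUES ('2023-05-01', 12.5, 3);

-- One row per day remains: max(10.0, 12.5) = 12.5 and 5 + 3 = 8.
SELECT * FROM daily_sales_agg;  -- ('2023-05-01', 12.5, 8)
```

No separate raw table is written here; the aggregation happens as records are merged under the same key.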
Apache Paimon is presently in its incubation phase, and many changes and upgrades are likely in the coming months. New features such as the merge engine should foster healthy competition among the data lake formats.
If you’re intrigued, please browse through my other posts on Data Engineering. Feel free to leave a comment and share your thoughts on Paimon. Happy Learning!!!