Balachandar Paulraj

Pinned

2022 : Modern Data Stack

You may have seen many posts on this subject, since the tech stack keeps changing over time. This one covers recent developments in data processing frameworks, visualization tools, ETL tools, development notebooks, data catalogs, etc. Over time, we have come across different terms like ETL, ELT, Reverse…

Datastack

5 min read

Pinned

Why Apache Arrow is faster with PySpark?

Apache Arrow defines a language-independent columnar memory format for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. Before going through the performance improvements provided by Apache Arrow, let’s quickly go through the usual process happening in PySpark and Pandas…

Arrow

2 min read

Jun 28

Can Snowpark supersede Databricks and AWS EMR?

Introduction: Ever since the introduction of Hadoop, compute and storage have been seen as two separate entities (unlike OLTP databases, where compute and storage are tightly coupled). For example, MapReduce, Spark, Flink, and Storm are frameworks for processing big data in a distributed environment, whereas HDFS, Cloud storage (S3, Azure Blob, Google…

Snowflake

3 min read

Jun 13

Deep Dive into Windowing concepts in Apache Flink

Windows play a major role in processing infinite streams and are a core part of it. Windows split the incoming stream into buckets of finite size, over which the required transformations can be applied. Though windows have many components, including trigger, evictor, allowedLateness, sideOutputLateData, etc., this post focuses on…

Windows

4 min read

Apr 16

Delta Lake Clones: Systematic Approach for Testing, Sharing data

Let’s begin with some issues faced in data engineering projects, then look at how Delta Lake clones are used, and finish by resolving those issues. What’s a clone in Delta Lake? It’s a replica of a source table at a given point in time. In other database terminology, we…

Databricks

4 min read

Mar 16

Simplify ETL Pipelines using Delta Live Tables

Consider a common data engineering pipeline scenario where raw data needs to be cleansed, transformed, or aggregated before being written to a target system. In this case, we usually create 3-4 tables to store raw, cleansed, transformed, and aggregated data respectively. In order to implement, this needs…

Databricks

6 min read

Feb 14

Expedite Spark Processing using Parquet Bloom Filter

So, what’s a Bloom filter? A Bloom filter index is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. It enables data skipping on chosen columns, particularly for fields containing arbitrary text. A Bloom filter can tell you if a key is in a set and with…

Databricks

3 min read

Dec 20, 2021

Databricks AutoLoader : Enhance ETL by simplifying Data Ingestion Process

Introduction: Before we dive deep into AutoLoader, let us focus on existing data engineering issues in the ingestion process, which fit into one of the categories below. High latency due to batch processing: Though data lands at regular intervals every few minutes, in most cases a batch…

Databricks

4 min read

Oct 12, 2021

Hadoop Admin Interview Questions and Answers : Part 1

Though the big data ecosystem has introduced many new frameworks such as Spark, Druid, Delta Lake, Hudi, Governed tables, Snowflake, etc., and has come a long way in addressing various issues since the introduction of Hadoop, many projects still have data stores set up on Hadoop. So, even nowadays there are…

Hadoop

6 min read

Oct 6, 2021

DeltaLake on EMR

Apologies if the heading is misleading, because we all know that Databricks and AWS compete in the Data Engineering and Machine Learning space. So, you might be wondering how a Databricks product can be installed on AWS EMR. …

Emr

2 min read

Balachandar Paulraj

Big Data Habitue. Current stint at PlayStation. https://www.linkedin.com/in/balachandar-paulraj-b8a26727
