Balachandar Paulraj

246 Followers

Pinned

2022 : Modern Data Stack

You may have seen many posts on this subject, since the tech stack keeps changing over time. This one covers recent developments in data processing frameworks, visualization tools, ETL tools, development notebooks, data catalogs, etc. Over time, we have come across different terms like ETL, ELT, Reverse…

Datastack

5 min read

Pinned

Why Apache Arrow is faster with PySpark?

Apache Arrow defines a language-independent columnar memory format for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. Before going through the performance improvements provided by Apache Arrow, let’s quickly go through the usual process happening in PySpark and Pandas…

Arrow

2 min read

Jun 28, 2022

Can Snowpark supersede Databricks and AWS EMR?

Introduction: Ever since the introduction of Hadoop, compute and storage have been seen as two separate entities (unlike OLTP databases, where compute and storage are tightly coupled). For example, MapReduce, Spark, Flink, and Storm are frameworks for processing big data in a distributed environment, whereas HDFS, cloud storage (S3, Azure Blob, Google…

Snowflake

3 min read

Jun 13, 2022

Deep Dive into Windowing concepts in Apache Flink

Windows play a major role in processing infinite streams and are a core part of it. Windows split the incoming stream into buckets of finite size, over which the required transformations can be applied. Though windows have many components, including trigger, evictor, allowedLateness, sideOutputLateData, etc., this post focuses on…
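
As a rough illustration of splitting a stream into finite buckets, here is a minimal sketch in plain Python. It is not the Flink API: the function names and sample events are hypothetical, and it assumes fixed-size, non-overlapping (tumbling) windows keyed by event timestamp, which is only one of several window types.

```python
from collections import defaultdict


def tumbling_window_start(timestamp_ms: int, size_ms: int) -> int:
    """Return the start of the fixed-size window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


def sum_per_window(events, size_ms: int) -> dict:
    """Group (timestamp_ms, value) pairs into windows and sum each bucket."""
    sums = defaultdict(int)
    for timestamp_ms, value in events:
        sums[tumbling_window_start(timestamp_ms, size_ms)] += value
    return dict(sums)


# Hypothetical events: (event time in milliseconds, value).
events = [(1_000, 5), (4_500, 3), (5_000, 2), (9_999, 1), (10_000, 7)]
result = sum_per_window(events, size_ms=5_000)
print(result)  # {0: 8, 5000: 3, 10000: 7}
```

Each event lands in exactly one bucket, and the transformation (here a sum) runs per bucket rather than over the unbounded stream.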

Windows

4 min read

Apr 16, 2022

Delta Lake Clones: Systematic Approach for Testing, Sharing data

Let’s begin with some issues faced in data engineering projects, then look at how Delta Lake clones are used, and finally see how they resolve those issues. What’s a clone in Delta Lake? It’s a replica of a source table at a given point in time. In other database terminology, we…

Databricks

4 min read

Mar 16, 2022

Simplify ETL Pipelines using Delta Live Tables

Consider a common data engineering pipeline in which raw data needs to be cleansed, transformed, or aggregated before it is written to a target system. In this case, we usually create 3-4 tables to store raw, cleansed, transformed, and aggregated data respectively. To implement this, it needs…

Databricks

6 min read

Feb 14, 2022

Expedite Spark Processing using Parquet Bloom Filter

So, what’s a Bloom filter? A Bloom filter index is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It enables data skipping on chosen columns, particularly for fields containing arbitrary text. A Bloom filter can tell you if a key is in a set and with…
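
To make the membership test concrete, here is a minimal toy Bloom filter in plain Python. It is a sketch for illustration only, not the Parquet or Databricks implementation: the class name, the bit-array size, the number of hashes, and the salted SHA-256 hashing scheme are all assumptions. It shows the general property of the data structure: a negative answer is definite, while a positive answer may be a false positive.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a fixed-size bit array plus k salted hash functions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive k positions by salting a SHA-256 hash with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical keys standing in for values of a filtered column.
bf = BloomFilter()
for user_id in ["u-100", "u-200", "u-300"]:
    bf.add(user_id)

print(bf.might_contain("u-200"))  # True: added keys are never reported absent
```

A reader that consults such a filter before scanning a file can skip the file whenever `might_contain` returns `False`, which is where the speed-up comes from.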

Databricks

3 min read

Dec 20, 2021

Databricks AutoLoader : Enhance ETL by simplifying Data Ingestion Process

Introduction: Before we dive deep into AutoLoader, let us look at the existing data engineering issues in the ingestion process, which fall into one of the categories below. High latency due to batch processing: Though data lands at regular intervals of a few minutes, in most cases a batch…

Databricks

4 min read

Oct 12, 2021

Hadoop Admin Interview Questions and Answers : Part 1

Though the big data ecosystem has introduced a lot of new frameworks such as Spark, Druid, Delta Lake, Hudi, Governed Tables, and Snowflake, and has come a long way in addressing various issues since the introduction of Hadoop, there are still a lot of projects whose data stores are set up on Hadoop. So, even nowadays there are…

Hadoop

6 min read

Oct 6, 2021

DeltaLake on EMR

Apologies if the heading is misleading, because we all know that Databricks and AWS compete in the data engineering and machine learning space. So, you might be wondering how a Databricks product can be installed on AWS EMR. …

EMR

2 min read

Big Data Habitue. Current stint at PlayStation. https://www.linkedin.com/in/balachandar-paulraj-b8a26727
