Balachandar Paulraj

311 Followers

Pinned

Essential Considerations for Data Engineers When Selecting a NoSQL Database

In the realm of modern data engineering, the choices abound, and the stakes are high. Data engineers are the architects of the digital age, tasked with crafting the data foundations upon which businesses build their futures. …

NoSQL

3 min read

Pinned

2022 : Modern Data Stack

You might have seen multiple posts on this subject, since the tech stack keeps evolving over time. This one, however, covers recent developments in data processing frameworks, visualization tools, ETL tools, development notebooks, data catalogs, etc. Over time, we might have come across different terms like ETL, ELT, Reverse…

Datastack

5 min read

Pinned

DuckDB: Primer on the subject and fascinating highlights

Throughout our data engineering journey, we’ve come across a myriad of database management systems (DBMS). But what sets DuckDB apart from the rest? And is it worth delving into? Let’s embark on a quest for answers. What’s DuckDB? The original purpose behind DuckDB’s creation was to empower analytical query workloads and facilitate…

Duckdb

4 min read

Nov 19

Fast-Track PySpark UDF execution with Apache Arrow

Developers often create custom UDFs (user-defined functions) in their Spark code to handle specific transformations. This allows users to develop personalized code for their unique data processing requirements. Problem statement: Despite the myriad advantages that UDFs bring to Spark, the (de)serialization process in Python heavily relies on the pickle format (specifically…

Spark

4 min read

Nov 6

RAY: Distributed computing framework for ML & AI

The evolving domain of artificial intelligence and machine learning is witnessing an unprecedented demand for tools that are efficient, scalable, and user-intuitive. The quest for resilient frameworks capable of handling intricate AI workloads has reached an all-time high. …

Apache Ray

4 min read

Sep 4

Key Database Compaction Strategies Used In Distributed System

In the realm of distributed database systems, the adoption of compaction strategies plays a pivotal role in the effective management of data storage. As the data landscape continues to evolve, a multitude of innovative compaction strategies have emerged, each catering to specific database technologies and their unique demands. …

Compaction

3 min read

Apr 3

Apache Paimon: A fresh face joins the fray

Recently, a few people might have heard about Apache Paimon. Undergoing incubation at the Apache Software Foundation (ASF), Apache Paimon is being sponsored by the Apache Incubator. Apache Paimon, with its key features built around data lake storage with ACID characteristics and support for DML operations, has joined the space with other…

Paimon

3 min read

Jun 28, 2022

Can Snowpark supersede Databricks and AWS EMR?

Introduction: Ever since the introduction of Hadoop, computing power and storage have been seen as two different entities (unlike OLTP databases, where computing and storage are tightly coupled). For example, MapReduce, Spark, Flink, and Storm are frameworks for processing big data in a distributed environment, whereas HDFS, Cloud storage (S3, Azure Blob, Google…

Snowflake

3 min read

Jun 13, 2022

Deep Dive into Windowing concepts in Apache Flink

Windows play a major role in processing infinite streams and are defined as a core part of it. Windows split the incoming stream into buckets of finite size, over which the required transformations can be applied. Though windows have many components, including trigger, evictor, allowedLateness, sideOutputLateData, etc., this post focuses on…

Windows

4 min read

Apr 16, 2022

Delta Lake Clones: Systematic Approach for Testing, Sharing data

Let’s begin with some issues faced in data engineering projects, follow with the usage of Delta Lake clones, and finish by resolving those issues. What’s a clone in Delta Lake? It’s just a replica of a source table at a given point in time. In other database terminology, we…

Databricks

4 min read

Big Data Habitue. Current stint at PlayStation. https://www.linkedin.com/in/balachandar-paulraj-b8a26727