This post covers deploying Kubeflow Pipelines on your Mac so you can play with it (if you're new to it) before creating or deploying any pipelines to your production system.

Creating a Kubernetes cluster: Before creating a Kubeflow pipeline, a Kubernetes cluster must be created. If a Kubernetes cluster is already available in your environment, please skip the step below. Though kind, K3s, and K3ai all support creating a Kubernetes cluster, this guide shows how to create one using the kind tool.

kind is a tool for running local Kubernetes clusters using Docker containers as nodes. Clusters created with kind are widely used for local development and testing.
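As a rough sketch (not from the original post), the cluster creation and a quick sanity check can be scripted with the kind CLI and the official Kubernetes Python client. Docker, kind, and `pip install kubernetes` are assumed to be installed, and the cluster name "kubeflow-local" is an arbitrary choice:

```python
# Sketch: create a local kind cluster and verify it with the Kubernetes client.
import subprocess
from kubernetes import client, config

CLUSTER_NAME = "kubeflow-local"  # hypothetical name, pick your own

# kind spins up Kubernetes nodes as Docker containers.
subprocess.run(["kind", "create", "cluster", "--name", CLUSTER_NAME], check=True)

# kind writes the new context into ~/.kube/config, so the client can load it.
config.load_kube_config(context=f"kind-{CLUSTER_NAME}")
for node in client.CoreV1Api().list_node().items:
    print("node:", node.metadata.name)
```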


AWS RDS

Gone are the days when a database software upgrade alone was executed as a three-month project. That is just one example of how differently databases were managed a decade ago compared to today. I can still recollect the work we did and the time we spent twelve years back resolving issues related to storage, backups, replication, monitoring, security, etc., just to host and operate relational databases reliably.

Similarly, we can run into a lot of issues when we own everything ourselves: replication, backups, software upgrades, failure/disaster recovery, horizontal/vertical scaling, security, and monitoring for a relational database. Nowadays, most…
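To show how those operational concerns turn into simple API parameters on a managed service like RDS, here is a hedged boto3 sketch; the identifier, instance class, and credentials are placeholders, not values from the article:

```python
# Sketch: provisioning an RDS instance where backups, failover, upgrades,
# and encryption are handled by the service rather than by us.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="demo-postgres",   # hypothetical name
    Engine="postgres",
    DBInstanceClass="db.t3.micro",
    AllocatedStorage=20,                    # GiB
    MasterUsername="admin_user",
    MasterUserPassword="change-me-please",  # use Secrets Manager in practice
    MultiAZ=True,                           # managed replication/failover
    BackupRetentionPeriod=7,                # managed backups (days)
    AutoMinorVersionUpgrade=True,           # managed software upgrades
    StorageEncrypted=True,                  # security
)
```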


Why is RStudio required?

R plays a very important role in data science by supporting a wide variety of statistical and graphical techniques such as linear and nonlinear modeling, time-series analysis, classification, and clustering. The latest EMR releases ship with their own version of R.

However, R is supported only through the SparkR shell and cannot be executed through a Zeppelin notebook (for development in AWS). This is also mentioned in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/zeppelin-considerations.html.

This makes RStudio the IDE of choice for developing R code along with Spark on an EMR cluster.

Steps to install RStudio

Installing RStudio only on the master node is required. This can be done…
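One way to wire this up, sketched here under assumptions, is a bootstrap action on the EMR cluster. The S3 path and script name are hypothetical; a common pattern is for the script itself to read /mnt/var/lib/info/instance.json and exit early on non-master nodes, so RStudio ends up only on the master:

```python
# Sketch: launch an EMR cluster with a bootstrap action that installs RStudio.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="rstudio-sparkr-demo",                 # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "install-rstudio",
        "ScriptBootstrapAction": {
            # Hypothetical script location; it should install RStudio Server
            # only when it detects it is running on the master node.
            "Path": "s3://my-bootstrap-bucket/install-rstudio.sh",
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```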


Redshift, being a Massively Parallel Processing (MPP) system that supports data warehousing/OLAP (Online Analytical Processing), shares most of the characteristics found in other warehousing systems such as Teradata. This article focuses specifically on the distribution key, known as the primary index in Teradata. Compared to other warehousing systems, Redshift allows more flexibility by providing options to choose the distribution style.

Why is the distribution key important? All MPP systems execute code in parallel across multiple chunks of data to achieve better performance. An improperly chosen distribution key skews data across nodes and increases query execution time.
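For a concrete picture, here is a minimal sketch of declaring a distribution key when creating a Redshift table, submitted through psycopg2. The connection details and the sales schema are illustrative assumptions:

```python
# Sketch: create a Redshift table distributed on customer_id.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="change-me",
)

ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- rows with the same customer_id land on the same slice
SORTKEY (sale_date);
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)
```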


Before we jump into the topic, let us understand the current limitations in Redshift and how AQUA addresses them.

Issues in the current Redshift architecture: With the introduction of RA3 nodes, Redshift provides options to scale and pay for compute and storage independently (the same approach as Snowflake). However, the current architecture with centralized storage (S3) requires data to be moved to the compute clusters for processing. So any complex data operation spends a lot of resources transferring data between storage and the compute nodes.

AQUA - Introduction: AQUA (Advanced Query Accelerator) brings compute closer to storage by processing data in place in its cache layer. By doing this…


In order to store data efficiently, columnar databases apply a number of compression techniques. Compression helps to 1) reduce storage requirements, 2) reduce disk I/O, and 3) improve query performance. Let us take a look at the details of the encoding schemes along with examples.

  1. Dictionary Encoding: This is applicable in most cases where a column has the same values repeated across many rows. For example, a gender column can have only two values. Instead of storing the values “male” or “female” for every record in the table, dictionary encoding replaces male and female with 0…
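To make the idea concrete, here is a tiny Python sketch (not from the original article) of dictionary encoding applied to such a gender column:

```python
# Sketch: repeated string values replaced by small integer codes plus a lookup table.
gender_column = ["male", "female", "female", "male", "female", "male"]

dictionary = {}   # distinct value -> integer code
encoded = []
for value in gender_column:
    code = dictionary.setdefault(value, len(dictionary))
    encoded.append(code)

print(dictionary)  # {'male': 0, 'female': 1}
print(encoded)     # [0, 1, 1, 0, 1, 0]

# Decoding just reverses the lookup.
reverse = {code: value for value, code in dictionary.items()}
print([reverse[code] for code in encoded])
```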


If you have ever tried to copy or move millions of objects from one bucket to another, I bet it won't be as simple as it looks. Below are some of the options everyone could think of.

  1. Initiating the copy/move through the AWS console: This operation shouldn't get disturbed at any cost. Especially if your organization uses something like automatic sign-out based on idle time, re-triggering the operation manually after every timeout or failure adds even more time to complete it.
  2. AWS CLI cp command: This won't break as frequently as the AWS console; however, in case it does break, it's…
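A scripted alternative to the options above can be sketched with boto3: page through the source bucket and copy each object. The bucket names are placeholders, and for millions of objects you would typically add retries, parallelism, or consider S3 Batch Operations instead:

```python
# Sketch: copy every object from one bucket to another with boto3.
import boto3

SRC_BUCKET = "my-source-bucket"        # placeholder
DEST_BUCKET = "my-destination-bucket"  # placeholder

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # copy() uses multipart transfers under the hood, so it also
        # handles objects larger than 5 GB.
        s3.copy({"Bucket": SRC_BUCKET, "Key": key}, DEST_BUCKET, key)
        print("copied", key)
```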


When it comes to Spark, most code used to be executed through spark-shell or pyspark after building the application. Everything tied to that execution is terminated once spark-shell/pyspark exits.

Eventually, notebooks were introduced with rich GUI capabilities to develop Spark applications before deploying them to production. However, one popular notebook, Zeppelin, doesn't kill its Spark applications even after hours of idle time. This in turn keeps consuming the EMR resources that are mapped to the Spark configs of the active Zeppelin interpreters.

These inactive interpreters should get killed to avoid paying a huge amount for…
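One way to spot and kill such long-running Zeppelin applications, sketched here under assumptions, is the YARN ResourceManager REST API. The ResourceManager address, the two-hour threshold, and the absence of Kerberos/authentication on the endpoint are all assumptions; by default Zeppelin's Spark apps appear in YARN with "Zeppelin" in their name:

```python
# Sketch: kill Zeppelin Spark apps that have been running longer than a threshold.
import time
import requests

RM = "http://localhost:8088"        # placeholder ResourceManager address
MAX_AGE_MS = 2 * 60 * 60 * 1000     # anything running for over 2 hours

apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
now_ms = int(time.time() * 1000)

for app in (apps.get("apps") or {}).get("app", []):
    if "zeppelin" in app["name"].lower() and now_ms - app["startedTime"] > MAX_AGE_MS:
        print("killing", app["id"], app["name"])
        requests.put(
            f"{RM}/ws/v1/cluster/apps/{app['id']}/state",
            json={"state": "KILLED"},
        )
```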


NiFi, being a framework designed to automate the flow of data between systems, provides a rich set of processors to interact with various systems like AWS, Azure, Hadoop, Kafka, HBase, MongoDB, Couchbase, etc.

As the headline suggests, this topic covers ListS3 and PutS3Object, which read from and write to S3 respectively.

ListS3:

ListS3 lists all objects in a given S3 bucket. It doesn't require an incoming relationship. For every object listed from S3, it creates a FlowFile. Like the GetFile processor, it maintains state to identify only the objects created after the last iteration. …
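This is not NiFi itself, but a small boto3 sketch of the state-keeping idea ListS3 relies on: remember the newest LastModified timestamp seen and, on the next run, emit only objects created after it. The bucket name and state file are hypothetical:

```python
# Sketch: incremental S3 listing based on a saved LastModified watermark.
import json
from datetime import datetime, timezone
import boto3

BUCKET = "my-input-bucket"     # placeholder
STATE_FILE = "lists3_state.json"

try:
    with open(STATE_FILE) as f:
        last_seen = datetime.fromisoformat(json.load(f)["last_modified"])
except FileNotFoundError:
    last_seen = datetime.min.replace(tzinfo=timezone.utc)

s3 = boto3.client("s3")
newest = last_seen

for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > last_seen:
            print("new object:", obj["Key"])   # NiFi would emit a FlowFile here
            newest = max(newest, obj["LastModified"])

with open(STATE_FILE, "w") as f:
    json.dump({"last_modified": newest.isoformat()}, f)
```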


Ever thought about how a programming language is invented? Any thoughts of creating your own programming language? Do you have any idea how to get started?

Either a compiler or an interpreter must be created to introduce a new programming language. The compiler/interpreter executes the program statements written in your programming language. A compiler transforms programs into machine code that your machine's processor (like an Intel CPU) can execute to produce output (C programming). An interpreter comes with its own virtual machine that interprets the code and executes it (Python, Java).

Whether to create an interpreter or a compiler depends on your needs for the programming language. Each has its own advantages. For example, dynamic typing is supported by the Python interpreter, while the compiled code produced by a C compiler runs comparatively faster than interpreted code.
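To make the interpreter idea tangible, here is a minimal sketch (purely illustrative, not from the article) of a hand-written interpreter for a toy arithmetic language; the grammar and function names are my own assumptions:

```python
# Sketch: tokenize and interpret arithmetic expressions with a recursive-descent evaluator.
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")  # integers or single-char symbols

def tokenize(source):
    """Split the source string into integer and symbol tokens."""
    tokens = []
    for number, symbol in TOKEN_RE.findall(source):
        tokens.append(int(number) if number else symbol)
    return tokens

def parse_expr(tokens, pos=0):
    """expr := term (('+' | '-') term)*"""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        rhs, pos = parse_term(tokens, pos + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, pos

def parse_term(tokens, pos):
    """term := factor (('*' | '/') factor)*"""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        rhs, pos = parse_factor(tokens, pos + 1)
        value = value * rhs if op == "*" else value / rhs
    return value, pos

def parse_factor(tokens, pos):
    """factor := INTEGER | '(' expr ')'"""
    token = tokens[pos]
    if token == "(":
        value, pos = parse_expr(tokens, pos + 1)
        return value, pos + 1  # skip ')'
    return token, pos + 1      # already an int from tokenize()

def interpret(source):
    value, _ = parse_expr(tokenize(source))
    return value

print(interpret("3 + 4 * (2 - 1)"))  # prints 7
```

A real language would build an abstract syntax tree and add statements, variables, and error handling, but the evaluate-as-you-parse loop above is the core of what an interpreter does.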

Balachandar Paulraj

Big Data Habitué
