Why is RStudio required?

R plays a very important role in data science by supporting a wide variety of statistical and graphical techniques such as linear and nonlinear modeling, time-series analysis, classification, and clustering. Recent EMR releases ship with their own version of R.

However, R is supported only through the SparkR shell and cannot be executed through a Zeppelin notebook (for development in AWS). This is also mentioned in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/zeppelin-considerations.html.

This leaves RStudio as the IDE for developing R programs alongside Spark on an EMR cluster.

RStudio needs to be installed only on the master node. This can be done…


Redshift, being a Massively Parallel Processing (MPP) system that supports data warehousing/OLAP (Online Analytical Processing), shares most of the characteristics of other warehousing systems such as Teradata. This article focuses specifically on the distribution key, which is known as the primary index in Teradata. Compared to other warehousing systems, Redshift offers more flexibility by letting you choose a distribution style.

Why is the distribution key important? All MPP systems execute code in parallel across multiple chunks of data to achieve better performance. A poorly chosen distribution key leaves those chunks unevenly sized, which increases query execution time.
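To make the point about uneven chunks concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `distribute` and `skew_ratio` helpers are hypothetical, the CRC32 hash stands in for whatever hashing Redshift uses internally, and the sample columns are made up.

```python
import zlib
from collections import Counter


def distribute(keys, num_slices):
    """Assign each row to a slice by hashing its distribution key.

    zlib.crc32 is a stand-in hash for illustration, not Redshift's own.
    """
    counts = Counter({slice_id: 0 for slice_id in range(num_slices)})
    for key in keys:
        slice_id = zlib.crc32(str(key).encode("utf-8")) % num_slices
        counts[slice_id] += 1
    return counts


def skew_ratio(counts):
    """Largest slice divided by the average slice size (1.0 means even)."""
    sizes = list(counts.values())
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical table with 10,000 rows and two candidate distribution keys.
order_ids = range(10_000)                      # high cardinality
countries = ["US"] * 9_000 + ["IN"] * 1_000    # low cardinality, skewed

even = distribute(order_ids, num_slices=4)
skewed = distribute(countries, num_slices=4)

print("order_id slices:", dict(even), "skew:", round(skew_ratio(even), 2))
print("country slices: ", dict(skewed), "skew:", round(skew_ratio(skewed), 2))
```

With the low-cardinality key, at least 9,000 of the 10,000 rows land on a single slice, so one worker does most of the work while the others sit idle. A high-cardinality key should spread the rows far more evenly.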


Before we jump into the topic, let us understand the current limitations of Redshift and how AQUA addresses them.

Issues in the current Redshift architecture: With the introduction of RA3 nodes, Redshift lets you scale and pay for compute and storage independently (the same approach as Snowflake). However, current data architectures with centralized storage (S3) require data to be moved to the compute clusters for processing. So, any complex data operation needs a lot of resources to transfer data between nodes.

AQUA - Introduction: AQUA brings compute closer to storage by processing data in place in cache memory. By doing this…


To store data efficiently, columnar databases apply a variety of compression techniques. Compression helps to 1) reduce storage requirements, 2) reduce disk I/O, and 3) improve query performance. Let us take a look at the details of encoding schemes along with examples.

  1. Dictionary Encoding: This applies in most cases where a column has the same values repeated many times. For example, a gender column can have only two values. Instead of storing “male” or “female” for every record in the table, dictionary encoding replaces male and female with 0…
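
The idea behind dictionary encoding can be sketched in a few lines of Python. This is a simplified illustration, not how any particular database implements it: the helper names are hypothetical, and assigning codes in order of first appearance is an assumption made for the example.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        if value not in dictionary:
            # Assumption: codes are handed out in order of first appearance.
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[code] for code in codes]


column = ["male", "female", "female", "male", "male"]
dictionary, codes = dictionary_encode(column)

print(dictionary)  # {'male': 0, 'female': 1}
print(codes)       # [0, 1, 1, 0, 0]
```

The column now stores one small integer per row, plus a single copy of each distinct string in the dictionary, and decoding restores the original values exactly.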

If you have ever tried to copy or move millions of objects from one bucket to another, I bet it wasn’t as simple as it sounds. Below are some of the options everyone would think of.

  1. Initiating copy/move operations through the AWS console: This operation must not be interrupted at any cost. In particular, if your organization enforces automatic sign-out after idle time, the operation takes longer to complete because it has to be triggered again manually after every timeout/failure.
  2. AWS CLI cp command: This won’t break as frequently as the AWS console; however, if it does break, it’s…

When it comes to Spark, most code used to be executed through spark-shell or pyspark after building the application. Everything being executed is terminated once spark-shell/pyspark exits.

Eventually, notebooks were introduced with rich GUI capabilities for developing Spark applications before deploying them to production. However, one of the popular notebooks, Zeppelin, doesn’t kill its Spark applications even after hours of idle time. This consumes the EMR resources that are directly mapped to the Spark configs of the active Zeppelin interpreters.

These are inactive interpreters and should be killed to avoid paying a huge amount for…


NiFi, a framework designed to automate the flow of data between systems, provides a rich set of processors to interact with various systems such as AWS, Azure, Hadoop, Kafka, HBase, MongoDB, and Couchbase.

As the headline suggests, this topic covers ListS3 and PutS3Object, which are used to read from and write to S3 respectively.

ListS3 lists all objects in a given S3 bucket. It doesn’t require an incoming relationship. For every object listed in S3, it creates a FlowFile. Like the ListFile processor, it maintains state to identify the objects created after the last iteration. …
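
The state-keeping behaviour described above can be pictured with a small Python sketch. It is a toy model and not NiFi's actual implementation: `S3Object`, `StatefulLister`, and the integer timestamps are all hypothetical, and objects that share the same timestamp as the last run are ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Object:
    key: str
    last_modified: int  # simplified timestamp, e.g. epoch seconds


class StatefulLister:
    """Remembers the newest timestamp seen so each run returns only new objects."""

    def __init__(self):
        self.last_seen = -1  # the stored state: nothing listed yet

    def list_new(self, bucket_listing):
        new_objects = [o for o in bucket_listing if o.last_modified > self.last_seen]
        if new_objects:
            self.last_seen = max(o.last_modified for o in new_objects)
        # One entry per new object, oldest first (one FlowFile each in NiFi terms).
        return [o.key for o in sorted(new_objects, key=lambda o: o.last_modified)]


lister = StatefulLister()
bucket = [S3Object("a.csv", 100), S3Object("b.csv", 200)]

first_run = lister.list_new(bucket)    # both objects are new
bucket.append(S3Object("c.csv", 300))
second_run = lister.list_new(bucket)   # only the object added since the last run

print(first_run, second_run)
```

Because the lister keeps its own state, it needs no incoming data to decide what to emit, which matches the point that the processor works without an incoming relationship.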


Ever wondered how a programming language gets invented? Ever thought of creating your own programming language? Do you have any idea how to get started?

Either a compiler or an interpreter must be created to introduce a new programming language. The compiler or interpreter is what gets the statements written in your programming language to run. A compiler transforms programs into machine code that your machine’s processor (such as Intel) can execute to produce output (C). An interpreter comes with its own virtual machine to interpret and execute the code (Python, Java).

Whether to create an interpreter or a compiler depends on what you need from the programming language. Each has its own advantages. For example, the Python interpreter supports dynamic typing, while compiled C code runs comparatively faster than interpreted code.
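
To give a feel for what “an interpreter with its own virtual machine” means, here is a deliberately tiny Python sketch of a stack-based virtual machine. The instruction names (`PUSH`, `ADD`, `MUL`) and the `run` function are invented for this illustration and do not correspond to any real language’s bytecode.

```python
def run(program):
    """Interpret a list of instructions on a simple stack machine."""
    stack = []
    for instruction in program:
        op = instruction[0]
        if op == "PUSH":
            stack.append(instruction[1])
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack.pop()


# Hypothetical program computing (2 + 3) * 4.
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
result = run(program)
print(result)  # 20
```

The loop inspects every instruction at run time before acting on it. That extra step is the flexibility, and also the overhead, that the comparison between interpreted and compiled code refers to.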


Just assume that you are running a pawn shop where you accept only gold biscuits in exchange for money. Your shop is full of containers for safekeeping gold biscuits, but each container can hold only one gold biscuit. Your job is to collect gold biscuits from customers, place each one in a container, and pay the customer the equivalent amount.

As a gold biscuit is very heavy, you can carry only one at a time. Moving a gold biscuit from the customer to a container takes approximately 10 seconds. …


Let us learn the concepts of linear regression by working with a single input. Like most learning resources, let me take the square footage of a house as the input (denoted as x). Based on the input, I’ll predict the price of the house (denoted as y).
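
As a rough sketch of this setup, the Python snippet below fits a straight line y = slope * x + intercept to a handful of points. Two assumptions are mine and not part of the article: the fit uses ordinary least squares, and the sample square-footage and price values are made up so that the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: price is exactly 200 per square foot.
square_feet = [1000, 1500, 2000, 2500]          # x
prices = [200_000, 300_000, 400_000, 500_000]   # y

slope, intercept = fit_line(square_feet, prices)
predicted_price = slope * 1800 + intercept       # predict for an 1800 sq ft house

print(slope, intercept, predicted_price)
```

Because the made-up points lie exactly on a line, the fit recovers a slope of 200 and an intercept of 0, and the 1800 square foot house is priced at 360,000. Real data is noisy, so the fitted line only approximates the points.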

Need For Algorithm:

Balachandar Paulraj

It’s data that rules me
