Apache Arrow defines a language-independent columnar memory format for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Before going through the performance improvements Apache Arrow provides, let's quickly walk through what normally happens between PySpark and Pandas without the Arrow format. To evaluate this, consider a scenario where we need to convert a Spark DataFrame (with 4 rows) to Pandas.

SPARK DATAFRAME TO PANDAS WITHOUT ARROW:
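For context, here is a minimal sketch of the conversion itself with the Arrow toggle, assuming a local SparkSession and a made-up 4-row DataFrame (in Spark 2.x the config key is spark.sql.execution.arrow.enabled instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# A made-up 4-row DataFrame for illustration.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c"), (4, "d")], ["id", "letter"])

# Without Arrow: rows are serialized through the JVM one element at a time.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf_without_arrow = df.toPandas()

# With Arrow: data is transferred to Pandas as columnar record batches.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf_with_arrow = df.toPandas()

With only 4 rows the difference is negligible; the per-row serialization cost is what dominates at realistic data sizes.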


I guess the introduction of Apache Superset has left many people a little confused. Hopefully, I have chosen the right topic for my blog during the peak of summer. If you are one of those evaluating different visualization tools for your project, this should help you.

Before diving deep, let us understand a bit about the latest player on the podium, i.e., Superset. Apache Superset is an open-source data visualization tool that grew out of an Airbnb hackathon project in 2015.

In this post, Superset will be evaluated against Tableau first, and then against another open-source tool, Redash.

SUPERSET vs TABLEAU

  1. Data Sources

This post covers deploying Kubeflow Pipelines on your Mac and playing with it (if you're new to it) before creating/deploying any pipelines to your production system.

Creating a Kubernetes cluster: Before creating a Kubeflow pipeline, a Kubernetes cluster must be created. If a Kubernetes cluster is already available in your environment, skip the step below. Though kind, K3s, and K3ai all support creating Kubernetes clusters, this guide shows how to create a cluster using the kind tool.

kind is a tool for creating Kubernetes clusters using Docker container nodes. Clusters created with kind are widely used for local development.
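For reference, a minimal sketch of driving kind from Python, assuming kind, kubectl, and Docker are already installed; the cluster name "kubeflow" is an arbitrary choice:

import subprocess

# Create a local Kubernetes cluster whose nodes are Docker containers.
subprocess.run(["kind", "create", "cluster", "--name", "kubeflow"], check=True)

# kind registers the kubectl context as kind-<cluster-name>; verify it is reachable.
subprocess.run(["kubectl", "cluster-info", "--context", "kind-kubeflow"], check=True)

Once kubectl can reach the cluster, the Kubeflow Pipelines manifests can be applied on top of it.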


AWS RDS

Gone are the days when a database software upgrade alone was executed as a three-month project. That is just one example of how differently databases were managed a decade ago compared to today. I can still recollect the work we did, and the time we spent, twelve years back resolving issues related to storage, backups, replication, monitoring, security, etc., just to host and run relational databases without problems.

In the same way, we can run into a lot of issues when we own everything ourselves: replication, backups, software upgrades, failure/disaster recovery, horizontal/vertical scaling, security, and monitoring for a relational database. Nowadays, most…


Why is RStudio required?

R plays a very important role in data science, supporting a wide variety of statistical and graphical techniques such as linear and nonlinear modeling, time-series analysis, classification, and clustering. The latest EMR releases ship with their own version of R.

However, R is supported only through the SparkR shell and cannot be executed through a Zeppelin notebook (for development in AWS). This is mentioned in https://docs.aws.amazon.com/emr/latest/ReleaseGuide/zeppelin-considerations.html as well.

This makes RStudio the natural choice of IDE for developing R code with Spark on an EMR cluster.

Steps to install RStudio

RStudio needs to be installed only on the master node. This can be done…


Redshift, being a Massively Parallel Processing (MPP) system that supports data warehousing/OLAP (OnLine Analytical Processing), possesses most of the characteristics shared by other warehousing systems such as Teradata. This article focuses specifically on the distribution key, which is known as the primary index in Teradata. Compared to other warehousing systems, Redshift allows more flexibility by providing options to choose a distribution style.

Why is the distribution key important? All MPP systems execute code in parallel across multiple chunks of data to achieve better performance. An improperly chosen distribution key skews data across nodes, leaving some nodes doing most of the work and increasing query execution time.
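To make this concrete, here is a hedged sketch of declaring a distribution style in Redshift DDL, executed through psycopg2; the connection parameters and the sales table are hypothetical:

import psycopg2

# DISTSTYLE KEY co-locates rows sharing a customer_id on the same slice,
# so joins and aggregations on customer_id avoid cross-node data movement.
ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id);
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute(ddl)

Redshift also offers DISTSTYLE EVEN, ALL, and AUTO for tables where no single join key dominates.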


Before we jump into the topic, let us understand the current limitations in Redshift and how AQUA addresses them.

Issues in the current Redshift architecture: With the introduction of RA3 nodes, Redshift provides options to scale and pay for compute and storage independently (the same approach as Snowflake). However, an architecture with centralized storage (S3) requires data to be moved to the compute clusters for processing. So any complex data operation spends a lot of resources transferring data between nodes.

AQUA, an introduction: AQUA (Advanced Query Accelerator) brings compute closer to storage by processing data in place in the caching layer. By doing this…


To store data efficiently, columnar databases apply a variety of compression techniques. Compression helps to 1) reduce storage requirements, 2) reduce disk I/O, and 3) improve query performance. Let us take a look at the details of the encoding schemes, along with examples.

  1. Dictionary Encoding: This is applicable in most cases where a column has the same values repeated frequently. For example, a gender column may have only two values. Instead of storing "male" or "female" for every record in the table, dictionary encoding replaces male and female with 0…
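To make the idea concrete, a toy sketch of the mapping (this illustrates the concept, not any database's internal implementation):

values = ["male", "female", "female", "male", "male"]

# Build the dictionary: each distinct value gets a small integer code.
dictionary = {}
encoded = []
for v in values:
    code = dictionary.setdefault(v, len(dictionary))
    encoded.append(code)

print(dictionary)  # {'male': 0, 'female': 1}
print(encoded)     # [0, 1, 1, 0, 0]

# Decoding reverses the mapping, so no information is lost.
reverse = {code: v for v, code in dictionary.items()}
assert [reverse[c] for c in encoded] == values

Storing one small integer per row (plus one tiny dictionary per column) is what drives the storage and I/O savings.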


If you have ever tried to copy/move millions of objects from one bucket to another, I bet it won't be as simple as it sounds. Below are some of the options everyone thinks of first.

  1. Initiating copy/move operations through the AWS console: This operation shouldn't get interrupted at any cost. In particular, if your organization uses something like automatic sign-out after idle time, re-triggering the operation manually after every timeout/failure adds considerably to the total completion time.
  2. AWS CLI cp command: This won't break as frequently as the AWS console; however, if it does break, it's…
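For comparison, the copy can also be scripted. A minimal boto3 sketch, with hypothetical bucket names and none of the retry handling or parallelism a run of this size would need (S3 Batch Operations is the managed alternative):

import boto3

s3 = boto3.client("s3")
src_bucket, dst_bucket = "my-source-bucket", "my-target-bucket"

# Paginate through every object in the source bucket and issue a
# server-side copy for each key into the target bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=src_bucket):
    for obj in page.get("Contents", []):
        s3.copy({"Bucket": src_bucket, "Key": obj["Key"]}, dst_bucket, obj["Key"])

Because the copies are server-side, no object data passes through the machine running the script; only the listing and the copy requests do.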


When it comes to Spark, most code used to be executed through spark-shell or pyspark after building the application, and everything the code was executing terminates once spark-shell/pyspark exits.

Eventually, notebooks with rich GUI capabilities were introduced for developing Spark applications before deploying them to production. However, one popular notebook, Zeppelin, doesn't kill its Spark applications even after hours of idle time. This in turn keeps consuming EMR resources in direct proportion to the Spark configs of the active Zeppelin interpreters.

These interpreters are inactive and should be killed to avoid paying a huge amount for…
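One way to reclaim those resources is Zeppelin's REST API. A hedged sketch, assuming the default EMR Zeppelin port (8890) and that the API is reachable without authentication:

import requests

zeppelin = "http://localhost:8890"  # default Zeppelin port on EMR

# Restarting an interpreter setting kills its running Spark application,
# releasing the YARN/EMR resources it was holding while idle.
settings = requests.get(f"{zeppelin}/api/interpreter/setting").json()["body"]
for setting in settings:
    requests.put(f"{zeppelin}/api/interpreter/setting/restart/{setting['id']}")

Newer Zeppelin releases can also do this automatically via the TimeoutLifecycleManager (zeppelin.interpreter.lifecyclemanager.class), which shuts interpreters down after a configurable idle threshold.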

Balachandar Paulraj

Big Data habitué. Works at PlayStation. LinkedIn: https://www.linkedin.com/in/balachandar-paulraj-b8a26727
