Cost Reduction in EMR Clusters Hosting Zeppelin Notebooks

With Spark, most code used to be executed through spark-shell or pyspark after the application was built. Everything the session was running terminates once spark-shell or pyspark exits.

Eventually, notebooks were introduced, offering rich GUI capabilities for developing Spark applications before deploying them to production. However, one popular notebook, Zeppelin, doesn’t kill its Spark applications even after hours of idle time. These applications keep holding the EMR resources allocated by the Spark configuration of each active Zeppelin interpreter.

These idle interpreters should be killed to avoid paying a large AWS bill every month. Databricks provides an option to exit notebooks and terminate the cluster after a given timeout if the cluster is idle. Since there is no such option in AWS, we need to develop our own way to close the Spark application or Zeppelin notebook. Doing so releases the resources consumed in EMR.

There are many ways to achieve this, but an easy one is to combine YARN command-line scripts with crontab. The following command lists the running applications:

yarn application -list -appStates RUNNING

The application IDs in the output of this command can be fed to a shell script that kills each application with the command below.

yarn application -kill <application-id>
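
The two commands above can be combined roughly as follows. This is a minimal sketch, not a hardened script: the function names extract_app_ids and kill_running_apps are made up for illustration, and it assumes that each data row in the list output starts with an ID of the form application_<timestamp>_<sequence>. It kills every RUNNING application, not only the Zeppelin ones.

```shell
#!/usr/bin/env bash
# Sketch: kill every YARN application that is currently RUNNING.
# Function names are hypothetical; adapt them to your environment.

# Read `yarn application -list` output on stdin and print only the
# application IDs. Header and summary lines do not start with an ID
# of the form application_<digits>_<digits>, so they are skipped.
extract_app_ids() {
  awk '$1 ~ /^application_[0-9]+_[0-9]+$/ { print $1 }'
}

# List RUNNING applications and kill them one by one.
kill_running_apps() {
  yarn application -list -appStates RUNNING 2>/dev/null | extract_app_ids |
    while read -r app_id; do
      echo "Killing ${app_id}"
      # </dev/null keeps the kill command from consuming the ID list on stdin.
      yarn application -kill "${app_id}" </dev/null
    done
}

# Only act when the yarn CLI is available (for example on the EMR master node).
if command -v yarn >/dev/null 2>&1; then
  kill_running_apps
fi
```

Keeping the ID extraction in its own function makes it easy to check against a saved copy of the list output before letting the script kill anything.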

The shell script built around these yarn commands then needs to be scheduled with crontab to run outside business hours and on weekends.
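
As an illustration of the scheduling step, the sketch below prints two crontab entries for such a script. The schedule (20:00 on weekdays, 08:00 on Saturday and Sunday), the helper name cron_entries, and both paths are assumptions chosen for the example, not values from a real setup.

```shell
#!/usr/bin/env bash
# Sketch: generate crontab entries that run a cleanup script outside
# business hours. Schedule, helper name, and paths are assumptions.

# Print one entry for weekday evenings and one for weekend mornings.
# $1 = path to the cleanup script, $2 = log file for its output.
cron_entries() {
  local script="$1" log="$2"
  printf '0 20 * * 1-5 %s >> %s 2>&1\n' "$script" "$log"
  printf '0 8 * * 6,0 %s >> %s 2>&1\n' "$script" "$log"
}

# Preview the entries (hypothetical paths).
cron_entries /home/hadoop/kill_yarn_apps.sh /tmp/kill_yarn_apps.log

# To install them, append to the current user's crontab:
# ( crontab -l 2>/dev/null; cron_entries /home/hadoop/kill_yarn_apps.sh /tmp/kill_yarn_apps.log ) | crontab -
```

Printing the entries first, and installing them only with the commented-out line, gives a chance to review the schedule before it takes effect.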

This approach has a caveat, though. It can’t be run during working hours, because it also kills the Zeppelin interpreters of Spark applications that are actively running.