Hadoop Admin Interview Questions and Answers : Part 1
Although the big data ecosystem has introduced many new frameworks and platforms such as Spark, Druid, Delta Lake, Hudi, governed tables, and Snowflake, and has come a long way since the introduction of Hadoop, many projects still run their data stores on Hadoop. So even today there are ample opportunities for Hadoop admins. Below is the list of questions I have collated after reaching out to colleagues who received offers from multiple product companies as Hadoop admins. It covers HDFS, the MapReduce framework, YARN, and file system commands.
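- How to delete the checkpoint files from the trash folder? In most cases, checkpoint files are created by Spark Streaming jobs in file systems such as HDFS or S3, depending on the configuration. When trash is enabled, deleted files sit in the trash directory and keep consuming space until the trash is emptied, so neglecting them leads to space issues. They can be removed permanently with: hadoop fs -expunge -immediate -fs <hdfs/s3/other file systems path>. The -immediate option deletes everything in the trash right away, ignoring the fs.trash.interval setting.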
- There are two production clusters. How to copy data between them? This can be done with distcp. It uses MapReduce for distribution, error handling, recovery, and reporting. distcp expands the list of files and directories to be copied into input for map tasks, and each map task copies a partition of the files: hadoop distcp hdfs://namenode1:8000/<source_dir> hdfs://namenode2:8000/<target_dir>
- Is a reducer required to complete a MapReduce job? No, a reducer is not required in some cases. The number of reduce tasks can then be set to zero, so the output of the map tasks goes directly to the file system. As an example, Sqoop (a tool to transfer data between Hadoop and other data stores) operates without a reducer.
- Define Hadoop? Hadoop is a big data framework for processing large datasets in a distributed fashion on commodity hardware. Its main components include a) the Hadoop Distributed File System (HDFS) to store data and b) the MapReduce framework to process data. HDFS: a scalable and reliable distributed file system built in Java. MapReduce: operates on a master-slave architecture.
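- How to create a snapshot of data stored in HDFS? HDFS snapshots are read-only point-in-time copies of the Hadoop file system. The directory must first be made snapshottable by an administrator with hdfs dfsadmin -allowSnapshot <path>. Snapshots can then be created using hdfs dfs -createSnapshot <path> <snapshotName>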
- Define Rack Awareness? A rack is a physical group of data nodes; typically around 30–40 data nodes form a rack. The Namenode keeps track of the rack that each data node belongs to. Using this information to select closer data nodes during write operations is called Rack Awareness.
- How to get the list of failed applications in yarn? yarn application -list -appStates FAILED
- How do Map and Reduce work in the MapReduce framework? Map job: the map job breaks data sets down into tuples of key-value pairs. Map output is written to the local disk of the node where the mapper is running (not to HDFS). Reduce job: the reduce job combines those tuples into a smaller set of tuples to produce the output.
- What’s Apache Tez? Tez is built on top of Apache YARN to run a complex DAG of tasks in a minimal amount of time. It simplifies data processing by running as a single Tez job work that earlier took multiple MR jobs to complete.
- How to set the replication factor for a given file or directory? Replication in Hadoop can be specified using the setrep command. hadoop fs -setrep -w 2 /foo/bar1.txt sets a replication factor of 2 for the file bar1.txt. hadoop fs -setrep -w 5 /foo/bar/ sets a replication factor of 5 for all files within the bar directory. The -w flag makes the command wait until the replication completes, which can take a long time.
- How does YARN fit into the Hadoop framework? Before YARN came into the picture, i.e. in MapReduce v1, MapReduce managed both resource management and data processing (job scheduling/monitoring). YARN split this workload and took over resource management.
- What’s the command to get the status of queue named “dev_queue” in yarn? yarn queue -status dev_queue
- What are the different schedulers supported by YARN? 1) FIFO: as the name suggests, applications run First In First Out. 2) Capacity: more rigid, operating on a queue-based model. It defines queues with resource quotas, and a queue cannot consume resources beyond its configured maximum capacity. 3) Fair: more flexible, allowing jobs to consume unused resources in the cluster.
- Define the main components of the YARN Resource Manager? The YARN Resource Manager has two main components: 1) the Scheduler and 2) the Applications Manager. The Scheduler is responsible for allocating resources to the various running applications. It performs no monitoring or tracking of application status, and it offers no guarantees about restarting tasks that fail due to application or hardware failures. The Applications Manager is responsible for accepting job submissions and provides the service for monitoring and restarting the Application Master.
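- Given two YARN queues, a critical application is running slowly in one of the queues with fewer resources. How to move the application “project1” to an empty queue called “foo_queue”? The command takes the application ID rather than the application name: yarn application -appId <application_id> -changeQueue foo_queue. The older form, yarn application -movetoqueue <application_id> -queue foo_queue, is deprecated.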
- How to reduce the time required to launch a YARN application? The -enableFastLaunch option uploads Application Master (AM) dependencies to HDFS to make future launches faster: yarn app -enableFastLaunch
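- What’s the command to know the list of containers tagged to an application? Containers are listed per application attempt: yarn container -list <application_attempt_id>. The attempt IDs of an application can be found with yarn applicationattempt -list <application_id>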
- What’s the process to redirect project logs in YARN for further debugging? In most production setups, a scheduled Linux script downloads the logs of the YARN application and stores them in a log file. The command to extract the logs is yarn logs -applicationId <application_id>
- How to run a Java program packaged as a jar in YARN? yarn jar xyz.jar [mainClass] <required-args-to-run-program>
- In case of issues in the Namenode, how to restart it? Most Namenode issues resolve on their own, but a major failure can be fixed by a manual restart. This is done by running hadoop-daemon.sh in stop mode and then in start mode. The script is found in the sbin folder of the Hadoop installation directory: hadoop-daemon.sh [start|stop] namenode. In Hadoop 3.x this script is deprecated in favor of hdfs --daemon [start|stop] namenode
- Define the functionality of the Secondary Namenode? It periodically merges the fsimage and edits log files and keeps the edits log size within a limit. It runs on a different machine from the primary Namenode so that it does not disturb the Namenode's main functions.
- What’s the main purpose of the Namenode? The Namenode manages the metadata of the Hadoop distributed file system and is often called the heart of HDFS. It does not contain the data of the files; rather, it holds the directory tree of all the files present in HDFS on a Hadoop cluster. The Namenode uses two files for the namespace: a) fsimage, which holds the latest checkpoint of the namespace, and b) edits, a log of the changes made to the namespace since that checkpoint.
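- How to get the sizes of files and directories contained in a given directory? The du command reports the size, the disk space consumed (including all replicas), and the full path name for each entry: hadoop fs -du [-s] [-h] [-v] [-x] <path>. -s shows an aggregate summary instead of individual files, -h prints sizes in human-readable form, -v adds a header line, and -x excludes snapshots from the calculation.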
- How to know the status of all nodes? Run the yarn command below to list all nodes in the cluster along with their states: yarn node -list -all
- Define HDFS Federation? The normal HDFS architecture allows only a single namespace for the entire cluster, i.e. a single Namenode manages the entire namespace. In this case, an issue in that Namenode brings the entire cluster down. HDFS Federation addresses this by adding support for multiple Namenodes/namespaces to HDFS. The Namenodes are independent and do not require coordination with each other. The Datanodes are used as common storage for blocks by all the Namenodes.
- How to count the number of directories or files in a path within the Hadoop distributed file system? The hadoop fs -count command, along with its options, counts the number of directories, files, and bytes under the paths that match the specified file pattern. Usage: hadoop fs -count [-q] [-h] [-v] [-x] [-t [<storage type>]] [-u] [-e] [-s] <paths>
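Several of the HDFS commands from the answers above can be chained into a small housekeeping routine. What follows is a minimal sketch under stated assumptions, not a production script: the run helper, the DRY_RUN variable, and the function name hdfs_housekeeping are hypothetical names introduced for illustration; only the hadoop fs and hdfs dfs commands themselves come from the answers above. With DRY_RUN=1 (the default in this sketch) it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: run, DRY_RUN and hdfs_housekeeping are illustrative names,
# not part of Hadoop.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# hdfs_housekeeping <dir> <snapshot_name>
# Assumes <dir> was already made snapshottable by an administrator.
hdfs_housekeeping() {
    dir="$1"
    snap="$2"
    run hdfs dfs -createSnapshot "$dir" "$snap"   # point-in-time snapshot
    run hadoop fs -du -s -h "$dir"                # aggregate size, human-readable
    run hadoop fs -count -q -h "$dir"             # counts plus quota columns
    run hadoop fs -expunge -immediate             # empty the trash right away
}
```

A call such as hdfs_housekeeping /data/project1 before_cleanup (both arguments are made-up examples) prints the four commands in order; setting DRY_RUN=0 would execute them against the cluster instead.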
Other useful Hadoop file system commands include copyFromLocal, copyToLocal, cp, put, get, du, chown, chmod, mkdir, ls, mv, rm, and text.
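The scheduled log download mentioned in the YARN logs answer can be sketched by combining the failed-application listing with yarn logs. This is only a rough illustration: the function names, the LOG_DIR variable and its default path are hypothetical, and the parsing assumes that each application row printed by yarn application -list begins with an ID of the form application_<timestamp>_<sequence>, which may differ across versions.

```shell
#!/bin/sh
# Sketch only: LOG_DIR, extract_app_ids and collect_failed_logs are
# illustrative names; the output layout of `yarn application -list`
# is an assumption.
LOG_DIR="${LOG_DIR:-/tmp/yarn-failed-logs}"

# Read `yarn application -list` output on stdin and print only the
# application IDs, skipping the summary and header lines.
extract_app_ids() {
    awk '$1 ~ /^application_[0-9]+_[0-9]+$/ { print $1 }'
}

# Save the aggregated logs of every FAILED application to one file per ID.
collect_failed_logs() {
    mkdir -p "$LOG_DIR" || return 1
    yarn application -list -appStates FAILED 2>/dev/null | extract_app_ids |
    while read -r app_id; do
        # Redirect stdin so the yarn client cannot consume the ID list.
        yarn logs -applicationId "$app_id" > "$LOG_DIR/$app_id.log" 2>&1 < /dev/null
    done
}
```

Calling collect_failed_logs from a cron entry would approximate the scheduled script described earlier; keeping the ID parsing in its own function makes that assumption easy to adjust.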
Please stay tuned for my next set of interview questions on big data and its related frameworks.