Hadoop Cluster Interview Questions and Answers
This article aims to help Big Data aspirants answer Hadoop Cluster interview questions related to setting up a Big Data environment in an organization. The questions cover setting up Data Nodes and the Name Node and sizing the servers that host the Big Data daemons.
So, if you have finally found your dream job in Hadoop Cluster administration but are wondering how to crack the Hadoop Cluster interview and what the probable Hadoop Cluster interview questions could be, keep in mind that every interview is different and the scope of every job is different too. With this in mind, we have designed the most common Hadoop Cluster Interview Questions and Answers to help you succeed in your interview.
Some of the most important Hadoop Cluster Interview Questions that are frequently asked in an interview are as follows:
Top 10 Hadoop Cluster Interview Questions and Answers
The top 10 Hadoop Cluster interview questions and answers are listed below.
1. What are the major Hadoop components in a Hadoop cluster?
Answer:
Hadoop is a framework for processing big data; in other words, it is a platform on which huge amounts of data can be processed on commodity servers. Hadoop is a combination of many components. The major components in a Hadoop environment are listed below; a quick way to inspect them on a running cluster is shown after the list.
Name Node: The master node. It keeps track of all the Data Nodes and of where data blocks are stored, in the form of metadata.
Secondary Name Node: It periodically merges the Name Node's edit log with the file system image (checkpointing) so the Name Node can restart quickly; note that it is not an automatic standby that takes over when the primary Name Node goes down.
HDFS (Hadoop Distributed File System): The distributed storage layer of the Hadoop cluster.
Data Nodes: Data Nodes are the slave nodes. The actual data is stored on these Slave Nodes for processing.
YARN (Yet Another Resource Negotiator): The resource management framework on which applications that process vast amounts of data (such as MapReduce jobs) are scheduled and run. It allocates CPU and memory to jobs and allows many jobs to run in parallel on the Hadoop cluster.
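As a quick illustration (not part of the standard answer), these components can be inspected from the command line on a running cluster. The commands below are standard HDFS and YARN client commands; the exact output depends on your installation.
hdfs dfsadmin -report   # the Name Node's view of total capacity and every live Data Node
yarn node -list         # the Node Managers currently registered with the Resource Manager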
2. How to plan data storage in Hadoop Cluster?
Answer:
Storage is based on the formula {Storage = Daily data ingestion * Replication factor}.
If the Hadoop cluster receives 120 TB of data daily and we keep the default replication factor of 3, the daily data storage requirement would be:
Storage requirement = 120 TB (daily data ingestion) * 3 (default replication) => 360 TB
As a result, we need to provision at least 360 TB of cluster storage for the daily data ingestion requirement.
Storage also depends on the data retention requirement. If data must be kept for 2 years in the same cluster, we need to arrange Data Nodes as per that retention requirement (see the sketch below).
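As a rough sizing sketch, using the example figures above and an assumed 2-year retention of 730 days, the numbers can be checked with simple shell arithmetic:
DAILY_INGEST_TB=120                                  # daily data ingestion in TB
REPLICATION=3                                        # default HDFS replication factor
RETENTION_DAYS=730                                   # assumed retention: 2 years
DAILY_STORAGE_TB=$((DAILY_INGEST_TB * REPLICATION))  # 360 TB of raw storage per ingested day
TOTAL_TB=$((DAILY_STORAGE_TB * RETENTION_DAYS))      # total raw capacity for the retention window
echo "Per day: ${DAILY_STORAGE_TB} TB, for ${RETENTION_DAYS} days: ${TOTAL_TB} TB"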
3. How to calculate the number of Data Nodes required?
Answer:
We need to calculate the number of Data Nodes required for the Hadoop cluster. Suppose we have servers with a JBOD of 10 disks, and each disk has 4 TB of storage, so each server has 40 TB of storage. The Hadoop cluster receives 120 TB of data per day, which becomes 360 TB after applying the default replication factor of 3.
Number of Data Nodes = Total daily storage requirement (after replication) / Data Node capacity
Number of Data Nodes = 360 / 40 => 9 Data Nodes
Hence, for the Hadoop cluster to absorb 120 TB of data per day with the above configuration, set up at least 9 Data Nodes.
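The same calculation as a minimal shell sketch, assuming the 40 TB-per-node figure from the example and rounding up to whole nodes:
TOTAL_TB=360                                    # 120 TB/day x replication factor 3
NODE_TB=40                                      # 10 disks x 4 TB JBOD per Data Node
echo $(( (TOTAL_TB + NODE_TB - 1) / NODE_TB ))  # ceiling division, prints 9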
4. How to change the replication factor in a Hadoop cluster?
Answer:
Edit the hdfs-site.xml file. Its default path is under the conf/ folder of the Hadoop installation directory. Change or add the following property in hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Block Replication</description>
</property>
It’s not mandatory to have a replication factor of 3. It can also be set to 1, and a replication factor of 5 works in a Hadoop cluster as well. Keeping the default value of 3 makes the cluster efficient while requiring minimal hardware.
Increasing the replication factor increases the hardware requirement, because the data storage gets multiplied by the replication factor.
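Note that the hdfs-site.xml setting only applies to files written after the change. To change the replication factor of files that already exist, the standard hdfs dfs -setrep command can be used (the paths below are just examples):
hdfs dfs -setrep -w 5 /data/critical   # raise replication of an existing directory to 5 and wait for completion
hdfs dfs -setrep 2 /data/archive       # lower replication of an existing directory to 2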
5. What is the default data block size in Hadoop and how to modify it?
Answer:
The block size determines how HDFS divides data into blocks and stores them on different Data Nodes.
By default, the block size is 128 MB (in Apache Hadoop 2.x and later), and we can modify this default.
Edit the hdfs-site.xml file. Its default path is under the conf/ folder of the Hadoop installation directory. Change or add the following property in hdfs-site.xml:
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<description>Block size</description>
</property>
The block size in bytes is 134,217,728, i.e. 128 MB. You can also specify the size with a case-insensitive suffix such as k (kilo-), m (mega-), g (giga-) or t (tera-) to set the block size in KB, MB, GB, TB, etc.
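Like the replication factor, the configured block size only affects files written after the change. For a one-off override, the block size can also be passed per command with the generic -D option (the file and path names here are illustrative, and the suffix form assumes a Hadoop version that accepts size suffixes as noted above):
hdfs dfs -D dfs.blocksize=256m -put big_dataset.csv /data/   # write this file with 256 MB blocks
hdfs fsck /data/big_dataset.csv -files -blocks               # verify how the file was split into blocks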
6. How long should a Hadoop cluster keep a deleted HDFS file in the trash directory?
Answer:
“fs.trash.interval” is the parameter that specifies how long HDFS keeps a deleted file in the trash so that it can still be retrieved.
The interval can be defined in minutes only. For a 2-day retrieval window, that is 2 * 24 * 60 = 2,880 minutes, so we specify the property as follows.
Edit the core-site.xml file and add or modify the following property:
<property>
<name>fs.trash.interval</name>
<value>2880</value>
</property>
By default, the retrieval interval is 0, which means trash is disabled, but the Hadoop administrator can add or modify the above property as per requirement.
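For illustration, a deleted file normally lands under the user's .Trash directory and can be moved back until the interval expires. The exact trash path layout can vary with configuration, and the file name below is just an example:
hdfs dfs -rm /data/report.csv                                               # moved to trash, recoverable for fs.trash.interval minutes
hdfs dfs -mv /user/$USER/.Trash/Current/data/report.csv /data/report.csv   # restore the file from trash
hdfs dfs -rm -skipTrash /data/report.csv                                    # delete immediately, bypassing trash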
7. What are the basic commands to Start and Stop Hadoop daemons?
Answer:
All the commands to start and stop the daemons are stored in the sbin/ folder of the Hadoop installation directory.
./sbin/stop-all.sh – To stop all the daemons at once.
./sbin/hadoop-daemon.sh start namenode – To start the Name Node.
./sbin/hadoop-daemon.sh start datanode – To start a Data Node.
./sbin/yarn-daemon.sh start resourcemanager – To start the Resource Manager.
./sbin/yarn-daemon.sh start nodemanager – To start a Node Manager.
./sbin/mr-jobhistory-daemon.sh start historyserver – To start the Job History Server.
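There are also matching start scripts and a quick way to verify which daemons are running; on typical Hadoop 2.x installations these live in the same sbin/ folder:
./sbin/start-dfs.sh    # start the Name Node, Data Nodes and Secondary Name Node
./sbin/start-yarn.sh   # start the Resource Manager and Node Managers
jps                    # confirm which daemon JVMs are running on this host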
8. What is the property to define memory allocation for tasks managed by YARN?
Answer:
The property “yarn.nodemanager.resource.memory-mb” needs to be added or modified to change the memory allocation for all the containers (tasks) managed by YARN.
It specifies the amount of RAM in MB. A common rule of thumb is to give YARN about 70% of a Data Node's physical RAM; a Data Node with 96 GB of RAM would therefore hand roughly 67 GB (68,608 MB) to YARN, and the rest of the RAM is used by the Data Node daemon for “Non-YARN work”.
Edit the yarn-site.xml file and add or modify the following property:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>68608</value>
</property>
The default value of yarn.nodemanager.resource.memory-mb is 8,192 MB (8 GB). If the Data Nodes have a large amount of RAM, we must raise this value to roughly 70% of it; otherwise, much of the memory will be wasted.
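A small sketch of the 70% rule of thumb, assuming a 96 GB Data Node as in the example above (the result, about 68,800 MB, is in the same range as the 68,608 MB value used in the snippet):
TOTAL_MB=$((96 * 1024))           # physical RAM of the Data Node in MB
YARN_MB=$((TOTAL_MB * 70 / 100))  # roughly 70% of RAM handed to YARN containers
echo "yarn.nodemanager.resource.memory-mb = ${YARN_MB}"   # prints 68812 for a 96 GB node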
9. What are the recommendations for Sizing the Name Node?
Answer:
The following specifications are recommended for setting up the Master Node at the initial stage:
Processors: A single CPU with 6-8 cores is enough.
RAM: For metadata and job coordination, the server should have at least 24-96 GB of RAM.
Storage: Since no HDFS data is stored on the Master Node, 1-2 TB of local storage is sufficient.
Since future workloads are difficult to predict, design your cluster with hardware (CPU, RAM, and storage) that can easily be upgraded over time.
10. What are the default ports in the Hadoop cluster?
Answer:
Daemon Name | Default Port
Name Node | 50070
Data Node | 50075
Secondary Name Node | 50090
Backup/Checkpoint Node | 50105
Job Tracker | 50030
Task Tracker | 50060
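These are the HTTP/web UI ports used by Hadoop 1.x and 2.x daemons (several were renumbered in Hadoop 3.x). A quick way to confirm a daemon is up is to probe its port; “namenode-host” below is a placeholder for your actual host name:
curl -s http://namenode-host:50070/ | head -n 5   # the Name Node web UI should respond on its default port
netstat -tlnp | grep 50070                        # or check locally that the Name Node port is listening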
Recommended Articles
This has been a guide to the list of Hadoop Cluster Interview Questions and Answers. Here we have listed the top 10 interview questions and answers so that the jobseeker can crack the interview with ease. You may also look at the following articles to learn more –