Introduction To Hadoop Admin Interview Questions And Answers
So you have finally found your dream job as a Hadoop admin, but you are wondering how to crack the interview and what the likely Hadoop Admin Interview Questions are. Every interview is different, and so is the scope of every job. With this in mind, we have put together the most common Hadoop Admin Interview Questions and Answers to help you succeed in your interview.
The following Hadoop Admin Interview Questions will help you crack a Hadoop interview.
1. What is rack awareness? And why is it necessary?
Rack awareness means that HDFS knows which rack each data node sits in and follows a rack-aware algorithm to place the replicas of each data block on more than one rack. A rack holds multiple servers, and a cluster can span multiple racks. Say a Hadoop cluster is set up with 12 nodes: there could be 3 racks with 4 servers in each. All 3 racks are connected to one another, so all 12 nodes can reach each other and form one cluster. When deciding on the rack count, the important point to consider is the replication factor. If 100 GB of data flows in every day with a replication factor of 3, then 300 GB of data has to reside on the cluster. It is better to have the data replicated across racks: even if a node, or a whole rack, goes down, a replica is still available in another rack.
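Hadoop learns rack locations from a topology script that the administrator points to with the net.topology.script.file.name property in core-site.xml. The sketch below shows what such a script could look like for the 3-rack example; the subnets and rack names are made up for illustration and are not taken from a real cluster.

```shell
#!/bin/sh
# Hypothetical rack mapping: the subnets and rack names are assumptions.
rack_for() {
  case "$1" in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    10.0.3.*) echo "/rack3" ;;
    *)        echo "/default-rack" ;;
  esac
}

# Hadoop passes one or more IP addresses or hostnames as arguments
# and reads one rack path per argument from standard output.
for host in "$@"; do
  rack_for "$host"
done
```

Run by hand with two addresses, for example `sh topology.sh 10.0.1.4 10.0.3.9` (the file name is hypothetical), it prints one rack path per address.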
2. What is the default block size and how is it defined?
The default block size is 128 MB. It is defined in hdfs-site.xml and can be customized depending on the volume of the data and the level of access. Say 100 GB of data flows in per day; the data is split into blocks and stored across the cluster. How many blocks will there be? 800 blocks (1024 × 100 / 128, where 1024 converts GB to MB). There are two ways to set a custom block size:
- hadoop fs -D dfs.blocksize=134217728 -put <local file> <hdfs path> (the value is in bytes and applies to the files written by that command)
- In hdfs-site.xml, set the dfs.blocksize property to the size in bytes.
If you change the default size to 512 MB because the data volume is huge, the number of blocks generated will be 200 (1024 × 100 / 512).
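The arithmetic above can be checked with a few lines of shell. This is only a sketch: the function names are made up for illustration, and it assumes that block sizes are configured as byte values (the property is dfs.blocksize in current Hadoop releases).

```shell
# blocks_needed DATA_GB BLOCK_MB
# Prints how many blocks DATA_GB gigabytes occupy at a block size of BLOCK_MB,
# rounding up because a final partial block still counts as a block.
blocks_needed() {
  data_mb=$(( $1 * 1024 ))
  echo $(( (data_mb + $2 - 1) / $2 ))
}

# mb_to_bytes MB
# Converts a block size in megabytes to the byte value used in the configuration.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

blocks_needed 100 128   # prints 800
blocks_needed 100 512   # prints 200
mb_to_bytes 128         # prints 134217728
```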
3. How do you get a report of the HDFS file system, including disk availability and the number of active nodes?
Command: sudo -u hdfs hdfs dfsadmin -report
The report displays the following information:
- Configured Capacity – Total storage capacity configured for HDFS across the data nodes
- Present Capacity – The space actually available to HDFS for data, once the disk space taken up by non-HDFS usage (such as metadata and other local files) is set aside
- DFS Remaining – The amount of storage space still available to HDFS to store more files
- DFS Used – The storage space that has been used up by HDFS
- DFS Used% – The same figure as a percentage
- Under replicated blocks – The number of blocks with fewer replicas than the replication factor requires
- Blocks with corrupt replicas – The number of blocks that have at least one corrupted replica
- Missing blocks
- Missing blocks (with replication factor 1)
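For monitoring, a single figure can be pulled out of the report with a small filter. The sketch below assumes the summary line is printed in the form "DFS Used%: 25.00%"; the function name is made up for illustration.

```shell
# dfs_used_pct
# Reads a dfsadmin report on standard input and prints the first "DFS Used%"
# value, which is the cluster-wide summary (later matches are per data node).
dfs_used_pct() {
  awk -F': ' '/^DFS Used%/ { sub(/%$/, "", $2); print $2; exit }'
}

# Typical use on a cluster node (needs a running HDFS, so it is commented out):
#   sudo -u hdfs hdfs dfsadmin -report | dfs_used_pct
```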
4. What is Hadoop balancer and why is it necessary?
Data is not always spread across the nodes in the right proportion, so the utilization of each node may not be balanced: one node might be over-utilized while another is under-utilized. This makes processing costly, because jobs end up running mostly on the heavily used nodes. To solve this, the Hadoop balancer is used to even out how much data each node holds. Whenever the balancer runs, data is moved so that the under-utilized nodes fill up and the over-utilized nodes are freed up.
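The balancer treats a node as balanced when its utilization is within a threshold (10 percent by default) of the utilization of the whole cluster. The sketch below illustrates that check with made-up utilization figures; within_threshold is a hypothetical helper, not part of Hadoop.

```shell
# within_threshold NODE_PCT CLUSTER_PCT THRESHOLD_PCT
# Succeeds when a node's utilization is within THRESHOLD_PCT percentage
# points of the cluster-wide utilization (whole numbers only, for simplicity).
within_threshold() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  [ "$diff" -le "$3" ]
}

# Example figures (assumed): cluster at 60%, threshold of 10 points.
within_threshold 85 60 10 || echo "node at 85% is outside the threshold"
within_threshold 55 60 10 && echo "node at 55% is within the threshold"

# Running the balancer itself (needs a live cluster, so it is commented out):
#   sudo -u hdfs hdfs balancer -threshold 10
```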
5. What is the difference between Cloudera Manager and Ambari?
|Cloudera Manager||Ambari|
|Administration tool for Cloudera||Administration tool for Hortonworks|
|Monitors and manages the entire cluster and reports the usage and any issues||Monitors and manages the entire cluster and reports the usage and any issues|
|Comes with Cloudera paid service||Open source|
6. What are the main actions performed by the Hadoop admin?
- Monitor the health of the cluster – Several application pages have to be monitored whenever processes run (Job History Server, YARN Resource Manager, and Cloudera Manager or Ambari, depending on the distribution)
- Turn on security – SSL or Kerberos
- Tune performance – Hadoop balancer
- Add new data nodes as needed – Infrastructure changes and configurations
- Optionally turn on the MapReduce Job History Server – Restarting the services sometimes helps release cached memory; do this when no processes are running on the cluster
7. What is Kerberos?
Kerberos is an authentication protocol that each service must use to prove its identity before it can take part in running a process. Enabling Kerberos is recommended. In distributed computing, every node is connected and all information passes across a network, so it is good practice to authenticate securely whenever data is accessed and processed. Because Hadoop uses Kerberos, passwords are not sent across the network; instead, they are used to compute encryption keys, and encrypted messages are exchanged between the client and the server. In simple terms, Kerberos lets the nodes prove their identity to one another securely, using encryption.
Kerberos authentication is configured in core-site.xml.
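As a minimal sketch of the core-site.xml side, the two properties below switch Hadoop from its default "simple" authentication to Kerberos and turn on service-level authorization. A real deployment also needs principals and keytabs configured for each service, which are not shown here.

```xml
<!-- core-site.xml: a minimal sketch, not a complete secure-cluster setup -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- the default is "simple", i.e. no authentication -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value> <!-- enable service-level authorization checks -->
</property>
```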
8. What are the important HDFS commands?
|Command||Description|
|hdfs dfs -ls <hdfs path>||List the files in the HDFS filesystem|
|hdfs dfs -put <local file> <hdfs folder>||Copy a file from the local system to the HDFS filesystem|
|hdfs dfs -chmod 777 <hdfs file>||Give read, write, and execute permission on the file to everyone|
|hdfs dfs -get <hdfs folder/file> <local filesystem>||Copy the file from the HDFS filesystem to the local filesystem|
|hdfs dfs -cat <hdfs file>||View the file content from the HDFS filesystem|
|hdfs dfs -rm <hdfs file>||Remove the file from the HDFS filesystem. It is moved to the trash path first (like the Recycle Bin in Windows)|
|hdfs dfs -rm -skipTrash <hdfs file>||Remove the file permanently from the cluster|
|hdfs dfs -touchz <hdfs file>||Create an empty file in the HDFS filesystem|
9. How do you check the logs of a Hadoop job submitted to the cluster, and how do you terminate an already running process?
yarn logs -applicationId <application_id>: The application master generates logs in its container, and they are tagged with the application ID it generates. This is helpful for monitoring the running status of the process and reading the log information.
yarn application -kill <application_id>: If a process that is running in the cluster needs to be terminated, the kill command is used with the application ID to terminate the job in the cluster.
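Both commands need the application ID, which can be looked up with yarn application -list. The sketch below adds a small safety check before killing a job; is_application_id and kill_application are hypothetical helpers, and the ID shown in the comment is made up.

```shell
# is_application_id ID
# Succeeds when ID looks like application_<cluster timestamp>_<sequence>.
is_application_id() {
  printf '%s\n' "$1" | grep -Eq '^application_[0-9]+_[0-9]+$'
}

# kill_application ID
# Hypothetical wrapper: refuses anything that is not shaped like an
# application ID, then hands over to "yarn application -kill".
kill_application() {
  if ! is_application_id "$1"; then
    echo "not a YARN application id: $1" >&2
    return 1
  fi
  yarn application -kill "$1"
}

# On a live cluster (commented out because it needs YARN):
#   yarn application -list -appStates RUNNING
#   kill_application application_1234567890123_0001
```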
This has been a guide to the list of Hadoop Admin Interview Questions and Answers, so that candidates can crack these Hadoop Admin Interview Questions easily.