Introduction to HDFS Commands
Big data is a term for datasets that are so large or complex that conventional data processing software cannot cope with them. Hadoop is an open-source, Java-based programming framework that combines the processing and storage of extremely large data sets in a distributed computing environment. Hadoop is maintained by the Apache Software Foundation.
Features of HDFS
- HDFS runs on a master/slave architecture.
- HDFS stores user data in files.
- HDFS holds a huge set of directories and files, which are stored in a hierarchical format.
- A file is internally split into smaller blocks, and these blocks are stored across a set of Datanodes (the fsck sketch below lists the blocks of a single file).
- Namenode and Datanode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux OS.
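To see how HDFS has split a particular file into blocks, you can run fsck against it. A minimal sketch, using a hypothetical file path:

```
# List the blocks (and the Datanodes holding them) for one file;
# the path below is a hypothetical example
$ hdfs fsck /user/cloudera/sample.txt -files -blocks -locations
```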
Namenode
- The Namenode maintains the file system namespace.
- The Namenode also logs every change to the file system metadata and keeps an in-memory image of the complete file system namespace and file Blockmap.
- Checkpointing is done periodically, so the namespace can easily be recovered to the state just before a crash (see the sketch below).
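As a rough illustration of checkpointing, the commands below force the NameNode to save its namespace and then inspect the resulting fsimage with the Offline Image Viewer. Note that saveNamespace requires the NameNode to be in safe mode, and the fsimage path and name shown are hypothetical examples:

```
# Enter safe mode, force a namespace checkpoint, then leave safe mode
$ sudo -u hdfs hdfs dfsadmin -safemode enter
$ sudo -u hdfs hdfs dfsadmin -saveNamespace
$ sudo -u hdfs hdfs dfsadmin -safemode leave

# Dump a checkpointed fsimage to XML with the Offline Image Viewer
# (the fsimage path and file name below are hypothetical)
$ hdfs oiv -p XML -i /dfs/nn/current/fsimage_0000000000000001234 -o fsimage.xml
```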
Datanode
- A Datanode stores HDFS data in files within its local file system.
- To signal that it is alive, the Datanode sends a periodic heartbeat to the Namenode.
- A block report is generated for every 10th heartbeat received.
- The data stored on these Datanodes is replicated (the dfsadmin sketch below shows per-node status).
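You can observe the Datanodes that are currently heartbeating into the Namenode with the dfsadmin report; a minimal sketch:

```
# Print cluster capacity plus per-Datanode status, including the
# time of each node's last contact (heartbeat) with the NameNode
$ sudo -u hdfs hdfs dfsadmin -report
```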
Data Replication
- A file is stored as a sequence of blocks, with a default block size of 128 MB.
- All blocks in a file except the last one are the same size.
- The Namenode receives a heartbeat from every Datanode in the cluster.
- The BlockReport contains a list of all the blocks on a Datanode (the sketch below shows how to check a file's block size and replication factor).
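A small sketch of block size and replication in practice; the local file and HDFS paths are hypothetical examples:

```
# Upload a file with an explicit 128 MB block size (134217728 bytes);
# the local file and HDFS path are hypothetical examples
$ hadoop fs -D dfs.blocksize=134217728 -put data/sample.txt /user/cloudera/

# Print the block size and replication factor HDFS recorded for it
$ hadoop fs -stat "blocksize: %o, replication: %r" /user/cloudera/sample.txt
```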
Job tracker: The JobTracker talks to the NameNode to determine the location of the data, and then locates the most suitable TaskTracker nodes to execute the tasks based on data locality.
Task tracker: A TaskTracker is a node in the cluster that accepts tasks – Map, Reduce and Shuffle operations – from a JobTracker.
Secondary Name node (or) checkpoint node: Gets the EditLog from the NameNode at regular intervals and applies it to its copy of the FS image, then copies the merged FS image back for the NameNode to use on its next restart. The Secondary NameNode's whole purpose is to provide a checkpoint in HDFS.
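To see how often checkpoints happen on a given cluster, you can query the relevant configuration keys, and an administrator can roll the edit log manually; a minimal sketch:

```
# Query the configured checkpoint interval (seconds) and transaction limit
$ hdfs getconf -confKey dfs.namenode.checkpoint.period
$ hdfs getconf -confKey dfs.namenode.checkpoint.txns

# Manually roll the NameNode's edit log (admin-only)
$ sudo -u hdfs hdfs dfsadmin -rollEdits
```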
YARN
- YARN has a central Resource Manager component that manages cluster resources and allocates them to every application.
- Here the Resource Manager is the master that arbitrates all the resources associated with the cluster. The Resource Manager is composed of two components, an Application Manager and a Scheduler, which together manage the jobs on the cluster. Another component, called the Node Manager (NM), is responsible for managing the users' jobs and workflow on a given node (see the CLI sketch after this list).
- The Standby NameNode holds an exact replica of the active NameNode's data. It acts as a slave and maintains enough state to provide a fast failover, if necessary.
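A couple of YARN CLI calls that show the Resource Manager and Node Managers at work; a minimal sketch:

```
# List the Node Managers registered with the Resource Manager
$ yarn node -list

# List the applications the Resource Manager is currently tracking
$ yarn application -list
```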
Basic HDFS Commands
Given below are the basic commands:

| Sr.No | HDFS Command Property | HDFS Command |
|---|---|---|
| 1 | Print the Hadoop version | $ hadoop version |
| 2 | List the contents of the root directory in HDFS | $ hadoop fs -ls / |
| 3 | Report the amount of space used and available on currently mounted filesystems | $ hadoop fs -df hdfs:/ |
| 4 | Run the HDFS balancer, which re-balances data across the DataNodes, moving blocks from over-utilized to under-utilized nodes | $ hadoop balancer |
| 5 | Print help for the fs commands | $ hadoop fs -help |
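A short session stringing these basic commands together might look like this (output omitted; -h prints sizes in human-readable form):

```
$ hadoop version
$ hadoop fs -ls /
$ hadoop fs -df -h hdfs:/
$ hadoop fs -help ls
```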
Intermediate HDFS Commands
Given below are the intermediate commands:

| Sr.No | HDFS Command Property | HDFS Command |
|---|---|---|
| 6 | Create a directory at the specified HDFS location | $ hadoop fs -mkdir /user/Cloudera/ |
| 7 | Copy data from the local file system into HDFS | $ hadoop fs -put data/sample.txt /user/training/Hadoop |
| 8 | Show the space occupied by a particular directory in HDFS | $ hadoop fs -du -s -h /user/Cloudera/ |
| 9 | Remove a directory in HDFS | $ hadoop fs -rm -r /user/cloudera/pigjobs/ |
| 10 | Remove all the files in the given directory, bypassing the trash | $ hadoop fs -rm -skipTrash hadoop/retail/* |
| 11 | Empty the trash | $ hadoop fs -expunge |
| 12 | Copy data from local to HDFS and back | $ hadoop fs -copyFromLocal /home/cloudera/sample/ /user/cloudera/flume/ <br> $ hadoop fs -copyToLocal /user/cloudera/pigjobs/* /home/cloudera/oozie/ |
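A hypothetical round trip with the intermediate commands, with all paths chosen only as examples: create a directory, upload a file, check its size, copy it back, and clean up:

```
$ hadoop fs -mkdir -p /user/cloudera/demo
$ hadoop fs -put data/sample.txt /user/cloudera/demo/
$ hadoop fs -du -s -h /user/cloudera/demo
$ hadoop fs -copyToLocal /user/cloudera/demo/sample.txt /tmp/
$ hadoop fs -rm -r /user/cloudera/demo
```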
Advanced HDFS Commands
Given below are the advanced commands:

| Sr.No | HDFS Command Property | HDFS Command |
|---|---|---|
| 13 | Change file permissions | $ sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/flume/ |
| 14 | Set the data replication factor for a file | $ hadoop fs -setrep -w 5 /user/cloudera/pigjobs/ |
| 15 | Count the number of directories, files, and bytes under a given HDFS path | $ hadoop fs -count hdfs:/ |
| 16 | Make the NameNode leave safe mode | $ sudo -u hdfs hdfs dfsadmin -safemode leave |
| 17 | Format a NameNode | $ hadoop namenode -format |
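Before forcing the NameNode out of safe mode, it is worth checking its current state, and after a -setrep call you can verify the new factor. A minimal sketch (the part file name is a hypothetical example):

```
# Check the NameNode's current safe mode state without changing it
$ sudo -u hdfs hdfs dfsadmin -safemode get

# Verify the replication factor on a file after -setrep
# (the part file name below is a hypothetical example)
$ hadoop fs -stat %r /user/cloudera/pigjobs/part-00000
```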
Tips and Tricks for Using HDFS Commands
1) We can achieve faster recovery when the cluster node count is higher.
2) As the amount of storage per node grows, the recovery time grows with it.
3) Namenode hardware has to be very reliable.
4) Sophisticated monitoring can be achieved through Apache Ambari.
5) System starvation can be decreased by increasing the reducer count, as shown in the sketch below.
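For tip 5, the reducer count is usually passed as a job property at submission time; a minimal sketch, assuming a driver that parses generic options via ToolRunner (the jar, class, and paths are hypothetical):

```
# Ask for 10 reducers when submitting a (hypothetical) job;
# -D only takes effect if the driver uses ToolRunner/GenericOptionsParser
$ hadoop jar wordcount.jar WordCount -D mapreduce.job.reduces=10 input/ output/
```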
Recommended Articles
This has been a guide to HDFS Commands. Here we discussed the features of HDFS along with its basic, intermediate, and advanced commands, plus useful tips and tricks. You can also go through our other suggested articles to learn more –