What is Distributed Cache in Hadoop?
Hadoop is a framework that is open-source and uses distributed storage as well as the processing of huge datasets using HDFS and MapReduce. It has NameNodes which store the metadata and DataNodes which store the actual data in HDFS. When we need to process the huge datasets, it is done by a program written by the users and then the processing gets done in parallel in the DataNodes. In the Hadoop framework, there are certain files that are needed by the MapReduce jobs frequently. If there are the number of mappers running, each time when it is required to read the files from HDFS, the latency will increase as seek time also will increase. So instead of reading the files each time when the files are needed, the files can be copied and sent to all the DataNodes. This mechanism is called Distributed Cache in Hadoop.
Working of Distributed Cache in Hadoop
- Hadoop copies the files which are specified by the options such as –files, -libjars and –archives to the HDFS when a job is launched. Then the Node Manager will copy the files from HDFS to the cache so that when a task runs, it can access the files. The files can be termed as localized as they are copied to the cache or the local disk.
- In the cache, the count of the number of tasks using each file is maintained as a reference by the Node Manager. The reference count of the files becomes 1 before the task runs. But after the task has run, the count is decreased by 1. When the count becomes 0, the file can be deleted as it is not getting used. When a node’s cache reaches its certain size, the deletion of a file is done so that the new files can be accommodated. The size of the cache can be changed in the configuration property. The size of the Distributed Cache in Hadoop is by default 10GB.
- The MapReduce becomes slower than in-process cache if it has overhead. In order to overcome this situation, the distributed cache can serialize the objects but even this has few problems. Reflection is a process used to investigate the information type during the run time which is very slow. Also, it becomes very difficult in serialization where it stores the complete cluster name, class name along with references to other instances present in the member variables.
Implementation of Hadoop in Distributed Cache
- To use the distributed cache for an application, we need to make sure that in order to distribute a file across the nodes; the file should be first available. So we need to copy the files to HDFS and also we need to check that the file is accessible through URIs which can be found by accessing the core-site.xml. Then the MapReduce job copies the cache file to all the nodes before the tasks start running on those nodes.
- So in order to implement distributed cache, we need to copy the files to HDFS and we can check if this is done or not through hdfs dfs –put /path/samplefile.jar command. Also, the Job Configuration needs to be set up for the application and this needs to be added to the driver class.
- The files that are readable only by the owner, go to private cache whereas the shared cache has the files which are world-readable. The file which gets added to the cache gets used without any constraint in all the machines in the cluster as a local file. The below API calls can be used to add the files into the cache.
The sharing of distributed cache files on the slave nodes is dependent upon if the Distributed Cache files are private or public. The private Distributed Cache files are cached in the local directory of the user which is private to the user and these files are required by the user’s jobs. In the case of the public Distributed cache files, the files get cached in the global directory. The accessing of files in case of public cache is set up in a way where they are visible to all the users. Also, the distributed cache file becomes private or public depending upon the permission on the file system.
Benefits of Distributed Cache in Hadoop
With the usage of the distributed cache, many advantageous features get added to the Hadoop framework. Below are the benefits of using distributed cache:
1. Distributed Cache in Single Point of Failure
In case of failure of a node, it will not make the complete cache failure. Because the distributed cache runs as a standalone or independent process across the various nodes. So if the cache failure occurs at one node, it does not mean that the complete cache should also fail.
2. Consistency of Data
By the usage of the Hash Algorithm, it can be determined which key-value pairs belong to which node. Also, the distributed cache in Hadoop monitors or tracks the timestamp modification done to the cache files and it reports that until the job has executed, a file should not change. So the data never gets inconsistent because of the single state of cache cluster.
3. Storage of Complex Data
The distributed cache in the Hadoop framework provides the advantage of caching the read-only files like text files, jar files, etc and then broadcast them to data nodes. Because of this, a copy of the file is stored in each data node. With the distributed cache feature, complex files like a jar, etc. are distributed and stored.
4.5 (2,725 ratings)
Distributed cache provides efficiency because the files are copied one time for each job. Also, it has the capacity to cache the archives which are un-archived on slaves. The usage of distributed cache is an added bonus and is dependent upon the developer to make the best use of this feature.
This is a guide to Distributed Cache in Hadoop. Here we discuss What is Distributed Cache in Hadoop, its work, implementation, with benefits. You can also go through our other related articles to learn more –