Friday, 8 March 2013

What is NAS?

NAS (Network Attached Storage) is file-level storage in which data resides on one centralized machine, and all cluster members read and write over the network from that shared storage. Because every read and write must travel across the network to a single storage server, this is less efficient than HDFS, which distributes data blocks across the cluster and moves computation to the nodes where the data lives.

How many maximum JVM can run on a slave node?

One or multiple Task Instances can run on each slave node. Each task instance runs as a separate JVM process. The number of task instances is controlled by configuration; typically a high-end machine is configured to run more task instances.
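For example, on Hadoop 1.x the per-node task slot counts are set in mapred-site.xml with the two properties below; the slot values shown are illustrative for a mid-size node, not defaults:

```xml
<!-- mapred-site.xml (Hadoop 1.x): per-TaskTracker slot counts; values are illustrative -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- max concurrent map task JVMs on this node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- max concurrent reduce task JVMs on this node -->
</property>
```

A machine with more cores and memory would be given higher slot counts, which is what "a high-end machine is configured to run more task instances" means in practice.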

How many daemon processes run on a Hadoop cluster?


Hadoop comprises five separate daemons, each of which runs in its own JVM.
The following three daemons run on the master node:

NameNode - This daemon stores and maintains the metadata for HDFS.

Secondary NameNode - Performs housekeeping functions for the NameNode.

JobTracker - Manages MapReduce jobs and distributes individual tasks to the machines running the TaskTracker.

The following two daemons run on each slave node:

DataNode - Stores actual HDFS data blocks.

TaskTracker - Responsible for instantiating and monitoring individual map and reduce tasks.

What do you mean by TaskInstance?

Task instances are the actual map and reduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the entire TaskTracker. Each task instance runs in its own JVM process, and there can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new JVM process is spawned for each task.
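The spawn-a-new-JVM-per-task default can be tuned. Assuming Hadoop 1.x, the mapred-site.xml property below controls JVM reuse; it is shown with its default value of 1 (one JVM per task):

```xml
<!-- mapred-site.xml (Hadoop 1.x): how many tasks of a job one task JVM may run -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>1</value> <!-- 1 = new JVM per task (default); -1 = reuse the JVM without limit -->
</property>
```

Reusing JVMs avoids startup cost for jobs with many short tasks, at the price of tasks no longer being isolated from each other's JVM state.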

Explain the use of TaskTracker in the Hadoop cluster?


A TaskTracker is a daemon on a slave node in the cluster that accepts tasks (map, reduce, or shuffle operations) from the JobTracker. The TaskTracker also runs in its own JVM process.

Every TaskTracker is configured with a set of slots; these indicate the number of tasks it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker.

The TaskTracker monitors these task instances, capturing their output and exit codes. When a task instance finishes, successfully or not, the TaskTracker notifies the JobTracker.

The TaskTrackers also send heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also report the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.

What are the two main parts of the Hadoop framework?


Hadoop consists of two main parts:

HDFS (Hadoop Distributed File System) - a distributed file system with high throughput.

Hadoop MapReduce - a software framework for processing large data sets.
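The MapReduce part can be illustrated with a minimal in-memory sketch of the map, shuffle, and reduce phases for a word count. This is not the Hadoop Java API, just a toy model of the same data flow; the function names are my own:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop, the map and reduce phases run as task instances spread across the slave nodes, and the shuffle moves intermediate data between them over the network; HDFS supplies the input splits and stores the final output.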

How many instances of Tasktracker run on a Hadoop cluster?

One TaskTracker daemon process runs on each slave node in the Hadoop cluster.