Hadoop interview questions

What are vendor-specific Hadoop distributions?

  • Vendor-specific distributions of Hadoop include Cloudera, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere BigInsights, and Hortonworks (now part of Cloudera).

What is MapReduce?

  • The Hadoop MapReduce framework is used to process large data sets in parallel across a Hadoop cluster.
  • Data analysis follows a two-step process: map, then reduce.
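
The two-step process can be illustrated with a minimal, single-machine sketch. It is not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the grouping step that Hadoop performs between map and reduce is simulated here with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data big cluster", "data node"]))
```

On a real cluster the map calls run in parallel on different nodes, and the framework moves the grouped pairs to the reducers.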

What is NameNode in Hadoop?

  • The NameNode is the node where Hadoop stores the location information for every file in HDFS.
  • In other words, the NameNode is the centerpiece of an HDFS file system.
  • It keeps a record of all the files in the file system and tracks where the file data is stored across the machines in the cluster.
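
As a rough illustration of the bookkeeping described above, the sketch below models a NameNode as two lookup tables. The class ToyNameNode, its methods, and the block and node names are all hypothetical; the real NameNode keeps far richer metadata.

```python
class ToyNameNode:
    """Toy model: the NameNode holds metadata only, never the file contents."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, blocks):
        """Register a file; blocks is a list of (block_id, [datanode names])."""
        self.file_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = set(nodes)

    def locate(self, path):
        """Return, for each block of the file, the DataNodes that hold it."""
        return [(block_id, sorted(self.block_locations[block_id]))
                for block_id in self.file_blocks[path]]


namenode = ToyNameNode()
namenode.add_file("/logs/app.log", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])
print(namenode.locate("/logs/app.log"))
```

A client would ask the NameNode for these locations and then read the blocks directly from the DataNodes.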

What is a heartbeat in HDFS?

  • A heartbeat is a signal sent periodically from a DataNode to the NameNode.
  • Heartbeats are also sent from a TaskTracker to the JobTracker.
  • If the NameNode or JobTracker stops receiving the signal, it assumes there is a problem with the DataNode or TaskTracker.
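
The detection logic can be sketched as a small monitor that remembers when each node last reported in. This is an illustration only: the class HeartbeatMonitor, the 30-second timeout, and the explicit "now" arguments are assumptions chosen to keep the example deterministic, not Hadoop's actual settings.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports nodes that went silent."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of last heartbeat (seconds)

    def record_heartbeat(self, node, now):
        """Called whenever a heartbeat arrives from a node."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=30)  # illustrative timeout
monitor.record_heartbeat("dn1", now=0)
monitor.record_heartbeat("dn2", now=0)
monitor.record_heartbeat("dn1", now=25)  # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=40))
```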

What happens when a data node fails?

  • The JobTracker and NameNode detect the failure.
  • All tasks that were running on the failed node are rescheduled on other nodes.
  • The NameNode replicates the user’s data to another node.
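
The re-replication step can be sketched as follows. The function re_replicate and its alphabetical choice of target nodes are assumptions made for illustration; real HDFS block placement also takes rack layout and node load into account.

```python
def re_replicate(block_locations, failed_node, live_nodes, replication=3):
    """Return a new block -> set-of-nodes map with replicas restored after a failure."""
    updated = {}
    for block, nodes in block_locations.items():
        remaining = set(nodes) - {failed_node}
        # Candidate targets: live nodes that do not already hold this block.
        candidates = sorted(set(live_nodes) - remaining - {failed_node})
        while len(remaining) < replication and candidates:
            remaining.add(candidates.pop(0))
        updated[block] = remaining
    return updated


blocks = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn2", "dn3", "dn4"}}
print(re_replicate(blocks, failed_node="dn2", live_nodes=["dn1", "dn3", "dn4", "dn5"]))
```

Blocks that never lived on the failed node keep their original locations, so only under-replicated blocks are copied.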

What is WebDAV in Hadoop?

  • WebDAV is a set of extensions to HTTP that supports editing and updating files.
  • On most operating systems, WebDAV shares can be mounted as filesystems, so HDFS can be accessed as a standard filesystem by exposing it over WebDAV.

What are Hadoop’s configuration files?

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml
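
Each of these files is an XML document made of name/value property entries. The sketch below parses a small core-site.xml-style snippet with Python's standard library. The helper read_properties is hypothetical, and the value hdfs://localhost:9000 is only an example for a single-node setup.

```python
import xml.etree.ElementTree as ET

# Example content in the style of core-site.xml; the value is illustrative.
CORE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return the <property> entries of a Hadoop-style config as a dict."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}


print(read_properties(CORE_SITE_XML))
```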

What is the difference between a regular FileSystem and HDFS?

  • Regular FileSystem: Data is maintained on a single system.
  • If the machine crashes, data recovery is difficult because fault tolerance is low. Seek time is also higher, so processing the data takes longer.
  • HDFS: Data is distributed and maintained on multiple systems.
  • If a DataNode crashes, data can still be recovered from other nodes in the cluster.
  • The time taken to read data is comparatively higher, because data is read from local disks and then coordinated across multiple systems.

How can we restart NameNode?

  • Stop the NameNode by running the daemon script in ./sbin/ with the arguments stop namenode.
  • Start the NameNode again by running the same script in ./sbin/ with the arguments start namenode.

How can we restart all daemons in Hadoop?

  • Stop all the daemons with the stop script in ./sbin/, then start them again with the start script in ./sbin/.

How does an HDFS cluster behave when we store a lot of small files?

  • Storing many small files on HDFS generates a lot of metadata.
  • Holding this metadata in the NameNode’s RAM is a challenge, because each file, block, or directory takes about 150 bytes of metadata.
  • With enough small files, the cumulative size of the metadata becomes too large for the NameNode’s memory.
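
A quick back-of-the-envelope calculation shows how the 150-byte figure adds up. The function below is a hypothetical helper that counts one metadata object per file, per block, and per directory. The file counts are made-up inputs, and the result is only as accurate as the 150-byte approximation.

```python
BYTES_PER_OBJECT = 150  # approximate size of one file, block, or directory entry


def namenode_metadata_bytes(num_files, blocks_per_file=1, num_directories=0):
    """Estimate NameNode memory: one object per file, block, and directory."""
    objects = num_files + num_files * blocks_per_file + num_directories
    return objects * BYTES_PER_OBJECT


# Ten million small files, each fitting in a single block.
small_files = namenode_metadata_bytes(10_000_000)
# Fewer, larger files: ten thousand files of eight blocks each.
large_files = namenode_metadata_bytes(10_000, blocks_per_file=8)

print(f"small files: {small_files / 1e9:.1f} GB, large files: {large_files / 1e6:.1f} MB")
```

Under these assumptions, ten million single-block files need about 3 GB of NameNode memory, whereas ten thousand eight-block files need about 13.5 MB.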

Explain MapReduce distributed cache?

  • A distributed cache is a mechanism by which data read from disk can be cached and made available to all worker nodes.
  • When a MapReduce program runs, it picks up the data from the distributed cache instead of reading it from disk every time, which speeds up processing.
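
A common use of the cache is a map-side join against a small lookup table. The sketch below imitates that pattern in plain Python: the class JoinMapper, its setup and map methods, and the record formats are hypothetical stand-ins, not the Hadoop API.

```python
class JoinMapper:
    """Loads a small cached lookup table once, then reuses it for every record."""

    def setup(self, cached_lines):
        # Runs once per task: parse "code,country" lines into an in-memory dict.
        self.countries = dict(line.split(",", 1) for line in cached_lines)

    def map(self, record):
        # Runs per record: "user,code" is joined against the cached table.
        user, code = record.split(",")
        return user, self.countries.get(code, "unknown")


mapper = JoinMapper()
mapper.setup(["IN,India", "US,United States"])  # stands in for the cached file
print([mapper.map(record) for record in ["alice,IN", "bob,US", "carol,XX"]])
```

The lookup table is read once in setup, so each call to map is a dictionary lookup rather than a disk read.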

Give a note on Hadoop Streaming?

  • The Hadoop distribution includes a generic application programming interface for writing Map and Reduce jobs in any desired programming language, such as Python, Perl, or Ruby.
  • This is referred to as Hadoop Streaming. Users can create and run jobs with any shell script or executable as the Mapper or Reducer. Apache Spark is a separate, newer processing engine that is often used as an alternative to MapReduce; it is not a Hadoop Streaming tool.
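
In Hadoop Streaming, the Mapper and Reducer exchange data as text lines, with key and value separated by a tab, and the Reducer receives its input sorted by key. The sketch below shows that contract with two generator functions and a local stand-in for the sort. The log-line format "<date> <LEVEL> <message>" and the function names are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'LEVEL<TAB>1' for each log line of the assumed form '<date> <LEVEL> <message>'."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield f"{parts[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


logs = [
    "2024-01-01 ERROR disk full",
    "2024-01-01 INFO started",
    "2024-01-02 ERROR timeout",
]
# sorted() stands in for the sort the framework performs between the two steps.
for line in reducer(sorted(mapper(logs))):
    print(line)
```

In an actual job, each function would live in its own script that reads sys.stdin and prints to standard output, and the scripts would be passed to the streaming jar through its -mapper and -reducer options.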

What is Big Data?

  • Big data is a large volume of structured, unstructured, or semi-structured data that has huge potential for mining but is too large to be processed using traditional database systems.
  • Big data is characterized by high velocity, volume, and variety, which require cost-effective and innovative methods of information processing to draw meaningful business insights.

What is a TaskTracker in Hadoop?

  • A TaskTracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker.
  • It also sends heartbeat messages to the JobTracker every few seconds to confirm that the TaskTracker is still alive.

What are the core concepts of the Hadoop framework?

  • HDFS: The Hadoop Distributed File System (HDFS) is a Java-based, reliable file system that stores vast datasets as blocks. It is built on a master-slave architecture.
  • MapReduce: MapReduce is a programming model for processing large datasets. It has two parts: ‘map’ breaks the dataset into tuples (key/value pairs), and ‘reduce’ takes the map output and combines those tuples into a smaller set of tuples.

What are Active and Passive NameNodes?

  • A high-availability Hadoop system usually contains two NameNodes: an Active NameNode and a Passive NameNode.
  • The NameNode that runs the Hadoop cluster is called the Active NameNode, and the standby NameNode that keeps a synchronized copy of the Active NameNode’s data is the Passive NameNode.
  • The purpose of having two NameNodes is that if the Active NameNode crashes, the Passive NameNode can take over. A NameNode is therefore always available in the cluster, and the NameNode is no longer a single point of failure.
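
The takeover can be sketched with a toy pair of NameNodes that both receive every metadata edit. The class ToyHAPair, the node names nn1 and nn2, and the way edits are shared are assumptions for illustration. Real clusters rely on a shared edit log and a failover controller, which are not modelled here.

```python
class ToyHAPair:
    """Toy active/standby pair: both nodes see every edit, one serves clients."""

    def __init__(self):
        self.state = {"nn1": "active", "nn2": "standby"}
        self.edits = {"nn1": [], "nn2": []}  # stand-in for each node's namespace

    def active(self):
        """Name of the NameNode currently serving the cluster."""
        return next(name for name, state in self.state.items() if state == "active")

    def record_edit(self, edit):
        # The standby applies the same edits so that it stays in sync.
        for log in self.edits.values():
            log.append(edit)

    def fail_over(self):
        """Mark the active node as failed and promote the standby."""
        failed = self.active()
        standby = next(name for name, state in self.state.items() if state == "standby")
        self.state[failed] = "failed"
        self.state[standby] = "active"
        return standby


pair = ToyHAPair()
pair.record_edit("mkdir /data")
print(pair.active(), "->", pair.fail_over(), pair.edits["nn2"])
```

Because the standby has already applied the same edits, it can serve the namespace as soon as it is promoted.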