The HDFS architecture has three key components:
1. NameNode
2. DataNodes
3. Secondary NameNode
The NameNode is the master node: it controls the whole cluster and the division of files into blocks. The typical block size is 64 MB (the default in Hadoop 1.x; later releases default to 128 MB), and it can be changed through the dfs.block.size property (dfs.blocksize in Hadoop 2 and later) in <HADOOP_INSTALL>/conf/hdfs-site.xml. The number of copies of each block kept in the cluster is configurable in the same file, through the dfs.replication property. The NameNode keeps two data structures:
1. Namespace (filename to blocks mapping)
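2. Inodes (block-to-DataNode mapping; note that HDFS itself calls this structure the blocks map and uses "inode" for the file and directory entries of the namespace)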
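To make the block size, the replication factor and the two structures above more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names split_into_blocks, add_file, namespace and block_locations, the DataNode names and the replication factor of 3 are not Hadoop APIs or values taken from a real cluster, and the round-robin placement is a simplification rather than HDFS's actual placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size used in this post
REPLICATION = 3                 # assumed replication factor for the example


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)      # the last block only holds the remaining bytes
    return sizes


def add_file(namespace, block_locations, path, file_size, datanodes,
             replication=REPLICATION):
    """Record a new file in both (hypothetical) NameNode structures."""
    block_ids = []
    for index, _size in enumerate(split_into_blocks(file_size)):
        block_id = f"{path}#blk_{index}"
        block_ids.append(block_id)
        # Naive round-robin placement, only there to fill the second mapping.
        start = index % len(datanodes)
        block_locations[block_id] = {
            datanodes[(start + r) % len(datanodes)] for r in range(replication)
        }
    namespace[path] = block_ids


namespace = {}          # filename -> list of block ids
block_locations = {}    # block id -> set of DataNodes holding a replica
add_file(namespace, block_locations, "/logs/app.log",
         150 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(namespace["/logs/app.log"])
print(block_locations["/logs/app.log#blk_2"])
```

With these numbers a 150 MB file becomes two 64 MB blocks plus one 22 MB block, and each block is listed on three of the four made-up DataNodes.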
Both the namespace and the block-to-DataNode mapping are always held in memory so that they can be referenced quickly. However, only the namespace is persisted to disk; on a restart of the cluster, the NameNode rebuilds the block-to-DataNode mapping from the block reports it receives from each DataNode. The NameNode never initiates an interaction with a DataNode. Instead, DataNodes keep sending heartbeats to the NameNode, and the NameNode uses its responses to hand each DataNode the tasks it should perform. Communication between the NameNode and the DataNodes happens over RPC.
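The interaction described above can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the Hadoop implementation: the class NameNodeSketch, its method names and the command tuples are all hypothetical, and real HDFS does this over RPC rather than with direct method calls. The point is only that the NameNode fills its in-memory mapping from block reports and hands out work solely as a reply to a heartbeat.

```python
class NameNodeSketch:
    """Toy model of the heartbeat/response pattern; not the Hadoop code."""

    def __init__(self):
        self.block_locations = {}   # block id -> set of DataNode ids (memory only)
        self.pending = {}           # DataNode id -> commands waiting for it

    def block_report(self, datanode, block_ids):
        """A DataNode reports which blocks it stores; refresh the mapping."""
        for holders in self.block_locations.values():
            holders.discard(datanode)       # forget what we believed before
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def queue_command(self, datanode, command):
        """Remember work for a DataNode until it next contacts us."""
        self.pending.setdefault(datanode, []).append(command)

    def heartbeat(self, datanode):
        """Reply to a heartbeat with whatever commands are queued."""
        return self.pending.pop(datanode, [])


nn = NameNodeSketch()               # fresh start: the mapping is empty
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])
nn.queue_command("dn2", ("delete", "blk_1"))
print(nn.block_locations)
print(nn.heartbeat("dn2"))          # the queued command travels in the reply
print(nn.heartbeat("dn2"))          # nothing left to do on the next heartbeat
```

Note that the sketch has no method through which the NameNode calls a DataNode; commands only leave it as the return value of heartbeat.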
A DataNode acts as a slave in the cluster and only stores file blocks. It has no knowledge of the block-to-file mapping; it simply stores blocks and acts on the commands it receives from the NameNode. Those commands include replicating under-replicated blocks and deleting over-replicated ones. To keep participating in the cluster, a DataNode must send heartbeats to the NameNode at regular intervals, and it also sends a block report periodically. DataNodes talk to each other directly to move data blocks.
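As a rough sketch of how such commands could be derived, the Python below compares each block's replica count with a target and treats a DataNode as part of the cluster only while its heartbeats are recent. The function names, the command tuples, the target of 3 replicas and the 30 second timeout are assumptions chosen for the example, not values or APIs taken from Hadoop.

```python
def replication_commands(block_locations, target=3):
    """Return (action, block_id, count) for blocks whose replica count is off."""
    commands = []
    for block_id, holders in sorted(block_locations.items()):
        diff = target - len(holders)
        if diff > 0:
            commands.append(("replicate", block_id, diff))   # under-replicated
        elif diff < 0:
            commands.append(("delete", block_id, -diff))     # over-replicated
    return commands


def live_datanodes(last_heartbeat, now, timeout_seconds=30):
    """Return the DataNodes whose last heartbeat is within the timeout."""
    return {dn for dn, seen in last_heartbeat.items()
            if now - seen <= timeout_seconds}


locations = {
    "blk_1": {"dn1", "dn2"},                  # one replica short
    "blk_2": {"dn1", "dn2", "dn3", "dn4"},    # one replica too many
    "blk_3": {"dn1", "dn2", "dn3"},           # exactly on target
}
print(replication_commands(locations))
print(live_datanodes({"dn1": 100, "dn2": 60}, now=105))
```

Here blk_1 gets a replicate command, blk_2 a delete command and blk_3 is left alone, while dn2 drops out of the live set because its last heartbeat is 45 seconds old.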
Secondary NameNode is something of a misnomer: the name suggests a hot standby, but it is really a checkpointing node. It periodically collects the changes the NameNode has made to the namespace (the edit log) and merges them into the saved namespace image, which takes that overhead off the NameNode.
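The merging step can be pictured with the small Python sketch below. It is only an illustration under my own assumptions: the edit format (tuples such as ("create", path, blocks)) and the functions apply_edit and checkpoint are hypothetical and do not reflect the on-disk formats Hadoop actually uses.

```python
def apply_edit(image, edit):
    """Apply one logged namespace change to an image (filename -> block ids)."""
    op, path, *args = edit
    if op == "create":
        image[path] = list(args[0])
    elif op == "delete":
        image.pop(path, None)
    elif op == "rename":
        image[args[0]] = image.pop(path)
    else:
        raise ValueError(f"unknown edit operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    merged = {path: list(blocks) for path, blocks in image.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/a": ["blk_1"]}
edits = [
    ("create", "/b", ["blk_2", "blk_3"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image, new_log = checkpoint(image, edits)
print(new_image)
print(new_log)
```

After the merge the image already contains every logged change, so the log can start again from empty, which is the work the Secondary NameNode takes off the NameNode.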
I will cover the anatomy of writes and reads in my next post.