Wednesday, January 4, 2012

HDFS - Write Anatomy

Create and Write of HDFS file

Creation and writing of a file is more complicated than the read of a HDFS file. Here also NameNode(NN) never writes any data directly to DataNodes(DN). It, as per it's role, only manages the namespace and inodes. Client has to write directly to datanode. However each datanodes has to notify receipt of each block back to client and namenode. Also each datanode passes on the block to next datanode to write, that means client has to transmit block to only first datanode and rest of the block movement is handled inside the cluster. Here is the flow of data file create and write on HDFS.  



Other facts about HDFS write:
  • Interestingly client and NN does not wait till all the replicas of the block are acknowledged, it only ensure that at-least one copy of file is completely on the cluster.
  • Another fact is that client does not start transmitting data till it has one block with it, i.e. the DFSClient keep buffering data locally till one block of data or the end-of-file is reached.
  • The above illustration assumes that replication factor of files is set to three. Accordingly if the replication factor is more that or less than three, steps 6-10 would be repeated accordingly. BTW, the minimum replication factor or a file can be 1 and max can be 512.  However the default value is three.
  • Also this illustration is till before Hadoop version 0.23. I still have to look into the federated NN architecture.
  • While the lease is with a particular client, no other client can write to this file, or delete the file. However it can read the file. 
  • Write to the file can happen only at the end of the file, i.e. only append is allowed. That too is available only on ver 0.20.205 and beyond. The feature was there in earlier versions as well, however was not tested and disabled by default. 

There are some other concepts like block and replica management, which I will cover in my next post. 

3 comments:

  1. Great blog. So, technically, the client can be writing multiple blocks at one time i.e. as long as there is one copy of a block written out to the disk it wil move to the next block for writing. Is that true? Also, in situations where only one block was written and minimum copy is 2, does the client try to rewrite that block again. Also, is it true that when a mins replicas are written, ut is responsibility of Nomemanager to replicate the rest as normal recovery process. Please clarify.

    ReplyDelete
  2. The client is done as soon as it gets at least one ack for each block. Thereafter it's responsibility of replicate.

    ReplyDelete
  3. By saying "Thereafter it's responsibility of replicate", you are saying that the normal replication strategy comes into picture as if it was underreplicated. Correct?

    Then the question when does minReplicas come into picture.

    By the way, is there a place I can call you at ?

    ReplyDelete