HDFS Architecture Guide – Apache Hadoop

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
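The block layout described above can be illustrated with a small sketch. This is not HDFS code; the function name `block_sizes` and the 128 MiB block size are assumptions chosen for the example.

```python
def block_sizes(file_size: int, block_size: int) -> list[int]:
    """Return the size in bytes of each block of a file.

    Every block has block_size bytes except possibly the last one.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # only the last block may be shorter
    return sizes


MiB = 1024 * 1024
# A 300 MiB file with an assumed 128 MiB block size.
sizes = block_sizes(300 * MiB, 128 * MiB)
print([s // MiB for s in sizes])  # [128, 128, 44]
```

With a replication factor of three, each of these three blocks would be stored on three DataNodes, for nine block replicas in total.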

The NameNode makes all decisions regarding block replication. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receiving a Heartbeat implies that the DataNode is working properly. A Blockreport contains a list of all the blocks in a DataNode.
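As a rough illustration of how Blockreports can drive replication decisions, the sketch below merges per-DataNode block lists into a block-to-locations map and lists blocks with too few replicas. The data structures and the names `build_block_map` and `under_replicated` are hypothetical and do not mirror the NameNode's internals.

```python
from collections import defaultdict


def build_block_map(blockreports: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each block ID to the set of DataNodes that reported it."""
    block_map: dict[str, set[str]] = defaultdict(set)
    for datanode, blocks in blockreports.items():
        for block_id in blocks:
            block_map[block_id].add(datanode)
    return dict(block_map)


def under_replicated(block_map: dict[str, set[str]], replication: int) -> list[str]:
    """Return the sorted IDs of blocks with fewer than `replication` replicas.

    Blocks that no DataNode reported are not in block_map and are not covered.
    """
    return sorted(b for b, nodes in block_map.items() if len(nodes) < replication)


# Hypothetical Blockreports from three DataNodes.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1"],
    "dn3": ["blk_1", "blk_2"],
}
block_map = build_block_map(reports)
print(under_replicated(block_map, replication=3))  # ['blk_2']
```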


Replica Placement: The First Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs a lot of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation of the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.

Large HDFS instances run on a cluster of computers that are typically spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, the network bandwidth between machines in the same rack is greater than the network bandwidth between machines in different racks.

The NameNode determines the rack identifier to which each DataNode belongs through the process described in Hadoop Rack Awareness. A simple but non-optimal policy is to place each replica on a different rack. This prevents data loss when an entire rack fails and allows reads to use bandwidth from multiple racks. This policy evenly distributes replicas across the cluster, which makes it easy to balance load on component failure. However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.

For the common case, when the replication factor is three, the HDFS placement policy is to place one replica on a node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The probability of rack failure is much lower than that of node failure, so this policy does not affect data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed on only two distinct racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks. One third of the replicas are on one node, two thirds of the replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.
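The three-replica policy described above can be sketched as follows. This is a simplified model, not the HDFS implementation: the topology is a hypothetical dictionary from rack name to node names, the function name `place_replicas` is invented for the example, and real placement also weighs factors such as free space and node load that are ignored here.

```python
import random


def place_replicas(local_rack, topology, rng=None):
    """Choose three (rack, node) targets for one block.

    One replica goes to a node in the writer's local rack; the other two
    go to two different nodes in a single remote rack.
    """
    rng = rng or random.Random()
    if not topology.get(local_rack):
        raise ValueError("local rack is unknown or has no nodes")
    # A remote rack is usable only if it can hold two replicas on distinct nodes.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    first = rng.choice(topology[local_rack])
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [(local_rack, first), (remote_rack, second), (remote_rack, third)]


# Hypothetical three-rack cluster.
topology = {
    "rackA": ["a1", "a2"],
    "rackB": ["b1", "b2", "b3"],
    "rackC": ["c1"],
}
targets = place_replicas("rackA", topology, random.Random(0))
print(targets)
```

Only one of the three transfers crosses to a new rack pair per replica rack, which is where the reduction in inter-rack write traffic comes from: the block reaches two racks instead of three.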

The current default replica placement policy described here is a work in progress.

Replica Selection

To minimize overall bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, a replica residing in the local data center is preferred over any remote replica.
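The preference order for reads can be expressed as a simple ranking. The sketch below is illustrative only: the `Location` type and the rank values are assumptions, and treating a replica on the reader's own node as the closest is an added assumption that follows from the "closest to the reader" rule rather than from an explicit statement here.

```python
from typing import NamedTuple


class Location(NamedTuple):
    data_center: str
    rack: str
    node: str


def distance(reader: Location, replica: Location) -> int:
    """Rank a replica by closeness to the reader (lower is closer)."""
    if reader == replica:
        return 0  # same node (assumed to be the closest case)
    if (reader.data_center, reader.rack) == (replica.data_center, replica.rack):
        return 1  # same rack
    if reader.data_center == replica.data_center:
        return 2  # same data center, different rack
    return 3      # remote data center


def select_replica(reader: Location, replicas: list[Location]) -> Location:
    """Return the replica closest to the reader."""
    return min(replicas, key=lambda replica: distance(reader, replica))


reader = Location("dc1", "rackA", "n1")
replicas = [
    Location("dc2", "rackX", "n9"),
    Location("dc1", "rackB", "n5"),
    Location("dc1", "rackA", "n2"),
]
print(select_replica(reader, replicas).node)  # n2
```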



Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur while the NameNode is in Safemode. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks has checked in with the NameNode (plus an additional 30 seconds), the NameNode exits Safemode. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas and replicates these blocks to other DataNodes.
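The exit condition can be sketched as a threshold check followed by a waiting period. The function and parameter names below are hypothetical, and the sketch assumes the caller tracks how long the threshold has been met; it is a model of the rule described above, not the NameNode's code.

```python
def safe_fraction(reported_replicas: dict[str, int], min_replicas: int) -> float:
    """Fraction of blocks with at least min_replicas replicas reported."""
    if not reported_replicas:
        return 1.0  # assumption: an empty namespace counts as fully safe
    safe = sum(1 for count in reported_replicas.values() if count >= min_replicas)
    return safe / len(reported_replicas)


def may_leave_safemode(
    reported_replicas: dict[str, int],
    min_replicas: int,
    threshold: float,
    seconds_since_threshold_met: float,
    extension_seconds: float = 30.0,
) -> bool:
    """True once enough blocks are safe and the extra wait has elapsed."""
    reached = safe_fraction(reported_replicas, min_replicas) >= threshold
    return reached and seconds_since_threshold_met >= extension_seconds


# Hypothetical replica counts gathered from Blockreports.
counts = {"blk_1": 3, "blk_2": 1, "blk_3": 0, "blk_4": 2}
print(safe_fraction(counts, min_replicas=1))  # 0.75
print(may_leave_safemode(counts, 1, threshold=0.75, seconds_since_threshold_met=30.0))  # True
```

In this example `blk_3` has no reported replica, so a stricter threshold such as 0.999 would keep the NameNode in Safemode until another DataNode reports that block.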
