HDFS block size Vs actual file size

I know that HDFS stores data using the regular Linux file system on the data nodes. My HDFS block size is 128 MB. Let's say I have 10 GB of disk space in my Hadoop cluster; that means HDFS initially has 80 blocks of available storage.

If I create a small file of, say, 12.8 MB, the number of available HDFS blocks becomes 79. What happens if I create another 12.8 MB file? Will the number of available blocks stay at 79, or will it drop to 78? In the former case, HDFS would recalculate the available blocks after each allocation based on the free disk space, so the count would only drop to 78 after more than 128 MB of disk space has been consumed. Please clarify.

Tameshatamez asked 25/2, 2013 at 8:0 Comment(0)

The best way to know is to try it; see my results below.

But before trying, my guess is that even if your configuration can only hold 80 full blocks, you can create more than 80 non-empty files. This is because I think HDFS does not consume a full block each time you create a non-empty file. Put another way, HDFS blocks are not a storage allocation unit but a replication unit. I think the storage allocation unit of HDFS is the block of the underlying filesystem (if you use ext4 with a block size of 4 KB and you create a 1 KB file in a cluster with a replication factor of 3, you consume 3 × 4 KB = 12 KB of hard disk space).
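To make that guess concrete, here is a minimal sketch of the arithmetic (plain Java, not Hadoop code); the 1 KB file, 4 KB ext4 block, 64 MB HDFS block and replication factor of 3 are just the example numbers from above:

```java
// Sketch of the guess above: a small file's disk footprint is rounded up to the
// local filesystem block size per replica, not to the HDFS block size.
public class SmallFileFootprint {
    public static void main(String[] args) {
        long fileSize    = 1024L;               // 1 KB file
        long ext4Block   = 4L * 1024;           // 4 KB local filesystem block
        long hdfsBlock   = 64L * 1024 * 1024;   // 64 MB HDFS block
        int  replication = 3;

        // Actual usage: round the file up to the local block size, once per replica.
        long localBlocks = (fileSize + ext4Block - 1) / ext4Block;
        long actual = localBlocks * ext4Block * replication;    // 12 KB

        // What a "one full HDFS block per file" model would predict instead.
        long naive = hdfsBlock * replication;                   // 192 MB

        System.out.println("actual disk usage: " + actual + " bytes");
        System.out.println("full-block model:  " + naive + " bytes");
    }
}
```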

Enough guessing and thinking; let's try it. My lab configuration is as follows:

  • hadoop version 1.0.4
  • 4 data nodes, each with a little less than 5.0 GB of available space, ext4 block size of 4 KB
  • block size of 64 MB, default replication of 1

After starting HDFS, I have the following NameNode summary:

  • 1 files and directories, 0 blocks = 1 total
  • DFS Used: 112 KB
  • DFS Remaining: 19.82 GB

Then I run the following commands:

  • hadoop fs -mkdir /test
  • for f in $(seq 1 10); do hadoop fs -copyFromLocal ./1K_file /test/$f; done

With these results:

  • 12 files and directories, 10 blocks = 22 total
  • DFS Used: 122.15 KB
  • DFS Remaining: 19.82 GB

So the 10 files did not consume 10 × 64 MB ("DFS Remaining" did not change); "DFS Used" only grew by about 10 KB, roughly 1 KB per file.
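If you want to check this from code rather than from the NameNode summary, a sketch like the one below prints each file's real length next to its configured HDFS block size, using the standard org.apache.hadoop.fs.FileSystem API (the /test path comes from the commands above; the Hadoop configuration and classpath are assumed to be set up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListFileVsBlockSize {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (core-site.xml etc.) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        for (FileStatus status : fs.listStatus(new Path("/test"))) {
            // getLen() is the actual file size (about 1 KB here);
            // getBlockSize() is the configured HDFS block size (64 MB here).
            System.out.printf("%s  length=%d bytes  blockSize=%d bytes%n",
                    status.getPath(), status.getLen(), status.getBlockSize());
        }
        fs.close();
    }
}
```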

Lu answered 25/2, 2013 at 10:51 Comment(1)
This is what I was guessing. Now it is clearer. Thanks for the detailed explanation and the experiment! – Tameshatamez

HDFS uses only what it needs on the local file system. So a block representing a 12 MB file will take 12 MB when stored (on each datanode where it is stored). You will be able to have as many blocks as you need, assuming you have space for the data.

Knut answered 25/2, 2013 at 11:41 Comment(2)
But I think HDFS decides whether it has enough free space in terms of the number of available blocks. Hypothetically, if we have 128 MB of disk space and create a 1 MB file, then the available block count becomes 0 (since 127 MB cannot make up a complete HDFS block) and HDFS will not be able to create another 1 MB file even though there is enough disk space. Does that sound correct? – Tameshatamez
From my experience, HDFS will try to create the block and will return an error when it runs out of space on specific nodes. – Knut

The 'available blocks' count will stay at 79 (see this question). In any case, I don't think HDFS decides whether it has enough free space in terms of 'available blocks'.

Rixdollar answered 29/3, 2013 at 3:11 Comment(0)

HDFS block size and ext block size are not the same thing. The easiest way to put it is that the HDFS block size is a "replication" block size, not a "storage" block size.

For storage, it will use the same amount of space as your local file system does, because that is what it uses underneath, but it will replicate no less than one block between nodes, even if only 1 KB of that block is used.
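As a rough illustration of "replication block, not storage block", the sketch below asks for a file's block locations (the /test/1 path is just a hypothetical 1 KB file like the ones in the experiment above): a tiny file still occupies exactly one HDFS block, and that block's reported length is the file's actual size, not the configured block size:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/test/1"));  // hypothetical 1 KB file

        // One BlockLocation per HDFS block; a 1 KB file has exactly one,
        // and its length is ~1 KB, not 64 MB or 128 MB.
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("block length: " + loc.getLength() + " bytes, hosts: "
                    + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```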

Zohara answered 22/3, 2018 at 17:18 Comment(0)
