Namenode file quantity limit
Does anyone know how many bytes each file occupies in the namenode of HDFS? I want to estimate how many files a single namenode with 32 GB of memory can store.

Douglass answered 26/5, 2012 at 7:36 Comment(1)
The Namenode is essentially capped by a Java HashMap's capacity. If you can't scale your Namenode vertically beyond 32 GB, then this will be your bottleneck. – Foretoken

Each file, directory, or block occupies about 150 bytes in the namenode's memory. [1] So a cluster whose namenode has 32 GB of RAM can support at most about 38 million files, assuming the namenode is the bottleneck. (Each file also takes up a block, so each file effectively takes 300 bytes. I am also assuming 3x replication, so each file takes up 900 bytes.)

In practice, however, the number will be much lower, because not all of the 32 GB will be available to the namenode for keeping the mapping. You can raise the limit by allocating more heap space to the namenode on that machine.

Replication also affects this, to a lesser degree. Each additional replica adds about 16 bytes to the memory requirement. [2]
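For concreteness, here is a minimal Python sketch of the estimate above, under this answer's assumptions: ~150 bytes per namespace object, one block per file, and the full per-file cost multiplied by the replication factor (an answer below argues replication should not be counted this way):

```python
# Back-of-the-envelope namenode capacity, per this answer's model.
BYTES_PER_OBJECT = 150        # approximate namenode memory per file inode or block
REPLICATION = 3               # assumed dfs.replication

bytes_per_file = 2 * BYTES_PER_OBJECT * REPLICATION   # (inode + block) * 3 = 900
heap_bytes = 32 * 1024**3                             # 32 GiB of namenode heap

print(f"~{heap_bytes // bytes_per_file / 1e6:.0f} million files")  # ~38 million
```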

[1] https://blog.cloudera.com/small-files-big-foils-addressing-the-associated-metadata-and-application-challenges/

[2] http://search-hadoop.com/c/HDFS:/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfo.java%7C%7CBlockInfo

Paramatta answered 26/5, 2012 at 10:54 Comment(3)
Can dfs.replication affect this number? – Douglass
I know this is old, but I have been attempting to work out the byte math above and am coming up short. Could you explain how you got 26M? I get 26M by computing (32,000,000,000 / 150) / 8, but I don't understand why I had to divide by 8 if one file/block/directory = 150 bytes, not bits. What crucial point am I missing? – Sardine
I assumed a replication factor of 3. Each file would use up a block, so each file would use up 300 * 3 = 900 bytes. But that gives about 38 million files, so I think there was a mistake in my 26M figure. It has been a long time, but I guess I carelessly assumed that each file would use up space for one directory as well, because it comes out to around 26M if I assume each file takes 450 * 3 bytes (150 bytes for the directory as well). Corrected it now. Thanks! – Paramatta

Cloudera recommends 1 GB of NameNode heap space per million blocks. 1 GB for every million files is less conservative but should work too.

Also, you don't need to multiply by the replication factor; the accepted answer is wrong on that point.

Using the default block size of 128 MB, a file of 192 MB is split into two block files, one 128 MB file and one 64 MB file. On the NameNode, namespace objects are measured by the number of files and blocks. The same 192 MB file is represented by three namespace objects (1 file inode + 2 blocks) and consumes approximately 450 bytes of memory.

One data file of 128 MB is represented by two namespace objects on the NameNode (1 file inode + 1 block) and consumes approximately 300 bytes of memory. By contrast, 128 files of 1 MB each are represented by 256 namespace objects (128 file inodes + 128 blocks) and consume approximately 38,400 bytes.

Replication affects disk space but not memory consumption. Replication changes the amount of storage required for each block but not the number of blocks. If one block file on a DataNode, represented by one block on the NameNode, is replicated three times, the number of block files is tripled but not the number of blocks that represent them.

Examples:

  1. 1 x 1024 MB file: 1 file inode + 8 blocks (1024 MB / 128 MB). Total = 9 objects * 150 bytes = 1,350 bytes of heap memory.
  2. 8 x 128 MB files: 8 file inodes + 8 blocks. Total = 16 objects * 150 bytes = 2,400 bytes of heap memory.
  3. 1,024 x 1 MB files: 1,024 file inodes + 1,024 blocks. Total = 2,048 objects * 150 bytes = 307,200 bytes of heap memory.
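These totals are easy to reproduce; here is a short Python sketch of the same arithmetic, assuming ~150 bytes per namespace object and the default 128 MB block size:

```python
import math

BYTES_PER_OBJECT = 150   # approximate namenode heap per namespace object
BLOCK_SIZE_MB = 128      # default HDFS block size

def namenode_heap_bytes(num_files: int, file_size_mb: int) -> int:
    """Heap consumed by the file inodes plus their blocks."""
    blocks_per_file = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    objects = num_files * (1 + blocks_per_file)   # inodes + blocks
    return objects * BYTES_PER_OBJECT

print(namenode_heap_bytes(1, 1024))   # 1,350   (1 inode + 8 blocks)
print(namenode_heap_bytes(8, 128))    # 2,400   (8 inodes + 8 blocks)
print(namenode_heap_bytes(1024, 1))   # 307,200 (1,024 inodes + 1,024 blocks)
```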

Even more examples can be found in the original Cloudera article.

Sesquipedalian answered 1/9, 2016 at 20:55 Comment(0)

(Each file's metadata = 150 bytes) + (block metadata for the file = 150 bytes) = 300 bytes, so 1 million files, each with 1 block, will consume 300 * 1,000,000 = 300,000,000 bytes = 300 MB for a replication factor of 1. With a replication factor of 3, it requires 900 MB.

So as a rule of thumb: for every 1 GB of namenode memory, you can store 1 million files.
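A minimal sketch of this rule of thumb in Python, assuming 300 bytes per single-block file and, per this answer, multiplying by the replication factor (note that the answer above argues replication does not multiply namenode memory):

```python
BYTES_PER_FILE = 300   # 150 bytes of file metadata + 150 bytes of block metadata

def namenode_heap_mb(num_files: int, replication: int = 1) -> float:
    # Per this answer's model; replication may not actually multiply heap use.
    return num_files * BYTES_PER_FILE * replication / 1e6

print(namenode_heap_mb(1_000_000))     # 300.0 MB at replication factor 1
print(namenode_heap_mb(1_000_000, 3))  # 900.0 MB at replication factor 3
```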

Platinumblond answered 25/7, 2015 at 20:32 Comment(0)
