Data block size in HDFS: why 64 MB?

The default data block size of HDFS/Hadoop is 64 MB, while the block size on a disk is generally 4 KB.

What does a 64 MB block size mean? Does it mean that the smallest unit read from disk is 64 MB?

If yes, what is the advantage of doing that? Is it just to make continuous access of large files in HDFS easier?

Can we do the same thing using the disk's original 4 KB block size?

Writing answered 20/10, 2013 at 3:56 Comment(0)

What does a 64 MB block size mean?

The block size is the smallest data unit that a file system can store. If you store a file that's 1 KB or 60 MB, it'll take up one block. Once you cross the 64 MB boundary, you need a second block.

If yes, what is the advantage of doing that?

HDFS is meant to handle large files. Let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000 requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead. Each request has to be processed by the Name Node to determine where that block can be found. That's a lot of traffic! If you use 64 MB blocks, the number of requests goes down to 16, significantly reducing the cost of overhead and load on the Name Node.
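As a back-of-the-envelope sketch of that arithmetic (plain Java, nothing HDFS-specific; the file and block sizes are just the ones used in this answer):

    // How many block lookups a client needs for a given file and block size.
    public class BlockCountSketch {
        static long blocksNeeded(long fileBytes, long blockBytes) {
            // The last block may be partial, so round up.
            return (fileBytes + blockBytes - 1) / blockBytes;
        }

        public static void main(String[] args) {
            long fileSize = 1000L * 1024 * 1024;                            // 1000 MB
            System.out.println(blocksNeeded(fileSize, 4L * 1024));          // 256000 requests with 4 KB blocks
            System.out.println(blocksNeeded(fileSize, 64L * 1024 * 1024));  // 16 requests with 64 MB blocks
        }
    }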

Erythroblastosis answered 20/10, 2013 at 4:6 Comment(18)
Thanks for your answer. Assume the block size is 4 KB and a file is stored in contiguous blocks on the disk. Why can't we retrieve a 1000 MB file using one request? I know HDFS may not currently support such an access method, but what is the problem with such an access method?Writing
"In the case of small files, let's say that you have a bunch of 1 KB files, and your block size is 4 KB. That means that each file is wasting 3 KB, which is not cool." - this is not true in the case of HDFS. Let's say the file is 100 MB; then the blocks are 64 MB and 36 MB. Usually the last block is smaller, unless the file size is a multiple of 64 MB.Principally
@Writing HDFS could retrieve a 1000 MB file in one request if the block sizes were at least 1000 MB.Erythroblastosis
Does this mean that, if I store a 1 MB file in HDFS with a block size of 64 MB, it will take up 64 MB of HDFS storage capacity?Linotype
@Linotype No, a 1 MB file will not take up 64 MB on disk.Erythroblastosis
This answer is just plain wrong. What "block" or "block size" means is dependent on the file system and in the case of HDFS it does not mean the smallest unit it can store, it's the smallest unit the namenode references. And a block is usually stored sequentially on a physical disk, which makes reading and writing a block fast. For small files the block size doesn't matter much, because they will be smaller than the blocksize anyway and stored as a smaller block. So bigger block sizes are generally better but one has to weigh that against the desired amount of data and mapper distribution.Genu
@DavidOngaro Saying that the block size is the smallest unit that a namenode references is correct...my explanation is a slight oversimplification. I'm not sure why that makes the answer 'just plain wrong,' though.Erythroblastosis
@bstempi: At least 3 things are wrong: 1. Most files are not a multiple of the block size, so the last block is actually smaller than the default block size (and yes, it's referenced by the namenode). 2. There is no such thing as "the block size", since the block size is a file attribute and can therefore differ from file to file (there is a "default block size" though, and that is what the question was about). 3. You give wrong advice: if you already have mostly small files, making the block size even smaller would make things worse.Genu
@DavidOngaro I agree with your third point and will update my answer when I'm not on mobile. Your first point is an assumption on your part: I never stated that a partial block takes the space of a full block. The second point is just nitpicking; the vast majority of people don't change this attribute from file to file, so I didn't think it was worth mentioning.Erythroblastosis
@bstempi: The second point is not nitpicking. If you change the default block size, it leaves your old files at the old block size unless you migrate them explicitly with distcp, and it's important to understand that. So even if you never set it at the individual file level, you can end up with files of different block sizes. It also proves the first point: if there is no single block size, the namenode cannot use the "default blocksize" as the smallest addressable unit. No need for making assumptions here.Genu
@DavidOngaro: but that's not the question that was asked. The question was about the default block size. If you take that much issue with my answer, then I suggest you submit an edit or submit your own answer.Erythroblastosis
@bstempi, you had mentioned that by using 64 MB blocks, the number of requests would be reduced to 16. But the operating system processes data in 4 KB blocks; it won't be processing the data in 64 MB units. How does that work in the background? How can we reduce the seek time with a 64 MB block when the operating system only processes 4 KB blocks at a time?Systaltic
@BasilPaul: I now tried to answer your question in aspect 3 of https://mcmap.net/q/353898/-how-to-set-data-block-size-in-hadoop-is-it-advantage-to-change-it. Basically disk seek time doesn't matter much when you already have a network transfer going on, but TCP throughput matters. So I wouldn't argue that the advantage is the lower number of "requests", but the lower number of persistent TCP connections.Genu
@DavidOngaro, first, regarding disk throughput: you had mentioned that data would be written sequentially to the disk. So if I have 64 MB of data on a slave node, the 8 KB blocks making up that 64 MB would be laid out sequentially on the slave node's disk to cut down seek time. When the data is written or read, is it processed in 8 KB blocks? Is that correct?Systaltic
The goal of using a larger block size is to reduce the number of requests to the name node. So, someone would say, "where can I find this file?" to the name node, which would say, "You can find this 64 MB chunk here, and this one here, etc." The requestor would go to each place that a chunk is stored and say, "hey, give me a copy of that chunk." What HDFS recognizes as a block is not the same as the underlying FS. If the underlying FS uses 4 KB blocks, then that 64 MB chunk would take roughly 16,000 of them. That's fine because the latency on those is low, unlike the latency to the name node.Erythroblastosis
To state it differently: the goal is not to reduce the operating system's burden in storing the blocks; it's to reduce how much thinking the name node has to do when it comes to remembering where each piece of the file is. The name node is much slower than a local OS's file system.Erythroblastosis
@Erythroblastosis : So how are these chunks handled by the OS? Do we have a seek (not skip) option in HDFS to access a byte position at a random point?Illsorted
Is there any tool available to visually look at how these chunks are stored on the OS file system of the respective data nodes?Illsorted

HDFS's design was originally inspired by the design of the Google File System (GFS). Here are the reasons for large block sizes as stated in the original GFS paper (note on GFS terminology vs HDFS terminology: chunk = block, chunkserver = datanode, master = namenode):

A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. [...] Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.

Finally, I should point out that the current default size in Apache Hadoop is 128 MB (see dfs.blocksize).
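For readers who want to see where that setting lives, here is a minimal client-side sketch using the standard Hadoop FileSystem API; the path and the exact sizes are only illustrative, and the cluster-wide default normally comes from dfs.blocksize in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Override the client-side default (normally read from hdfs-site.xml).
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);
            Path p = new Path("/tmp/example.dat");              // hypothetical path
            System.out.println(fs.getDefaultBlockSize(p));      // effective block size in bytes

            // Block size is a per-file attribute, fixed when the file is created.
            fs.create(p, true /*overwrite*/, 4096 /*buffer*/, (short) 3 /*replication*/,
                      64L * 1024 * 1024 /*block size for this file*/).close();
        }
    }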

Danelaw answered 21/10, 2013 at 14:56 Comment(0)

In HDFS the block size controls the level of replication declustering. The lower the block size, the more evenly your blocks are distributed across the DataNodes. The higher the block size, the less evenly your data is potentially distributed in your cluster.

So what's the point of choosing a higher block size instead of some low value? While in theory an equal distribution of data is a good thing, having too low a block size has some significant drawbacks. The NameNode's capacity is limited, so having a 4 KB block size instead of 128 MB also means having 32,768 times more information to store. MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManagers and more CPU cores, but in practice the theoretical benefits are lost by not being able to perform sequential, buffered reads and because of the latency of each map task.
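To put rough numbers on the NameNode argument, here is a small sketch; the per-block memory figure is only a commonly quoted rule of thumb, not an exact value:

    // Blocks the NameNode must track for 1 TB of data at two block sizes.
    public class NamenodeLoadSketch {
        public static void main(String[] args) {
            long data = 1L << 40;                            // 1 TB of raw data
            long blocksSmall = data / (4L * 1024);           // 4 KB blocks  -> 268,435,456 blocks
            long blocksLarge = data / (128L * 1024 * 1024);  // 128 MB blocks -> 8,192 blocks
            System.out.println(blocksSmall / blocksLarge);   // 32768x more metadata entries

            long bytesPerBlock = 150;                        // assumed rule-of-thumb NameNode heap per block
            System.out.println(blocksSmall * bytesPerBlock / (1024 * 1024)); // ~38,400 MB of heap
            System.out.println(blocksLarge * bytesPerBlock / 1024);          // ~1,200 KB of heap
        }
    }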

Statolith answered 21/10, 2015 at 8:5 Comment(2)
From "MapReduce could also profit from equally distributed data by launching more map tasks on more NodeManager and more CPU cores" - means map reduce task is applied over huge amount of data?Illsorted
I couldn't clearly get you here " but in practice theoretical benefits will be lost on not being able to perform sequential, buffered reads and because of the latency of each map task". Can you please elaborate on this?Illsorted

In a normal OS the block size is 4 KB, while in Hadoop it is 64 MB. The reason is to make it easy to maintain the metadata in the NameNode.

Suppose the block size in Hadoop were only 4 KB and we tried to load 100 MB of data: we would need a huge number of 4 KB blocks, and the NameNode would have to maintain the metadata for all of them.

If we use a 64 MB block size, the data is loaded into only two blocks (64 MB and 36 MB), so the amount of metadata is decreased.

Conclusion: to reduce the burden on the NameNode, HDFS prefers a 64 MB or 128 MB block size. The default block size is 64 MB in Hadoop 1.0 and 128 MB in Hadoop 2.0.

Betthezul answered 2/7, 2015 at 5:32 Comment(0)

It has more to do with disk seeks on HDDs (hard disk drives). Over time, disk seek time has not improved much compared to disk throughput. So when the block size is small (which leads to too many blocks), there are too many disk seeks, which is not very efficient. As we move from HDDs to SSDs, disk seek time matters less, since SSDs have no moving parts.

Also, if there are too many blocks, it will strain the Name Node. Note that the Name Node has to store the entire metadata (data about blocks) in memory. In Apache Hadoop the default block size is 64 MB, and in Cloudera Hadoop the default is 128 MB.

Principally answered 20/10, 2013 at 7:53 Comment(12)
So you mean the underlying implementation of a 64 MB block read is not broken down into many 4 KB block reads from the disk? Does the disk support reading 64 MB in one read? Please feel free to ask me for clarification if the question is not clear. Thanks.Writing
A 64 MB HDFS block will be split into multiple 4 KB OS file system blocks.Principally
If a 64 MB HDFS block will be split into multiple 4 KB blocks, what's the point of using a 64 MB HDFS block?Writing
To reduce load on the Name Node. Fewer blocks to track = fewer requests and less memory spent tracking blocks.Erythroblastosis
The current default size in Apache Hadoop is 128 MB.Danelaw
So is there really no advantage of the block size being 64 or 128 MB with regard to sequential access, since each block may be split into multiple native file system blocks?Hylophagous
There is an advantage, as these 4 KB blocks are stored on the disk contiguously, which means there is no additional seek between one 4 KB block and the next.Kermitkermy
The default block size for Hadoop 1 was 64 MB; for Hadoop 2 it's 128 MB.Boondocks
@Rags, you mean to say that if I have 64 MB of data on the disk of a slave node, the 8 KB blocks making up that 64 MB would be laid out sequentially to cut down extra seek time. Is that correct?Systaltic
@Rags, if that is the case, how will the file system save the data in a sequential manner? That is not usually how data is saved on disk; data gets scattered across the disk. How does the sequential layout of data on disk happen? Could you give me clarity on this part? A bit confused.Systaltic
@Basil Paul, that is a very good question. The intent is to get contiguous blocks from the underlying file system. In a production setup HDFS gets its own volumes, so getting contiguous blocks is not an issue. If you mix in other storage like MapReduce temp data, etc., then the issue arises. How exactly it is managed I am not sure. You may have to open the code and see how it is managed.Kermitkermy
So, to summarize: minimize the odds that a file can't be stored in a single block? And even if the records are not stored linearly, the whole block is loaded into memory, so after the transfer the seek time disappears? But does that really help if the underlying FS uses 4 KB blocks? I guess it means fewer requests for blocks, e.g. asking for 64k once over a network is cheaper than asking for 4k 16 times, saving 15 round trips.Anthesis

Below is what the book "Hadoop: The Definitive Guide", 3rd edition, explains (p. 45).

Why Is a Block in HDFS So Large?

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.

A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.

This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.
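To make the book's back-of-the-envelope calculation concrete, here is a small sketch; the 10 ms seek time and 100 MB/s transfer rate are the book's example figures, not measurements:

    // Block size needed so that seek time is at most `fraction` of transfer time.
    public class SeekOverheadSketch {
        static double minBlockSizeMB(double seekMs, double transferMBps, double fraction) {
            return (seekMs / 1000.0) * transferMBps / fraction;
        }

        public static void main(String[] args) {
            // 10 ms seek, 100 MB/s transfer, seek <= 1% of transfer time -> 100 MB blocks
            System.out.println(minBlockSizeMB(10, 100, 0.01));
        }
    }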

Vaudois answered 21/10, 2013 at 9:27 Comment(1)
Is it possible to store multiple small files (say, 1 KB each) in a single 64 MB block? If we could store multiple small files in a block, how would the nth file in a block be read: would the file pointer seek to that particular nth file's offset, or would it skip the first n-1 files before reading the nth file's content?Illsorted
  1. If the block size were set to less than 64 MB, there would be a huge number of blocks throughout the cluster, which would cause the NameNode to manage an enormous amount of metadata.
  2. Since we need a Mapper for each block, there would be a lot of Mappers, each processing a tiny piece of data, which isn't efficient (see the sketch below).
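As a rough sketch of point 2: the split size roughly mirrors FileInputFormat's max(minSize, min(maxSize, blockSize)), and with default settings the split size equals the block size, so the mapper count tracks the block count.

    // Estimate how many map tasks a single input file produces.
    public class SplitCountSketch {
        static long splitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long file = 10L * 1024 * 1024 * 1024;   // 10 GB input file
            System.out.println(file / splitSize(64L * 1024 * 1024, 1L, Long.MAX_VALUE)); // 160 map tasks at 64 MB
            System.out.println(file / splitSize(4L * 1024, 1L, Long.MAX_VALUE));         // 2,621,440 map tasks at 4 KB
        }
    }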
Joeyjoffre answered 18/11, 2013 at 20:11 Comment(2)
I agree with (1), but not with (2). The framework could (by default) just have each mapper deal with multiple data blocks.Danelaw
Each mapper processes a split, not a block. Furthermore, even if a mapper is assigned a split of N blocks, the end of the split may be a partial record, causing the Record Reader (this is specific to each record reader, but generally true for the ones that come with Hadoop) to read the rest of the record from the next block. The point is that mappers often cross block boundaries.Erythroblastosis

The reason Hadoop chose 64 MB is that Google chose 64 MB. The reason Google chose 64 MB comes down to a Goldilocks argument.

Having a much smaller block size would cause seek overhead to increase.

Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.

Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.

See Google Research Publication: MapReduce http://research.google.com/archive/mapreduce.html

Fortress answered 3/6, 2015 at 13:21 Comment(1)
This was already mentioned in my answer. It would have been preferable to add comments to my answer rather than to post an answer that adds very little to prior answers.Danelaw
