Linux: writes are split into 512K chunks

I have a user-space application that generates large SCSI writes (details below). However, when I look at the SCSI commands that reach the SCSI target (i.e. the storage, connected over FC), something is splitting these writes into 512K chunks.

The application basically issues 1M-sized O_DIRECT writes straight to the device:

fd = open("/dev/sdab", ..|O_DIRECT);
write(fd, ..., 1024 * 1024);

This code causes two SCSI WRITEs to be sent, 512K each.

However, if I issue a SCSI command directly, bypassing the block layer, the write is not split. I run the following from the command line:

sg_dd bs=1M count=1 blk_sgio=1 if=/dev/urandom of=/dev/sdab oflag=direct

In that case I see a single 1M-sized SCSI WRITE.

The question is, what is splitting the write and, more importantly, is it configurable? The Linux block layer seems to be the culprit (because SG_IO doesn't pass through it), and 512K seems too arbitrary a number not to be some sort of configurable parameter.

Headcheese answered 8/5, 2012 at 7:41

The blame is indeed on the block layer; the SCSI layer itself has little regard for the size. You should check, though, that the underlying layers are actually able to pass your request, especially with direct I/O, since the buffer may be split into many small pages and so require a scatter-gather list longer than what the hardware, or even just the driver, can support (libata is/was somewhat limited).

Look at the tunables under /sys/class/block/$DEV/queue. There are assorted files there, and the one most likely to match what you need is max_sectors_kb, but you can just try things out and see what works for you. You may also need to tune the partitions' variables.

Bougainville answered 14/7, 2012 at 20:59

As described in an answer to the "Why is the size of my IO requests being limited, to about 512K" Unix & Linux Stack Exchange question and the "Device limitations" section of the "When 2MB turns into 512KB" document by kernel block layer maintainer Jens Axboe, this can be because your device and kernel have size restrictions (visible in /sys/block/<disk>/queue/):

  • max_hw_sectors_kb: the maximum size of a single I/O the hardware can accept
  • max_sectors_kb: the maximum size the block layer will send
  • max_segment_size and max_segments: the DMA engine limitations for scatter-gather (SG) I/O (the maximum size of each segment and the maximum number of segments for a single I/O)
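
The limits listed above can also be read from a program rather than by hand. The following is a minimal sketch, assuming the attributes live under /sys/block/<disk>/queue/ as described; the helper names (queue_attr_path, read_ulong_file, print_queue_limits) are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build "/sys/block/<disk>/queue/<attr>" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
static int queue_attr_path(char *buf, size_t len, const char *disk, const char *attr)
{
    assert(buf != NULL && disk != NULL && attr != NULL);
    int n = snprintf(buf, len, "/sys/block/%s/queue/%s", disk, attr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: read one unsigned decimal value from a file
 * such as a sysfs attribute. Returns 0 on success, -1 on failure. */
static int read_ulong_file(const char *path, unsigned long *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the four limits for one disk name, e.g. "sdab". */
static void print_queue_limits(const char *disk)
{
    static const char *const attrs[] = {
        "max_hw_sectors_kb", "max_sectors_kb", "max_segment_size", "max_segments",
    };
    for (size_t i = 0; i < sizeof attrs / sizeof attrs[0]; i++) {
        char path[256];
        unsigned long value;
        if (queue_attr_path(path, sizeof path, disk, attrs[i]) != 0)
            continue;
        if (read_ulong_file(path, &value) == 0)
            printf("%-18s %lu\n", attrs[i], value);
        else
            printf("%-18s (unreadable)\n", attrs[i]);
    }
}
```

Calling print_queue_limits("sdab") from main() would print one line per attribute; the values themselves depend entirely on the device and driver.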

The segment restrictions matter a lot when the buffer the I/O comes from is not contiguous; in the worst case each segment can be as small as a page (4096 bytes on x86 platforms). This means a single scatter-gather I/O can be limited to a size of 4096 * max_segments bytes.

The question is, what is splitting the write

As you guessed, the Linux block layer.

and, more importantly, is it configurable?

You can fiddle with max_sectors_kb, but the rest are fixed and come from device/driver restrictions (so I'm going to guess that in your case the answer is probably no, though you might see bigger I/O directly after a reboot thanks to less memory fragmentation).
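
For completeness, changing max_sectors_kb from a program amounts to writing a decimal number into the sysfs file. This is only a sketch under assumptions: it needs root, the helper names are invented for illustration, and the caller is expected to pass in the max_hw_sectors_kb value it read earlier, because the kernel documentation says max_sectors_kb must not exceed the hardware limit.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical check: a request above max_hw_sectors_kb (or zero)
 * is refused up front instead of being sent to the kernel. */
static int valid_max_sectors_kb(unsigned long requested_kb, unsigned long max_hw_sectors_kb)
{
    return requested_kb > 0 && requested_kb <= max_hw_sectors_kb;
}

/* Hypothetical helper: write a decimal value to an attribute file.
 * Returns 0 on success, -1 on failure. */
static int write_ulong_file(const char *path, unsigned long value)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = (fprintf(f, "%lu\n", value) > 0);
    /* The buffered data is flushed on close, so a write error
     * may only show up here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Set /sys/block/<disk>/queue/max_sectors_kb for a disk name such as "sdab". */
static int set_max_sectors_kb(const char *disk, unsigned long requested_kb,
                              unsigned long max_hw_sectors_kb)
{
    char path[256];
    assert(disk != NULL);
    if (!valid_max_sectors_kb(requested_kb, max_hw_sectors_kb))
        return -1;
    int n = snprintf(path, sizeof path, "/sys/block/%s/queue/max_sectors_kb", disk);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    return write_ulong_file(path, requested_kb);
}
```

Raising this value only helps if the segment limits are not the tighter constraint, and the setting does not persist across reboots.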

512K seems too arbitrary a number not to be some sort of a configurable parameter

The value is likely related to fragmented SG buffers. Let's assume you're on an x86 platform and have a max_segments of 128, so:

4096 * 128 / 1024 = 512

and that's where 512K could come from.
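
The same arithmetic can be written as a small sketch that also shows why a 1M write would arrive as two commands. It assumes the worst case of one page per segment, ignores max_segment_size, and treats the effective limit as the smaller of max_sectors_kb and the segment-derived size; the function names and the 1280 used in the usage note are illustrative assumptions, not values taken from the question.

```c
#include <assert.h>

/* Worst case for a fragmented buffer: every segment holds just one page. */
static unsigned long worst_case_sg_kb(unsigned long page_size, unsigned long max_segments)
{
    assert(page_size > 0 && max_segments > 0);
    return page_size * max_segments / 1024;
}

/* A request has to satisfy both limits, so the smaller one wins. */
static unsigned long effective_limit_kb(unsigned long max_sectors_kb,
                                        unsigned long page_size,
                                        unsigned long max_segments)
{
    unsigned long sg_kb = worst_case_sg_kb(page_size, max_segments);
    return sg_kb < max_sectors_kb ? sg_kb : max_sectors_kb;
}

/* Number of requests a write of write_kb is split into, rounded up. */
static unsigned long requests_needed(unsigned long write_kb, unsigned long limit_kb)
{
    assert(limit_kb > 0);
    return (write_kb + limit_kb - 1) / limit_kb;
}
```

With a 4096-byte page and max_segments of 128, worst_case_sg_kb gives 512; with a hypothetical max_sectors_kb of 1280 the effective limit is still 512, and requests_needed(1024, 512) gives the two 512K WRITEs seen in the question.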

Bonus chatter: according to https://twitter.com/axboe/status/1207509190907846657 , if your device uses an IOMMU rather than a DMA engine then you shouldn't be segment limited...

Donets answered 19/12, 2019 at 4:26

The block driver has a max-sectors-per-request attribute. I'd have to check how to modify it. You used to be able to read this value via blockdev --getmaxsect, but I'm not seeing the --getmaxsect option in my machine's blockdev.
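
If blockdev does not offer the option, the value can be queried with an ioctl instead. This is a sketch, assuming BLKSECTGET from <linux/fs.h> is the ioctl that reports max sectors per request and that it fills in an unsigned short; the function names are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKSECTGET */
#include <sys/ioctl.h>
#include <unistd.h>

/* Convert a count of 512-byte sectors to KiB. */
static unsigned long sectors_to_kb(unsigned long sectors)
{
    return sectors / 2;
}

/* Hypothetical helper: return max sectors per request for a block
 * device node such as "/dev/sdab", or -1 on any error. */
static long query_max_sectors(const char *dev_path)
{
    assert(dev_path != NULL);
    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -1;
    unsigned short max_sectors = 0;   /* assumption: kernel stores an unsigned short */
    int rc = ioctl(fd, BLKSECTGET, &max_sectors);
    close(fd);
    return rc == 0 ? (long)max_sectors : -1;
}
```

On a device limited to 512K per request, query_max_sectors would be expected to return 1024, which sectors_to_kb turns into 512. Opening the device node normally requires root or suitable group membership.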

Unstopped answered 10/6, 2012 at 19:24

Looking at the following files should tell you whether the logical block size is different, possibly 512 in your case. I am not sure, however, whether you can write to these files to change those values (the logical block size, that is).

/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size

Tenement answered 14/5, 2012 at 17:47

Comment (Headcheese): No, block sizes are not relevant; it splits the data into 512-kilobyte chunks, not 512-byte ones.

Try ioctl(fd, BLKSECTSET, &blocks).

Georginageorgine answered 16/2, 2013 at 2:56