Run du in parallel

I have a very big storage disk (16T). I want to run 'du' on it to figure out how much space each subdirectory takes. However, that takes a very long time. Luckily, I have at my disposal a cluster of computers. I can therefore run 'du' in parallel, with each job running on a separate subdirectory, and write a simple script to do that. Is there already such a thing or must I write it myself?

Nanna answered 7/7, 2014 at 8:0 Comment(1)
Just note that IO can't always be multi-tasked. – Crellen

It is simple to do using GNU Parallel:

parallel du ::: */*
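
If you only need a per-directory summary sorted by size, a variant along these lines should also work (just a sketch; du -s summarizes each argument, -h prints human-readable sizes, and sort -h understands those suffixes):

# summarize each first-level directory, sorted smallest to largest
parallel du -sh ::: */ | sort -h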
Paludal answered 26/7, 2014 at 17:45 Comment(4)
If anyone is wondering what the magic ::: incantation does, search for "::: arguments" in the documentation: gnu.org/software/parallel/man.html: "Use arguments from the command line as input source instead of stdin (standard input). Unlike other options for GNU parallel ::: is placed after the command and before the arguments." – Enrollee
Spend 15 minutes reading chapters 1+2 if you want to learn more: doi.org/10.5281/zenodo.1146014 – Paludal
Oh, that's great! Thank you for sharing this book! :) – Enrollee
Note: IIUC, this will run du on each "second level directory" below the current level. So it could work well if your data is distributed across those directories, but won't do much if (say) most of your data is in one of those subdirectories. – Emilio
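
If most of the data does sit in a handful of subdirectories, one workaround (a sketch only, not part of the answer above, using standard find and GNU parallel options) is to feed directories from find at whatever depth matches the layout:

# summarize every directory two levels down, NUL-separated to survive odd names
find . -mindepth 2 -maxdepth 2 -type d -print0 | parallel -0 du -sh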

Is there already such a thing or must I write it myself?

I wrote sn for myself, but you might appreciate it too.

sn p .

will give you sizes of everything in the current directory. It runs in parallel and is faster than du on large directories.

Atlante answered 16/11, 2017 at 16:27 Comment(2)
Have you considered applying to Homebrew and adding your tool as an install recipe? – Maki
Furthermore, executing sn o -n30 puts a 123GB directory below a 251MB one. :( It seems that the sorting does not respect the humanised format. – Maki

It is not clear from your question how your storage is designed (RAID array, NAS, NFS or something else).

But, almost regardless of the actual technology, running du in parallel may not be such a good idea after all - it is very likely to slow things down.

A disk array has limited IOPS capacity, and multiple du threads will all draw from that pool. Even worse, a single du often slows down other IO operations many times over, even if the du process does not consume much disk throughput.
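
If you want to see this on your own system, one way (assuming the sysstat tools are installed) is to watch device utilization while the du jobs run:

# extended per-device statistics every second; %util near 100 means the disk itself is the bottleneck
iostat -x 1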

By comparison, if you have just a single CPU, running a parallel make (make -j N) will slow down the build process because process switching has considerable overhead.

The same principle applies to disks, especially spinning disks. The only situation in which you will gain a considerable speedup is when you have N drives mounted in independent directories (something like /mnt/disk1, /mnt/disk2, ..., /mnt/diskN). In that case, you should run du in N threads, one per disk.
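
In that layout, a minimal sketch (the /mnt/diskN paths are the placeholders from above) could be as simple as one background du per mount point:

# one du per independent disk, all running concurrently
for d in /mnt/disk1 /mnt/disk2 /mnt/disk3; do
    du -sh "$d" &
done
wait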

One common improvement to speed up du is to mount your disks with the noatime flag. Without it, a massive disk scan creates a lot of write activity to update access times. With noatime, that write activity is avoided and du runs much faster.
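
For illustration (the device and mount point below are placeholders), the flag can be applied to an already-mounted filesystem or made permanent in /etc/fstab:

# remount without access-time updates
mount -o remount,noatime /mnt/disk1

# example /etc/fstab line (adjust device and filesystem type)
/dev/sdb1  /mnt/disk1  ext4  defaults,noatime  0  2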

Commemorate answered 7/7, 2014 at 8:18 Comment(7)
This is my university's storage, so I'm not familiar with the details. However, since this is a big disk (or set of disks) whose purpose is to serve as the storage for a cluster (Condor in this case), I assume it is designed to support multiple, if not many, IO operations at once. – Nanna
How are your client computers using this storage? An NFS mount? If so, a parallel scan might work, because NFS has considerable network round-trip overhead. – Commemorate
Is there a way for me to check this myself (some command to run)? – Nanna
Assuming your client computers are Linux or other Unix-like systems, a simple check would be to use mount and df to see where and how the directory holding the 16TB drive is mounted. – Commemorate
Yep: ... type nfs (rw,nosuid,relatime,vers=3,rsize=16384,wsize=16384,namlen=255,soft,proto=tcp,port=2049,timeo=25,retrans=3,sec=sys,local_lock=none,addr=x.x.x.x) – Nanna
You might have better luck if you could somehow get local access to that storage - NFS is notoriously slow in these situations. On my home server, I have a 12TB RAID array with 8TB used (4 million files), and a local single-threaded du over the whole array took just 12 minutes. – Commemorate
One more thought - many sites/servers have server-side scripts that automatically create ls-lR files. If you have something like this in place, all you need to do is analyze the ls-lR file - that should be a very easy and quick operation. – Commemorate
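
For illustration only (ls -lR output formats vary slightly), a small awk sketch that sums the byte sizes listed under each directory header of such a file might look like this; note it gives per-directory totals, not cumulative subtree totals:

# sum column 5 (bytes) of regular-file lines under each "dir:" header
awk '/:$/ {dir=$0} /^-/ {sz[dir]+=$5} END {for (d in sz) printf "%d\t%s\n", sz[d], d}' ls-lR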
