I have a very big storage disk (16T). I want to run 'du' on it to figure out how much space each subdirectory takes. However, that takes a very long time. Luckily, I have a cluster of computers at my disposal. I could therefore run 'du' in parallel, one job per subdirectory, with a simple script. Is there already such a thing, or must I write it myself?
It is simple to do it using GNU Parallel:
parallel du ::: */*

In case you are wondering what the ::: incantation does, search for "::: arguments" in the documentation at gnu.org/software/parallel/man.html: "Use arguments from the command line as input source instead of stdin (standard input). Unlike other options for GNU parallel ::: is placed after the command and before the arguments." – Enrollee
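If GNU Parallel is not installed, a similar effect can be had with xargs -P (the -0 and -P options are common GNU/BSD extensions), and GNU sort -h keeps the human-readable sizes in order. The scratch directory and its subdirectory names below are only for demonstration; in practice, cd into the directory you care about and run just the final pipeline.

```shell
# Demo only: build a scratch directory with a few subdirectories.
cd "$(mktemp -d)"
mkdir -p music video backups

# One "du -sh" per subdirectory, up to 8 jobs at a time,
# sorted by human-readable size (GNU sort -h).
printf '%s\0' */ | xargs -0 -n1 -P8 du -sh | sort -h
```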
I wrote sn for myself, but you might appreciate it too.
sn p .

will give you the sizes of everything in the current directory. It runs in parallel and is faster than du on large directories.
sn o -n30

puts a 123GB directory below a 251MB one :( It seems the sorting does not respect the humanised format. – Maki

It is not clear from your question how your storage is designed (a RAID array, NAS, NFS, or something else).
But, almost regardless of the actual technology, running du in parallel may not be such a good idea after all: it is very likely to slow things down. A disk array has limited IOPS capacity, and multiple du threads will all draw from that pool. Even worse, a single du often slows down other IO operations many times over, even if the du process itself does not consume much disk throughput.
By comparison: if you have just a single CPU, running a parallel make (make -j N) will slow the build down, because process switching has considerable overhead. The same principle applies to disks, especially spinning disks. The only situation in which you will gain a considerable speed increase is when you have N drives mounted at independent directories (something like /mnt/disk1, /mnt/disk2, ..., /mnt/diskN). In that case, run du in N threads, one per disk.
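The N-drive case above can be sketched with plain shell job control; the /mnt/diskN mount points are placeholders for whatever your layout actually is:

```shell
#!/bin/sh
# One du per independently mounted disk, all running concurrently.
# /mnt/disk1 ... /mnt/diskN are placeholders; substitute your mounts.
for d in /mnt/disk*; do
    du -sh "$d" &
done
wait   # block until every background du has finished
```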
One common improvement to speed up du is to mount your disks with the noatime flag. Without this flag, massive disk scanning creates a lot of write activity to update access times. With noatime, that write activity is avoided and du runs much faster.
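For example (the mount point and device below are placeholders; adjust them to your system, and note that remounting requires root):

```shell
# Remount an already-mounted filesystem without access-time updates.
# /mnt/disk1 is a placeholder for your actual mount point.
mount -o remount,noatime /mnt/disk1

# To make it permanent, add noatime to that filesystem's options
# field in /etc/fstab, e.g.:
#   /dev/sdb1  /mnt/disk1  ext4  defaults,noatime  0  2
```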
Use mount and df to check where and how the directory that holds the 16TB drive is mounted. – Commemorate

du in a single thread over the whole array took just 12 minutes. – Commemorate

Creating an ls -lR file should be a very easy and quick operation. – Commemorate
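The ls -lR idea above can be sketched like this; the paths are placeholders, and the awk field number assumes the usual ls -l column layout:

```shell
# Capture one recursive listing; later size questions can be answered
# by parsing the saved file instead of re-scanning the disk.
# /mnt/disk1 and /tmp/ls-lR.txt are placeholders.
ls -lR /mnt/disk1 > /tmp/ls-lR.txt

# Example: sum the byte counts (field 5 of each file line) with awk.
awk '$5 ~ /^[0-9]+$/ { total += $5 } END { print total+0 }' /tmp/ls-lR.txt
```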