How to gather disk usage on a storage system faster than just using "du"? [closed]
I operate a Synology NAS device, and the unit holds data for over 600 users.

The users' backups are tax accounting data, so a single user's folder contains roughly 200,000 files.

I have to report each user's backup data usage, but since there are so many directories and files, the du command takes too long to run.

Could someone suggest a faster way to check each user's disk usage?

Friedlander answered 16/6, 2014 at 0:42 Comment(0)

There is no magic. In order to gather the disk usage, you'll have to traverse the file system. If you are looking for a method of just doing it at the file system level, that would be easy (just df -h, for example)... but it sounds like you want it at a directory level within a mount point.

You could perhaps run jobs in parallel on each directory. For example in bash:

# glob directly instead of parsing ls; quote "$D" so names with spaces survive
for D in */
do
    du -s "$D" &
done

wait

But you are likely to be I/O bound, I think. Also, if you have a lot of top-level directories, this method might be... well... rather taxing, since it has no way of capping the number of concurrent processes.
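If you want to stay with plain bash, here is a minimal sketch of one way to cap the concurrency (it assumes bash 4.3 or newer for wait -n, and the cap of 4 is just an arbitrary starting point, not a recommendation):

max_jobs=4   # assumed cap; tune for your disks

for D in */
do
    # if we are at the cap, wait for one background job to finish
    while (( $(jobs -rp | wc -l) >= max_jobs ))
    do
        wait -n
    done
    du -s "$D" &
done

wait   # collect the remaining jobs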

If you have GNU Parallel installed you can do something like:

ls -d */ | parallel du -s 

...which would be much better. parallel has a lot of nice features like grouping the output, governing the max processes, etc... and you can pass in parameters to tune it (although, like I mentioned earlier, you'll be I/O bound, so more processes is not better; in fact, fewer than the default may be preferable).
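For example, to cap the job count explicitly (the -j flag is a standard GNU Parallel option; the value 2 here is just an assumed starting point for a disk-bound workload):

ls -d */ | parallel -j 2 du -s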

The only other thought I have on this is to perhaps use disk quotas if that is really the point of what you are trying to do. There is a good tutorial here if you want to read about it.
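As a rough sketch, assuming the standard Linux quota tools are installed and the volume is mounted with the usrquota option (the /volume1 mount point is just a Synology-style assumption):

# one-time setup: build the quota accounting files, then turn quotas on
quotacheck -cum /volume1
quotaon /volume1

# afterwards, per-user usage comes straight from the kernel's accounting,
# with no filesystem traversal at all
repquota -u /volume1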

Guelders answered 16/6, 2014 at 1:3 Comment(4)
If the data is stored on rotating media, running multiple requests in parallel is one of the worst things you can do for performance. Seeking is slow, and simultaneous requests cause a great deal of extra seeking.Octaviooctavius
Yep, depends on your situation. I have a clustered NAS I read off of, and parallel does in fact help quite a bit when you have a lot of files to read in that situation. Like anything, you shouldn't blindly implement it. Test it out, see if it actually helps or not before tossing it into production use.Guelders
One would think that to get the total usage of a subtree of the filesystem (excluding links) there would be some sort of summary information maintained and efficiently accessible. This would be similar to what happens at the partition level: df returns instantly the available space. Probably easier to implement at the partition level since it is only about maintaining a count of used blocks. At the filesystem level, this is also doable but is implementation dependent. Maybe it was left out of common filesystem implementations for performance reasons?Festoonery
nice for ls -d */ | parallel du -s !! I wonder if there's an easy command-line way to sum the output sizes from that command?Cape
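(For the question in the last comment: one quick way to sum the first column of that output, assuming GNU du's default 1 KiB units, is to pipe it through awk:)

ls -d */ | parallel du -s | awk '{ sum += $1 } END { print sum, "KiB total" }'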
