How do I find the size of an HDFS file? What command should be used to find the size of any file in HDFS?
You can use the hadoop fs -ls
command to list the files in the current directory along with their details. The fifth column of the output contains the file size in bytes.
For example, the command hadoop fs -ls input
gives the following output:
Found 1 items
-rw-r--r-- 1 hduser supergroup 45956 2012-07-19 20:57 /user/hduser/input/sou
The size of the file sou
is 45956 bytes.
I also find myself using hadoop fs -dus <path>
a great deal. For example, if a directory on HDFS named "/user/frylock/input" contains 100 files and you need the total size of all of those files, you could run:
hadoop fs -dus /user/frylock/input
and you would get back the total size (in bytes) of all of the files in the "/user/frylock/input" directory. (Note that -dus is deprecated in newer Hadoop releases; hadoop fs -du -s is the equivalent.)
Also, keep in mind that HDFS stores data redundantly, so the actual physical storage used by a file might be 3x or more than what is reported by hadoop fs -ls and hadoop fs -dus.
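If you want both of those numbers programmatically, here is a minimal sketch (the class name HdfsSizes and the example path are mine, not part of the original answer) that uses the FileSystem API to print the logical size alongside the replicated, physical size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSizes
{
    public static void main(String[] args) throws Exception
    {
        // Example path; substitute any HDFS file or directory you can read.
        Path path = new Path("/user/frylock/input");
        FileSystem fs = path.getFileSystem(new Configuration());
        ContentSummary summary = fs.getContentSummary(path);
        // Logical size of the data, i.e. what -ls and -du report.
        System.out.println("length: " + summary.getLength());
        // Physical space consumed across all replicas (roughly length * replication factor).
        System.out.println("spaceConsumed: " + summary.getSpaceConsumed());
    }
}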
hadoop fs -du -s -h /user/frylock/input
gives a much more readable output. – Sealey
10.5 G 31.6 G /path
what are these two sizes? – Renfroe
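The first number is the logical size of the data under the path and the second is the physical space consumed across all replicas. Assuming the default replication factor of 3, 10.5 G of data occupies roughly 10.5 G × 3 ≈ 31.5 G of raw storage, which lines up with the 31.6 G shown.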
I used the function below to get the file size.
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetflStatus
{
    public long getflSize(String args) throws IOException, FileNotFoundException
    {
        Configuration config = new Configuration();
        Path path = new Path(args);
        // Look up the FileSystem behind the path (HDFS here).
        FileSystem hdfs = path.getFileSystem(config);
        // getContentSummary() aggregates the size over a file or a whole directory tree.
        ContentSummary cSummary = hdfs.getContentSummary(path);
        long length = cSummary.getLength();
        return length;
    }
}
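A minimal usage sketch for the helper above; the driver class name is a placeholder, and the path simply reuses the sou example file from the first answer:

public class GetflStatusDemo
{
    public static void main(String[] args) throws Exception
    {
        // Calls the helper defined above with an example HDFS path.
        long size = new GetflStatus().getflSize("/user/hduser/input/sou");
        System.out.println("Size in bytes: " + size);
    }
}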
The commands below, combined with an awk script, report the total size (in GB) of the files matching a glob in HDFS:
hadoop fs -du -s /data/ClientDataNew/*A* | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'
output ---> 2.089GB
hadoop fs -du -s /data/ClientDataNew/*B* | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'
output ---> 1.724GB
hadoop fs -du -s /data/ClientDataNew/*C* | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'
output ---> 0.986GB
hdfs dfs -du -s -h /directory
This is the human-readable version; without -h it reports the size in plain bytes, which is a much larger number and harder to read.
If you want to do it through the API, you can use the getFileStatus() method.
If you want to know the size of each file inside a directory, add the '*' wildcard at the end:
hadoop fs -du -s -h /tmp/output/*
I hope this helps your purpose.
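For completeness, a minimal sketch of the API route (the class name and paths below are placeholders), using FileSystem.getFileStatus() for a single file and globStatus() as the API counterpart of the '*' wildcard:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileStatusExample
{
    public static void main(String[] args) throws Exception
    {
        FileSystem fs = FileSystem.get(new Configuration());
        // Size of a single file via getFileStatus().
        FileStatus status = fs.getFileStatus(new Path("/tmp/output/part-r-00000"));
        System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
        // Size of each entry matched by a wildcard, the API analogue of /tmp/output/*.
        for (FileStatus s : fs.globStatus(new Path("/tmp/output/*")))
        {
            System.out.println(s.getPath() + " : " + s.getLen() + " bytes");
        }
    }
}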