How to find the size of an HDFS file
How do I find the size of an HDFS file? What command should be used to find the size of a file in HDFS?

Meretricious answered 20/7, 2012 at 7:2 Comment(0)
25

You can use the hadoop fs -ls command to list files in the current directory along with their details. The fifth column of the output contains the file size in bytes.

For example, the command hadoop fs -ls input gives the following output:

Found 1 items
-rw-r--r--   1 hduser supergroup      45956 2012-07-19 20:57 /user/hduser/input/sou

The size of the file sou is 45956 bytes.

Disario answered 20/7, 2012 at 8:12 Comment(3)
How would you output the size in human-readable form? -ls -lah doesn't work here. - Trommel
How can we check space usage per user xxxx? - Aerostation
@ivan_bilan - hadoop fs -ls -h works. Multiple options have to be specified separately, i.e. hadoop fs -ls -R -h for a recursive listing. - Jade
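If you need the size in a script, the fifth column can be extracted with awk. A minimal sketch, run here against the sample line from the answer rather than a live cluster (with a real cluster you would pipe hadoop fs -ls itself):

```shell
# Extract the 5th field (size in bytes) from `hadoop fs -ls`-style output.
# On a live cluster: hadoop fs -ls input | awk '{print $5}'
line='-rw-r--r--   1 hduser supergroup      45956 2012-07-19 20:57 /user/hduser/input/sou'
printf '%s\n' "$line" | awk '{print $5}'
```

awk splits on runs of whitespace by default, so the column positions survive the variable-width padding in the listing.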
37

I also find myself using hadoop fs -dus <path> a great deal. For example, if a directory on HDFS named "/user/frylock/input" contains 100 files and you need the total size for all of those files you could run:

hadoop fs -dus /user/frylock/input

and you would get back the total size (in bytes) of all of the files in the "/user/frylock/input" directory.

Also, keep in mind that HDFS stores data redundantly so the actual physical storage used up by a file might be 3x or more than what is reported by hadoop fs -ls and hadoop fs -dus.

Mosaic answered 20/7, 2012 at 10:25 Comment(5)
Additionally to the last point: the replication factor is the number shown after the permission flags and before the owner (2nd column in @adhunavkulkarni's answer). - Urinate
hadoop fs -du -s <path> for newer versions. - Aubrette
Use hadoop fs -du -s -h /user/frylock/input for a much more readable output. - Sealey
@axiom it returns 10.5 G 31.6 G /path - what are these 2 sizes? - Renfroe
@DulangaHeshan - 10.5 G is the actual (raw) size of the file and 31.6 G is the space it consumes on disk including replication, which in this case is 3 (10.5 * 3 ≈ 31.6 G). - Sapota
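The arithmetic behind those two numbers can be sketched directly (assuming the default replication factor of 3; the 10.5 G raw size is taken from the comment above, so the result is approximate):

```shell
# Disk consumption ≈ raw size * replication factor.
# raw and repl are illustrative values, not read from a cluster.
raw=10.5
awk -v raw="$raw" -v repl=3 'BEGIN {printf "%.1f G\n", raw * repl}'
```

The small gap between 31.5 G computed here and the 31.6 G reported by the cluster comes from the raw size being rounded to one decimal place in the du output.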
16

I used the function below to get the file size:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetflStatus
{
    // Returns the total length in bytes of the file (or of all files
    // under the directory) at the given path.
    public long getflSize(String args) throws IOException
    {
        Configuration config = new Configuration();
        Path path = new Path(args);
        FileSystem hdfs = path.getFileSystem(config);
        ContentSummary cSummary = hdfs.getContentSummary(path);
        return cSummary.getLength();
    }
}
Mckinzie answered 18/3, 2014 at 16:31 Comment(2)
If this returns 7906, what is the size of that directory? Is it in bytes or in KB? - Parisi
@Parisi getLength() returns the size in bytes. - Asberry
9

The commands below pipe hadoop fs -du -s through an awk script to report the size (in GB) of filtered paths in HDFS:

hadoop fs -du -s /data/ClientDataNew/**A***  | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'

output ---> 2.089GB

hadoop fs -du -s /data/ClientDataNew/**B***  | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'

output ---> 1.724GB

hadoop fs -du -s /data/ClientDataNew/**C***  | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'

output ---> 0.986GB
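The byte-to-GB conversion used above can be sanity-checked in isolation by feeding the same awk script a known byte count instead of live du output:

```shell
# 2,089,000,000 bytes / 1e9 = 2.089 decimal gigabytes,
# matching the first output shown above.
printf '2089000000\n' | awk '{s+=$1} END {printf "%.3fGB\n", s/1000000000}'
```

Note this divides by 10^9 (decimal GB); dividing by 1024^3 would give binary gibibytes instead, which is what the -h flag reports.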

Praenomen answered 10/5, 2016 at 14:44 Comment(0)
6
hdfs dfs -du -s -h /directory

This is the human-readable version; without -h the sizes are printed as raw byte counts, which are much harder to read.

Sphenogram answered 5/2, 2019 at 19:31 Comment(1)
How can we check space usage per user or for user xxxx? - Aerostation
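Roughly what the -h flag does can be sketched with a small awk helper (human is a hypothetical name, not part of Hadoop, and real -h output may format slightly differently):

```shell
# Scale a byte count into B/K/M/G/T, dividing by 1024 per step.
human() {
  awk -v b="$1" 'BEGIN {
    split("B K M G T", u, " ")
    i = 1
    while (b >= 1024 && i < 5) { b /= 1024; i++ }
    printf "%.1f %s\n", b, u[i]
  }'
}
human 45956   # the sample file size from the first answer
```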
2

If you want to do it through the API, you can use the getFileStatus() method.

Pages answered 20/7, 2012 at 12:13 Comment(1)
It's not right: it doesn't return the file size, it returns the allocated block size, which won't be zero for empty files. The default is 67108864 (64 MB). - Tynes
1

If you want to know the size of each file inside a directory, add an asterisk '*' at the end:

hadoop fs -du -s -h /tmp/output/*

I hope this helps your purpose.

Cyrenaic answered 31/8, 2021 at 8:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.