java get file size efficiently
While googling, I see suggestions that java.io.File#length() can be slow. FileChannel has a size() method that is available as well.

Is there an efficient way in Java to get the file size?

Intoxicated answered 22/9, 2008 at 18:21 Comment(8)
can you provide the links saying that File.length() "can be slow"?Ozmo
sorry, here is the link: javaperformancetuning.com/tips/rawtips.shtml. Search for "File information such as File.length() requires a system call and can be slow." It's really a confusing statement; it seems almost assumed that it would be a system call.Intoxicated
Getting the file length will require a system call no matter how you do it. It may be slow if it's over a network or some other very slow filesystem. There is no faster way to get it than File.length(), and the definition of "slow" here just means don't call it unnecessarily.Unnumbered
I think that's what GHad was trying to test below. My results (on Ubuntu 8.04): for just one access, URL is fastest; for 5 runs and 50 iterations, CHANNEL is fastest. Confusing yet? :) For my purposes, though, I'll just be doing one access. It is strange that we got different results.Intoxicated
This operation can be very slow if the information is on disk rather than in cache (like 1000x slower). However, there is little you can do about this other than ensuring the information you need is always in cache (such as preloading it and having enough memory so it stays in memory).Marty
I would question the validity of relying on a document that was already 8/9 years old by the time this question was asked as a source for optimisation advice.Delmadelmar
There is a faster sample (Java 7 only): https://mcmap.net/q/16815/-get-size-of-folder-or-fileCanned
In the rare event that you're on Android, take a look at StatFs. It uses file system statistics and is nearly 1000x faster than recursive methods. Our implementation can be found here: https://mcmap.net/q/16816/-android-fast-and-efficient-way-of-finding-a-directory-sizeSubway
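A note on the Java 7 suggestion in the comments above: since Java 7, the java.nio.file API exposes the size directly. A minimal sketch of that approach (the default file path is a placeholder of my own, not from the question):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class NioSize {
    public static void main(String[] args) throws IOException {
        // Placeholder path; pass a real file as the first argument.
        Path p = Paths.get(args.length > 0 ? args[0] : "somefile.bin");

        // One metadata lookup; throws NoSuchFileException if the path is missing.
        System.out.println("size = " + Files.size(p));

        // If you also need dates, fetch all basic attributes in a single call
        // instead of stat-ing the file once per attribute.
        BasicFileAttributes attrs = Files.readAttributes(p, BasicFileAttributes.class);
        System.out.println("size = " + attrs.size() + ", modified = " + attrs.lastModifiedTime());
    }
}
```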
Well, I tried to measure it up with the code below:

For runs = 1 and iterations = 1, the URL method is fastest most times, followed by channel. I ran this fresh, with some pause, about 10 times. So for one-time access, using the URL is the fastest way I can think of:

LENGTH sum: 10626, per Iteration: 10626.0

CHANNEL sum: 5535, per Iteration: 5535.0

URL sum: 660, per Iteration: 660.0

For runs = 5 and iterations = 50 the picture looks different.

LENGTH sum: 39496, per Iteration: 157.984

CHANNEL sum: 74261, per Iteration: 297.044

URL sum: 95534, per Iteration: 382.136

File must be caching the calls to the filesystem, while channels and URL have some overhead.

Code:

import java.io.*;
import java.net.*;
import java.util.*;

public enum FileSizeBench {

    LENGTH {
        @Override
        public long getResult() throws Exception {
            File me = new File(FileSizeBench.class.getResource(
                    "FileSizeBench.class").getFile());
            return me.length();
        }
    },
    CHANNEL {
        @Override
        public long getResult() throws Exception {
            FileInputStream fis = null;
            try {
                File me = new File(FileSizeBench.class.getResource(
                        "FileSizeBench.class").getFile());
                fis = new FileInputStream(me);
                return fis.getChannel().size();
            } finally {
                if (fis != null) {
                    fis.close();
                }
            }
        }
    },
    URL {
        @Override
        public long getResult() throws Exception {
            InputStream stream = null;
            try {
                URL url = FileSizeBench.class
                        .getResource("FileSizeBench.class");
                stream = url.openStream();
                // Note: available() is only an estimate of the bytes readable
                // without blocking; it is not a reliable file length.
                return stream.available();
            } finally {
                if (stream != null) {
                    stream.close();
                }
            }
        }
    };

    public abstract long getResult() throws Exception;

    public static void main(String[] args) throws Exception {
        int runs = 5;
        int iterations = 50;

        EnumMap<FileSizeBench, Long> durations = new EnumMap<FileSizeBench, Long>(FileSizeBench.class);

        for (int i = 0; i < runs; i++) {
            for (FileSizeBench test : values()) {
                if (!durations.containsKey(test)) {
                    durations.put(test, 0L);
                }
                long duration = testNow(test, iterations);
                durations.put(test, durations.get(test) + duration);
                // System.out.println(test + " took: " + duration + ", per iteration: " + ((double)duration / (double)iterations));
            }
        }

        for (Map.Entry<FileSizeBench, Long> entry : durations.entrySet()) {
            System.out.println();
            System.out.println(entry.getKey() + " sum: " + entry.getValue() + ", per Iteration: " + ((double)entry.getValue() / (double)(runs * iterations)));
        }

    }

    private static long testNow(FileSizeBench test, int iterations)
            throws Exception {
        long result = -1;
        long before = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (result == -1) {
                result = test.getResult();
                //System.out.println(result);
            } else if (test.getResult() != result) {
                throw new Exception("variance detected!");
            }
        }
        return (System.nanoTime() - before) / 1000;
    }

}
Gradygrae answered 22/9, 2008 at 19:21 Comment(13)
interesting, here are my results (ubuntu 8.04) LENGTH sum: 97442, per Iteration: 97442.0 CHANNEL sum: 15789, per Iteration: 15789.0 URL sum: 522, per Iteration: 522.0 LENGTH sum: 127074, per Iteration: 508.296 CHANNEL sum: 51582, per Iteration: 206.328 URL sum: 61334, per Iteration: 245.336Intoxicated
Seems like the URL way is the best one to go for single access, whether it's XP or Linux. Greetz GHadGradygrae
stream.available() does not return the file length. It returns the number of bytes which are available for reading without blocking. It is not necessarily the same number of bytes as the file length. To get the real length from a stream, you really need to read it (and count the read bytes meanwhile).Lowercase
Good point and you are right, but I never experienced any difference for files, as I expect all bytes to be readable when I want to read a file this way. Well, at least if the size is less than Integer.MAX_VALUE.Gradygrae
@Gradygrae then you're doing it wrong. There is nothing in the API that specifies that behaviour. You are relying on luck.Thar
This benchmark, or rather its interpretation, is not correct. In the low-iteration-count run, the later tests take advantage of the operating system's file caching. In the higher-iteration run the ranking is correct, but not because File.length() caches something; it is simply because the other two options are based on the same method but do extra work that slows them down.Lights
I do not really think a system call could be cached; how would Java know when the file size is changed?Piave
@Paolo, caching and optimising file system access is one of the major responsibilities of an OS. faqs.org/docs/linux_admin/buffer-cache.html To get good benchmarking results, the cache should be cleared before each run.Buehler
@Buehler in the answer it is said that Java is caching the system call, not that the OS is caching system calls.Piave
While these numbers are interesting to an extent, I'm not sure they're all that useful without a more thorough understanding of what exactly is happening at every step of the way. It's not a real use case, and testing multiple ways of accessing information that requires reading from the disk in quick succession, without clearing all possible caches between the code and the hard drive, is bound to be unduly influenced by unpredicted factors. Micro-benchmarks are rife with pitfalls.Franci
Like BalusC mentioned: stream.available() is flawed in this case, because available() returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream.Paraboloid
Beyond what the javadoc for InputStream.available() says, the fact that the available() method returns an int should be a red flag against the URL approach. Try it with a 3GB file and it will be obvious that it is not a valid way to determine the file length.Thirddegree
Does it mean, though, that file.length() reads the whole file into memory as well? My question about speed is whether the OS stores the file length as a parameter, or whether you need to load the whole file into JVM memory to get its size.Ameliaamelie
The benchmark given by GHad measures lots of other stuff (such as reflection, instantiating objects, etc.) besides getting the length. If we try to get rid of those things, then for one call I get the following times in microseconds:

   file sum___19.0, per Iteration___19.0
    raf sum___16.0, per Iteration___16.0
channel sum__273.0, per Iteration__273.0

For 100 runs and 10000 iterations I get:

   file sum__1767629.0, per Iteration__1.7676290000000001
    raf sum___881284.0, per Iteration__0.8812840000000001
channel sum___414286.0, per Iteration__0.414286

I ran the following modified code, giving as an argument the name of a 100MB file.

import java.io.*;
import java.nio.channels.*;
import java.net.*;
import java.util.*;

public class FileSizeBench {

  private static File file;
  private static FileChannel channel;
  private static RandomAccessFile raf;

  public static void main(String[] args) throws Exception {
    int runs = 1;
    int iterations = 1;

    file = new File(args[0]);
    channel = new FileInputStream(args[0]).getChannel();
    raf = new RandomAccessFile(args[0], "r");

    HashMap<String, Double> times = new HashMap<String, Double>();
    times.put("file", 0.0);
    times.put("channel", 0.0);
    times.put("raf", 0.0);

    long start;
    for (int i = 0; i < runs; ++i) {
      long l = file.length();

      start = System.nanoTime();
      for (int j = 0; j < iterations; ++j)
        if (l != file.length()) throw new Exception();
      times.put("file", times.get("file") + System.nanoTime() - start);

      start = System.nanoTime();
      for (int j = 0; j < iterations; ++j)
        if (l != channel.size()) throw new Exception();
      times.put("channel", times.get("channel") + System.nanoTime() - start);

      start = System.nanoTime();
      for (int j = 0; j < iterations; ++j)
        if (l != raf.length()) throw new Exception();
      times.put("raf", times.get("raf") + System.nanoTime() - start);
    }
    for (Map.Entry<String, Double> entry : times.entrySet()) {
        System.out.println(
            entry.getKey() + " sum: " + 1e-3 * entry.getValue() +
            ", per Iteration: " + (1e-3 * entry.getValue() / runs / iterations));
    }
    channel.close();
    raf.close();
  }
}
Betseybetsy answered 23/9, 2008 at 6:18 Comment(2)
Actually, while you are correct in saying it measures other aspects, I should have been clearer in my question. I'm looking to get the file size of multiple files, and I want the quickest possible way, so I really do need to take into account object creation and overhead, since that is a real scenario.Intoxicated
About 90% of the time is spent in that getResource thing. I doubt you need to use reflection to get the name of a file that contains some Java bytecode.Betseybetsy
All the test cases in this post are flawed, as they access the same file for each method tested. So disk caching kicks in, which tests 2 and 3 benefit from. To prove my point, I took the test case provided by GHad and changed the order of enumeration; the results are below.

Looking at the results, I think File.length() really is the winner.

The order of the tests is the order of the output. You can even see that the time taken on my machine varied between executions, but File.length(), when it was not first and so did not incur the first disk access, won.

---
LENGTH sum: 1163351, per Iteration: 4653.404
CHANNEL sum: 1094598, per Iteration: 4378.392
URL sum: 739691, per Iteration: 2958.764

---
CHANNEL sum: 845804, per Iteration: 3383.216
URL sum: 531334, per Iteration: 2125.336
LENGTH sum: 318413, per Iteration: 1273.652

--- 
URL sum: 137368, per Iteration: 549.472
LENGTH sum: 18677, per Iteration: 74.708
CHANNEL sum: 142125, per Iteration: 568.5
Sayyid answered 22/3, 2011 at 1:2 Comment(0)
When I modify your code to use a file accessed by an absolute path instead of a resource, I get a different result (for 1 run, 1 iteration, and a 100,000-byte file; times for a 10-byte file are identical to those for 100,000 bytes):

LENGTH sum: 33, per Iteration: 33.0

CHANNEL sum: 3626, per Iteration: 3626.0

URL sum: 294, per Iteration: 294.0

Brunei answered 23/9, 2008 at 3:42 Comment(0)
In response to rgrig's benchmark, the time taken to open/close the FileChannel & RandomAccessFile instances also needs to be taken into account, as these classes will open a stream for reading the file.

After modifying the benchmark, I got these results for 1 iteration on an 85MB file:

file totalTime: 48000 (48 us)
raf totalTime: 261000 (261 us)
channel totalTime: 7020000 (7 ms)

For 10000 iterations on the same file:

file totalTime: 80074000 (80 ms)
raf totalTime: 295417000 (295 ms)
channel totalTime: 368239000 (368 ms)

If all you need is the file size, file.length() is the fastest way to do it. If you plan to use the file for other purposes like reading/writing, then RAF seems to be a better bet. Just don't forget to close the file connection :-)

import java.io.File;
import java.io.FileInputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;

public class FileSizeBench
{    
    public static void main(String[] args) throws Exception
    {
        int iterations = 1;
        String fileEntry = args[0];

        Map<String, Long> times = new HashMap<String, Long>();
        times.put("file", 0L);
        times.put("channel", 0L);
        times.put("raf", 0L);

        long fileSize;
        long start;
        long end;
        File f1;
        FileChannel channel;
        RandomAccessFile raf;

        for (int i = 0; i < iterations; i++)
        {
            // file.length()
            start = System.nanoTime();
            f1 = new File(fileEntry);
            fileSize = f1.length();
            end = System.nanoTime();
            times.put("file", times.get("file") + end - start);

            // channel.size()
            start = System.nanoTime();
            channel = new FileInputStream(fileEntry).getChannel();
            fileSize = channel.size();
            channel.close();
            end = System.nanoTime();
            times.put("channel", times.get("channel") + end - start);

            // raf.length()
            start = System.nanoTime();
            raf = new RandomAccessFile(fileEntry, "r");
            fileSize = raf.length();
            raf.close();
            end = System.nanoTime();
            times.put("raf", times.get("raf") + end - start);
        }

        for (Map.Entry<String, Long> entry : times.entrySet()) {
            System.out.println(entry.getKey() + " totalTime: " + entry.getValue() + " (" + getTime(entry.getValue()) + ")");
        }
    }

    public static String getTime(Long timeTaken)
    {
        if (timeTaken < 1000) {
            return timeTaken + " ns";
        } else if (timeTaken < (1000*1000)) {
            return timeTaken/1000 + " us"; 
        } else {
            return timeTaken/(1000*1000) + " ms";
        } 
    }
}
Declaim answered 26/11, 2009 at 13:18 Comment(0)
I ran into this same issue. I needed to get the file size and modified date of 90,000 files on a network share. Using Java, and being as minimalistic as possible, it would take a very long time. (I needed to get the URL from the file, and the path of the object as well, so it varied somewhat, but it took more than an hour.) I then used a native Win32 executable to do the same task, just dumping the file path, modified date, and size to the console, and executed that from Java. The speed was amazing: the native process, plus my string handling to read the data, could process over 1000 items a second.

So even though people downranked the above comment, this is a valid solution, and it did solve my issue. In my case I knew the folders I needed the sizes of ahead of time, and I could pass them on the command line to my Win32 app. I went from hours to process a directory to minutes.

The issue also seemed to be Windows specific. OS X did not have the same issue and could access network file info as fast as the OS could.

Java file handling on Windows is terrible for network shares. Local disk access for files is fine; it was just network shares that caused the terrible performance. Windows could get info on the network share and calculate the total size in under a minute, too.
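A rough sketch of the approach described above. The executable name (filescan.exe) and the pipe-delimited output format (path|sizeInBytes|lastModifiedMillis) are my assumptions for illustration, not the actual tool used; any native lister that prints one record per line would work the same way:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class NativeScan {

    // Each record is "path|sizeInBytes|lastModifiedMillis".
    static String[] parseRecord(String line) {
        String[] parts = line.split("\\|", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("bad record: " + line);
        }
        return parts;
    }

    // Launch the (hypothetical) native lister and collect its output.
    public static List<String[]> scan(String folder) throws IOException, InterruptedException {
        Process proc = new ProcessBuilder("filescan.exe", folder)
                .redirectErrorStream(true)
                .start();
        List<String[]> records = new ArrayList<String[]>();
        BufferedReader in = new BufferedReader(new InputStreamReader(proc.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            records.add(parseRecord(line));
        }
        proc.waitFor();
        return records;
    }
}
```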

--Ben

Uigur answered 2/4, 2011 at 3:25 Comment(0)
If you want the file sizes of multiple files in a directory, use Files.walkFileTree. You can obtain the size from the BasicFileAttributes that you'll receive.

This is much faster than calling .length() on the result of File.listFiles() or using Files.size() on the result of Files.newDirectoryStream(). In my test cases it was about 100 times faster.
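A minimal sketch of that approach, summing the sizes of all regular files under a directory from the attributes the walk has already fetched (the class and method names are mine):

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

public class DirSize {

    // Sum file sizes using the attributes the walk already fetched,
    // avoiding a second metadata lookup per file.
    public static long sizeOfTree(Path root) throws IOException {
        final AtomicLong total = new AtomicLong();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total.addAndGet(attrs.size());
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip unreadable entries instead of aborting the walk.
                return FileVisitResult.CONTINUE;
            }
        });
        return total.get();
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(sizeOfTree(root) + " bytes");
    }
}
```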

Exhibitionist answered 23/1, 2014 at 12:0 Comment(1)
FYI, Files.walkFileTree is available on Android 26+.Subway
Actually, I think "ls" may be faster. There are definitely some issues in Java with getting file info. Unfortunately there is no equivalent safe method of recursive ls for Windows (cmd.exe's DIR /S can get confused and generate errors in infinite loops).

On XP, accessing a server on the LAN, it takes me 5 seconds in Windows to get the count of the files in a folder (33,000) and the total size.

When I iterate recursively through this in Java, it takes me over 5 minutes. I started measuring the time it takes to do file.length(), file.lastModified(), and file.toURI(), and what I found is that 99% of my time is taken by those 3 calls: the 3 calls I actually need to do...

The difference for 1000 files is 15ms locally versus 1800ms on the server. The server path scanning in Java is ridiculously slow. If the native OS can be fast at scanning that same folder, why can't Java?

As a more complete test, I used WinMerge on XP to compare the modified dates and sizes of the files on the server versus the files locally, iterating over the entire directory tree of 33,000 files in each folder. Total time: 7 seconds. Java: over 5 minutes.

So the original statement and question from the OP are true and valid. It's less noticeable when dealing with a local file system. Doing a local compare of the folder with 33,000 items takes 3 seconds in WinMerge and 32 seconds in Java. So again, Java versus native is a 10x slowdown in these rudimentary tests.

Java 1.6.0_22 (latest), gigabit LAN, and network connections; ping is less than 1ms (both on the same switch).

Java is slow.

Uigur answered 17/11, 2010 at 7:40 Comment(1)
This also appears to be OS specific. Doing the same Java app going after the same folder from OS X using Samba, it took 26 seconds to list the entire 33,000 items, sizes, and dates. So is network Java just slow on Windows, then? (OS X was Java 1.6.0_22 also.)Uigur
From GHad's benchmark, there are a few issues people have mentioned:

1> As BalusC mentioned: stream.available() is flawed in this case,

because available() returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream.

So first, remove the URL approach.

2> As StuartH mentioned, the order the tests run in also makes a caching difference, so take that out by running each test separately.


Now start test:

When CHANNEL one run alone:

CHANNEL sum: 59691, per Iteration: 238.764

When LENGTH one run alone:

LENGTH sum: 48268, per Iteration: 193.072

So it looks like LENGTH is the winner here:

@Override
public long getResult() throws Exception {
    File me = new File(FileSizeBench.class.getResource(
            "FileSizeBench.class").getFile());
    return me.length();
}
Paraboloid answered 17/10, 2013 at 14:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.