File Streaming in Java

I'm currently developing a 3D graphics application using JOGL (the Java OpenGL binding). In brief, I have a huge binary landscape file. Due to its size, I have to stream terrain chunks at run time (perhaps multi-threaded later), so random access is the explicit concern. I have already finished the first (and dirty :)) implementation, where I'm using a foolish approach... Here is its initialization:

dataInputStream = new DataInputStream(new BufferedInputStream(fileInputStream, 4 * 1024));
dataInputStream.mark(dataInputStream.available());

And when I need to read (stream) a particular chunk (I already know its offset in the file), I'm performing the following (shame on me :)):

dataInputStream.reset();
dataInputStream.skipBytes(offset);
dataInputStream.read(whatever I need...);
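For contrast, a `RandomAccessFile.seek` call jumps straight to an offset without the `reset()`/`skipBytes()` scan. A minimal sketch, using a small throwaway file in place of the real terrain file (the offset and chunk size are placeholders):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RafSeekDemo {
    public static void main(String[] args) throws IOException {
        // A small throwaway file standing in for the terrain file.
        File f = File.createTempFile("terrain", ".bin");
        f.deleteOnExit();
        byte[] data = new byte[256];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);
        }

        // seek() jumps straight to the chunk offset -- no reset()/skipBytes() scan.
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            long offset = 100;           // hypothetical chunk offset
            byte[] chunk = new byte[4];  // hypothetical chunk size
            raf.seek(offset);
            raf.readFully(chunk);
            System.out.println(chunk[0] + " " + chunk[3]);
        }
    }
}
```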

Since I had little experience, that was the first thing I could think of :) So far I have read 3 useful and quite interesting articles (I suggest you read them if you are interested in this topic):

  1. Byte Buffers and Non-Heap Memory - Mr. Gregory seems to be well-versed in Java NIO.

  2. Java tip: How to read files quickly [http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly] - That's an interesting benchmark.

  3. Articles: Tuning Java I/O Performance [http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/] - Simple Sun recommendations, but please scroll down and have a look at "Random Access" section there; they show a simple implementation of RandomAccessFile (RAF) with self-buffering improvement.
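The self-buffering `RandomAccessFile` from the third article isn't reproduced here, but the idea can be sketched roughly like this (the class name, buffer size, and single-byte API are my own simplifications, not the article's code):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the "self-buffering" idea: keep a small window of the file in
// memory and only seek()/read() when a request falls outside that window.
class BufferedRaf {
    private final RandomAccessFile raf;
    private final byte[] buf;
    private long bufStart = 0;
    private int bufLen = 0;

    BufferedRaf(RandomAccessFile raf, int bufSize) {
        this.raf = raf;
        this.buf = new byte[bufSize];
    }

    // Read one byte at an absolute file offset, refilling the window on a miss.
    int readAt(long pos) throws IOException {
        if (pos < bufStart || pos >= bufStart + bufLen) {
            raf.seek(pos);
            bufLen = raf.read(buf, 0, buf.length);
            bufStart = pos;
            if (bufLen <= 0) return -1; // end of file
        }
        return buf[(int) (pos - bufStart)] & 0xFF;
    }
}

public class BufferedRafDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("terrain", ".bin");
        f.deleteOnExit();
        byte[] data = new byte[1024];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 128);
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);
        }
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            BufferedRaf braf = new BufferedRaf(raf, 64);
            System.out.println(braf.readAt(200)); // miss: seeks and fills the window
            System.out.println(braf.readAt(201)); // hit: served from the buffer
        }
    }
}
```

This avoids a kernel call per access for locality-friendly workloads, which is exactly where the plain `seek`-per-read pattern suffers.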

Mr. Gregory provides several *.java files at the end of his article. One of them is a benchmark comparing FileChannel + ByteBuffer + mapping (FBM) against RAF. He says he noticed a 4x speedup when using FBM compared to RAF. I ran this benchmark under the following conditions:

  1. The offset (i.e. the place of access) is generated randomly (within the file scope, i.e. 0 - file.length());
  2. File size is 220MB;
  3. 1,000,000 accesses (75% reads and 25% writes).

The results were stunning:

~ 28 sec for RAF!
~ 0.2 sec for FBM!

However, his RAF implementation in this benchmark doesn't have self-buffering (the 3rd article describes one), so I guess it is the "RandomAccessFile.seek" method call that drops performance so hard.

Ok, now after all those things I've learnt there is 1 question and 1 dilemma :)

Question: When we map a file using "FileChannel.map", does Java copy the whole file contents into the MappedByteBuffer? Or does it just emulate it? If it copies, then the FBM approach is not suitable for my situation, is it?

Dilemma: Depends on your answer to the question...

  1. If mapping copies the file, then it seems I have only 2 possible solutions: RAF + self-buffering (the one from the 3rd article), or making use of position on a FileChannel (without mapping)... Which one would be better?

  2. If mapping doesn't copy the file, then I have 3 options: the two previous ones and FBM itself.
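For reference, the position-based FileChannel option (no mapping) boils down to a positioned read; a minimal sketch with a placeholder file, offset and chunk size:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionedReadDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("terrain", ".bin");
        f.toFile().deleteOnExit();
        byte[] data = new byte[512];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(f, data);

        // FileChannel.read(ByteBuffer, position) is a positioned read: it does
        // not move the channel's own position, so it is safe to call from
        // several threads streaming different chunks of the same file.
        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
            ByteBuffer chunk = ByteBuffer.allocate(4); // hypothetical chunk size
            ch.read(chunk, 300);                       // hypothetical chunk offset
            chunk.flip();
            System.out.println(chunk.get(0) + " " + chunk.get(3));
        }
    }
}
```

The thread-safety of positioned reads is worth noting for a streaming loader that may become multi-threaded.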

Edit: Here is one more question. Some of you say that mapping doesn't copy the file into the MappedByteBuffer. Ok then, why can't I map a 1GB file? I'm getting a "failed to map" message...

P. S. I would like to receive a complete answer with advice, since I'm not able to find consistent information on this topic on the internet.

Thanks :)

Osterhus answered 18/1, 2011 at 20:11 Comment(0)

No, the data is not buffered. A MappedByteBuffer references the data through a pointer. In other words, the data is not copied; it is simply mapped into the process's address space. See the API docs if you haven't already.

A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk, but can also be a device, shared memory object, or other resource that the operating system can reference through a file descriptor. Once present, this correlation between the file and the memory space permits applications to treat the mapped portion as if it were primary memory.

Source: Wikipedia

If you are going to be reading data quite frequently, it is a good idea to at least cache some of it.
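To illustrate, a minimal mapping sketch (the temp file and offset are placeholders): `map()` only establishes the correlation, and pages are faulted in on first access rather than copied up front.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("terrain", ".bin");
        f.toFile().deleteOnExit();
        byte[] data = new byte[1024];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(f, data);

        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
            // map() sets up the correlation; no bulk copy happens here.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(map.get(500) & 0xFF); // random access, no seek
        }
    }
}
```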

Tiresome answered 18/1, 2011 at 21:15 Comment(7)
If you say that a MappedByteBuffer is a pointer to the HD, then how does it reach such good results in the benchmark? The only possible speedup technique in IO that I personally know is to access the disk as little as possible, and the only solution here is buffering. Again, if you are literate enough on this concern, please be more detailed.Osterhus
@Haroogan I quote from that article: "the difference is almost entirely due to kernel context switches"Tiresome
You must be joking by referring me to the javadoc, aren't you? Because the particular information I'm asking for is not there. I still haven't got any direct answers or proper ideas and comments on possible solutions.Osterhus
@Haroogan First of all, open your eyes. My answer is sufficient, considering all you wanted to know was whether "mapping copies a file". The very first line of the javadoc says that the data is memory-mapped. You should have asked me what that meant instead of calling my answer a joke. The rest of my answer expounds on what it is, anyway. Moreover, I have given an additional suggestion on how to optimise.Tiresome
It seems like you just don't get it. I'll try again with a primitive question; just say yes or no. If I map a 1GB file, then the MappedByteBuffer capacity = 1GB, so does the MappedByteBuffer really occupy 1GB of RAM, or does it just emulate it?Osterhus
My own experience tells me that it attempts to occupy 1GB of RAM, since I'm not able to map a 1GB file: "out of memory: map failed" exception! If I am wrong, correct me; just stop referring me to useless docs. All those words in the javadocs are not backed up with enough information. Moreover, the javadoc is just a quick help to tell you about the right usage of a class in Java; it is not a guide which explains what happens behind the scenes! The word MAPPING in the javadoc tells me nothing about its real-life mechanism. I hope you got it now.Osterhus
@Haroogan I do get it, it's just that I wanted to make sure you understood the concepts. It's hard to answer your question with a "yes" or "no", so I will link you to this article: en.wikipedia.org/wiki/Memory-mapped_file. It is detailed, and also addresses why you got an out-of-memory exception.Tiresome

For a 220 MB file I would memory map the whole thing into virtual memory. The reason FBM is so fast is that it doesn't actually read the data into memory; it just makes it available.

Note: when you run the test you need to compare like for like, i.e. when the file is in the OS cache it will be much faster no matter how you do it. You need to repeat the test multiple times to get a reproducible result.

Thane answered 18/1, 2011 at 20:35 Comment(5)
What do you mean by "available"? There can be only 2 options: the file is fully copied into the MappedByteBuffer (max size is 2GB on 32-bit systems), or the MappedByteBuffer just emulates the file using background buffering, predictive logic or whatever... Since I have tried to map a 1GB file and it failed to do so, I have to conclude that mapping seems to copy the whole file into the MappedByteBuffer... or am I still wrong? Please be more detailed in your answers.Osterhus
When mapping, the OS maps the file into virtual memory. The pages (typically 4KB) of the file are brought into memory when you read/write them and are flushed back to disk lazily (or when you force a flush). There is no way you could read a 220 MB file into memory in 0.2 seconds. I am not sure why a 1 GB file failed to be mapped unless you are using a 32-bit JVM.Thane
Yep, I'm using a 32-bit JVM, therefore I don't understand why mapping a 1GB file fails... any ideas? Currently I'm interested only in reading, so I don't need flushing etc. You just said that the OS loads 4KB pages into virtual memory, but you see, that's what I said before, i.e. the MappedByteBuffer just emulates the file using slow background buffering logic which I can't control. Right?Osterhus
A 32-bit JVM on a 32-bit OS can only use about 1.2 to 1.5 GB of virtual memory. A 32-bit JVM on a 64-bit OS can access more; on Solaris it can access 3.5 GB. The largest 64-bit JVM I have seen can access 768 GB. Neither matches the theoretical limit, but you can see that a 64-bit JVM is the right tool for the job. You can control where you access the file and in what order; the amount of the file you can have in memory is limited by your hardware, and so is the speed at which it can read the file.Thane
Java uses the underlying OS mapping. It doesn't say how it does this because it depends on the OS you are using. If you want to know how your OS does it, you need to read the documentation for your OS.Thane

Have you noticed that if you run a program, then close it, then run it again, it starts up much faster the second time? This happens because the OS has cached the parts of the files that were accessed during the first run and doesn't need to go to disk for them. Memory mapping a file essentially gives a program access to these buffers, thus minimizing the copies made when reading it. Note that memory mapping a file does not cause it to be read whole into memory; the bits and pieces that you read are read from disk on demand. If the OS determines that memory is low, it may decide to free some parts of the mapped file from memory and leave them on disk.

Edit: What you want is FileInputStream.getChannel().map(), then adapt that to an InputStream, then connect that to the DataInputStream.
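Sketching that suggestion: the JDK has no built-in ByteBuffer-to-InputStream bridge, so the small adapter class below is my own illustration, not a standard API.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical adapter that exposes a ByteBuffer as an InputStream so it can
// feed a DataInputStream.
class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buf;
    ByteBufferInputStream(ByteBuffer buf) { this.buf = buf; }
    @Override public int read() {
        return buf.hasRemaining() ? buf.get() & 0xFF : -1;
    }
    @Override public int read(byte[] b, int off, int len) {
        if (!buf.hasRemaining()) return -1;
        int n = Math.min(len, buf.remaining());
        buf.get(b, off, n);
        return n;
    }
}

public class MappedDataInputDemo {
    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("terrain", ".bin");
        f.toFile().deleteOnExit();
        // One big-endian int, the byte order DataInputStream expects.
        Files.write(f, new byte[] {0, 0, 1, 42});

        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            DataInputStream in = new DataInputStream(new ByteBufferInputStream(map));
            System.out.println(in.readInt());
        }
    }
}
```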

Shultz answered 19/1, 2011 at 22:2 Comment(0)
