Buffered RandomAccessFile java
RandomAccessFile is quite slow for random access to a file. You often read about implementing a buffered layer over it, but code doing this isn't possible to find online.

So my question is: does anyone know of an open-source implementation of such a class, or would you share a pointer or your own implementation?

It would be nice if this question turned out to be a collection of useful links and code about this problem, which I'm sure is shared by many and was never addressed properly by Sun.

Please, no reference to MemoryMapping, as files can be way bigger than Integer.MAX_VALUE.

Leflore answered 10/4, 2011 at 19:38 Comment(5)
Let me see if I understand: you mean that java.nio.MappedByteBuffer is not good enough because it can only hold Integer.MAX_VALUE bytes. Is that so?Dona
That's around 2 gigabytes of memory in a buffer. How big is your file and how much memory do you have available?Dona
What/how do you want to buffer? Usually you are buffering a stream, but if you want to access an arbitrary point in a multi-gig file, what data exactly do you want to store? My guess is that the answer to that will give you your solution (e.g. "I always want to preload the 1K of data after the random point).Engelhardt
@edalorzo: yes, that's the problem. My files are tens of GIGs.Leflore
@Will: Yes, that's the most typical idea: read-ahead behavior. I have records composed of a header and some payload. I read ints, longs and shorts for the fields composing my header, and some of those fields contain the sizes of the payload chunks that come next. So it's many read*() calls and some read(byte[])s. It's mostly a header+payload scenario. The kind of implementation I have in mind is not that different from adding BufferedInputStream-like behavior.Leflore

You can make a BufferedInputStream from a RandomAccessFile with code like,

 RandomAccessFile raf = ...
 FileInputStream fis = new FileInputStream(raf.getFD());
 BufferedInputStream bis = new BufferedInputStream(fis);

Some things to note

  1. Closing the FileInputStream will close the RandomAccessFile and vice versa
  2. The RandomAccessFile and FileInputStream point to the same position, so reading from the FileInputStream will advance the file pointer for the RandomAccessFile, and vice versa

Probably the way you want to use this would be something like,

RandomAccessFile raf = ...
FileInputStream fis = new FileInputStream(raf.getFD());
BufferedInputStream bis = new BufferedInputStream(fis);

//do some reads with buffer
bis.read(...);
bis.read(...);

//seek to a different section of the file, so discard the previous buffer
raf.seek(...);
bis = new BufferedInputStream(fis);
bis.read(...);
bis.read(...);
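The header+payload pattern from the question fits naturally on top of this: wrapping the BufferedInputStream in a DataInputStream gives buffered readInt()/readShort()/readFully(). A runnable sketch (the record layout here — int id, short flags, length-prefixed payload — is invented for illustration):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class HeaderPayloadReader {

    // Writes one sample record: int id, short flags, int payloadLength, payload bytes.
    static File writeSample() throws IOException {
        File f = File.createTempFile("records", ".bin");
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(42);                                   // header field: id
            out.writeShort(7);                                  // header field: flags
            byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);                       // header field: payload size
            out.write(payload);                                 // payload
        }
        return f;
    }

    // Seeks with the RandomAccessFile, then does buffered, typed reads through a
    // DataInputStream layered on the same file descriptor.
    static String readRecord(File f) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(0); // position at the record of interest
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(raf.getFD())));
            int id = in.readInt();
            short flags = in.readShort();
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return id + " " + flags + " " + new String(payload, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        File f = writeSample();
        System.out.println(readRecord(f)); // prints: 42 7 hello
        f.delete();
    }
}
```

Note that closing the RandomAccessFile also invalidates the streams built on its descriptor, per point 1 above, so the DataInputStream needs no separate close here.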
Ciro answered 14/12, 2013 at 6:7 Comment(2)
I took a similar approach, using the getFD method. But instead of building a BufferedInputStream, I built a FileReader and then a BufferedReader. That gives me access to a readLine method that is faster (and maybe more UTF friendly?) than the one provided by RandomAccessFile.Inebriate
@JeffTerrellPh.D. I tried BufferedReader and noticed that RandomAccessFile.getFilePointer method returns same position even after making multiple calls to BufferedReader.readLine() method. This is probably because BufferedReader might be advancing file pointer internally far ahead in a single call to readLine().Subsidence

Well, I do not see a reason not to use java.nio.MappedByteBuffer even if the files are bigger than Integer.MAX_VALUE.

Evidently you will not be allowed to define a single MappedByteBuffer for the whole file. But you could have several MappedByteBuffers accessing different regions of the file.

The position and size parameters of FileChannel.map are of type long, which means you can provide values over Integer.MAX_VALUE; the only thing you have to take care of is that the size of each buffer is not bigger than Integer.MAX_VALUE.

Therefore, you could define several maps like this:

buffer[0] = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0L,          Integer.MAX_VALUE);
buffer[1] = fileChannel.map(FileChannel.MapMode.READ_WRITE, 2147483647L, Integer.MAX_VALUE);
buffer[2] = fileChannel.map(FileChannel.MapMode.READ_WRITE, 4294967294L, Integer.MAX_VALUE);
...

In summary, the size cannot be bigger than Integer.MAX_VALUE, but the start position can be anywhere in your file.
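A minimal sketch of that multi-region idea: map the file as an array of fixed-size MappedByteBuffers and pick the right one per offset. The region size is shrunk to 4 bytes here so the example runs against a tiny temp file; for multi-GB files you would use regions near Integer.MAX_VALUE, and overlap them if records can straddle a region boundary.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class ChunkedMapper {

    private final MappedByteBuffer[] chunks;
    private final long chunkSize;

    // Maps the whole file as a series of fixed-size regions.
    ChunkedMapper(FileChannel channel, long chunkSize) throws IOException {
        this.chunkSize = chunkSize;
        long size = channel.size();
        chunks = new MappedByteBuffer[(int) ((size + chunkSize - 1) / chunkSize)];
        for (int i = 0; i < chunks.length; i++) {
            long start = i * chunkSize;
            long length = Math.min(chunkSize, size - start); // last region may be short
            chunks[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, length);
        }
    }

    // Reads the byte at an absolute file offset by picking the right region.
    byte get(long pos) {
        return chunks[(int) (pos / chunkSize)].get((int) (pos % chunkSize));
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("mapped", ".bin");
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.write("abcdefghij".getBytes(StandardCharsets.US_ASCII)); // 10 bytes
            ChunkedMapper m = new ChunkedMapper(raf.getChannel(), 4);    // tiny 4-byte regions
            System.out.println((char) m.get(0)); // prints: a
            System.out.println((char) m.get(9)); // prints: j  (from the third region)
        }
        f.delete();
    }
}
```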

In the Book Java NIO, the author Ron Hitchens states:

Accessing a file through the memory-mapping mechanism can be far more efficient than reading or writing data by conventional means, even when using channels. No explicit system calls need to be made, which can be time-consuming. More importantly, the virtual memory system of the operating system automatically caches memory pages. These pages will be cached using system memory and will not consume space from the JVM's memory heap.

Once a memory page has been made valid (brought in from disk), it can be accessed again at full hardware speed without the need to make another system call to get the data. Large, structured files that contain indexes or other sections that are referenced or updated frequently can benefit tremendously from memory mapping. When combined with file locking to protect critical sections and control transactional atomicity, you begin to see how memory mapped buffers can be put to good use.

I really doubt that you will find a third-party API doing something better than that. Perhaps you may find an API written on top of this architecture to simplify the work.

Don't you think that this approach ought to work for you?

Dona answered 11/4, 2011 at 15:29 Comment(4)
Good approach, but you should have overlapping buffers so that you can read records that are on a 2G boundary.Proconsulate
that is a possible solution and was going to ask in another question. an efficient way to wrap multiple mappedbytebuffers for big files. here i was more looking for a buffered approach, something like github.com/apache/cassandra/blob/trunk/src/java/org/apache/… or minddumped.blogspot.com/2009/01/…Leflore
minddumped.blogspot.com/2009/01/… is good! thanks marcorossiSmokedry
You can't close or expand the file when using that form of mapped file; you could use larray. If you need to be able to expand the file in a portable way, see my answer below.Dictation

Apache PDFBox project has a nice and tested BufferedRandomAccessFile class.
Licensed under the Apache License, Version 2.0

It is an optimized version of the java.io.RandomAccessFile class as described by Nick Zhang on JavaWorld.com. Based on jmzreader implementation and augmented to handle unsigned bytes.

The source code is here:

UPDATE 2024.01.24:

In May 2022, in commit a1ea618, BufferedRandomAccessFile was replaced by RandomAccessReadBufferedFile in the PDFBox project (PDFBOX-5434).
Same idea, somewhat different implementation. See the source code here:

Saltish answered 30/3, 2021 at 15:25 Comment(3)
Thank you! This is super useful. You don't need to pull in the entire pdfbox project from Maven (mvnrepository.com/artifact/org.apache.pdfbox/pdfbox); in fact, BufferedRandomAccessFile and its package seem to have disappeared from the project. Just this one class is self-contained, and for me it worked right away. Note that the final readLine() is supplemented with an additional, faster String getNextLine().Ravage
They replaced it with RandomAccessReadBufferedFile: github.com/apache/pdfbox/blob/3155b9e/io/src/main/java/org/… The implementation is somewhat different but can still be very useful.Saltish
Thank you. I took the one I found from the JavaWorld article and added EOFException to getNextLine(), in github.com/SensorsINI/jaer/blob/master/src/net/sf/jaer/util/… . It seems to work for playing big event camera recordings quite well on linux and windows: youtube.com/watch?v=I5jdMzXWrbU . I use 10MB buffers (but did not check different sizes).Ravage

RandomAccessFile is quite slow for random access to a file. You often read about implementing a buffered layer over it, but code doing this isn't possible to find online.

Well, it is possible to find online.
For one, the JAI jpeg2000 source code has an implementation, and there is an even less encumbered one at: http://www.unidata.ucar.edu/software/netcdf-java/

javadocs:

http://www.unidata.ucar.edu/software/thredds/v4.3/netcdf-java/v4.0/javadoc/ucar/unidata/io/RandomAccessFile.html

Prologize answered 23/1, 2012 at 23:40 Comment(2)
if your files are in the GB range you will certainly notice a speedup with memory mapped files. the buffered RandomAccessFile impl I mentioned is excellent for small files, and also low mem requirements. Memory mapped files take up lots of RAM to do their wizardry.Prologize
with the only problem that i have to depend on a whole library for a class. that's the problem. still, thanks for the links.Leflore

If you're running on a 64-bit machine, then memory-mapped files are your best approach. Simply map the entire file into an array of equal-sized buffers, then pick a buffer for each record as needed (i.e., edalorzo's answer; however, you want overlapping buffers so that you don't have records that span a boundary).

If you're running on a 32-bit JVM, then you're stuck with RandomAccessFile. However, you can use it to read a byte[] that contains your entire record, then use a ByteBuffer to retrieve individual values from that array. At worst you should need to make two file accesses: one to retrieve the position/size of the record, and one to retrieve the record itself.
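The two-access pattern described above can be sketched like this (the record layout, a length-prefixed record at a known offset, is invented for illustration):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

public class RecordReader {

    // Reads a length-prefixed record at the given offset in two file accesses:
    // one for the 4-byte length header, one readFully for the record body.
    // Individual fields are then pulled from the in-memory ByteBuffer.
    static ByteBuffer readRecord(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        int length = raf.readInt();     // access 1: record size
        byte[] record = new byte[length];
        raf.readFully(record);          // access 2: the whole record
        return ByteBuffer.wrap(record);
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("rec", ".bin");
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.writeInt(12);           // record length: one long + one int
            raf.writeLong(123456789L);  // field 1
            raf.writeInt(42);           // field 2
            ByteBuffer buf = readRecord(raf, 0);
            System.out.println(buf.getLong() + " " + buf.getInt()); // prints: 123456789 42
        }
        f.delete();
    }
}
```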

However, be aware that you can start stressing the garbage collector if you create lots of byte[]s, and you'll remain IO-bound if you bounce all over the file.

Proconsulate answered 11/4, 2011 at 16:4 Comment(7)
@Proconsulate I am certainly not an expert on the subject and therefore I feel really intrigued on why you say that if it is 64-bit machine memory-mapped files are the best approach. Do you say it because of the memory addressing limitations of a 32-bit hardware architecture or any other particular reason?Dona
@edalorzo - it's due to the limitations of 32-bit hardware. On a 64-bit machine your virtual address space is large enough to map the entire file. On a 32-bit machine you'd have to constantly remap portions of the file, and you may run into GC issues (mapped files are unmapped by the garbage collector, which should unmap one file so that you have room to map another, but may do a full collection while doing so).Proconsulate
yes, i was exactly looking for something like your 32-bit solution. look at my comment to edalorzo. the first one is kind of a problem: memory-mapping many different locations for small reads (compared to the size and cost of mmapping) doesn't make much sense.Leflore
@marcorossi: you wouldn't map portions of the file when you read them, you'd map the whole file. This might help you: kdgcommons.svn.sourceforge.net/viewvc/kdgcommons/trunk/src/main/…Ixia
@kdgregory: that looks interesting, though i can't memory-map 100+GIG files. Plus, how do you handle overlapping data between buffers? It doesn't seem like you handle that case.Leflore
@marcorossi: why can't you memory map 100+ G? If you have a 64-bit processor, OS, and JVM, you should have no problems. As for the overlapping buffers, that's the whole point of that class: you get up to 1Gb overlap between buffers.Ixia
Updated link to MappedFileBuffer: sourceforge.net/p/kdgcommons/code/HEAD/tree/trunk//src/main/…Shotten
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Adds caching to a random access file.
 * 
 * Rather than writing straight through to disk or to the OS (which is what
 * RandomAccessFile/FileChannel appear to do), keep a small buffer and read/write
 * through it when possible. A single buffer is used, so reads and writes near
 * each other are sped up; reads and writes that fall outside the cached block
 * are not.
 */
public class BufferedRandomAccessFile implements AutoCloseable {

    private static final int DEFAULT_BUFSIZE = 4096;

    /**
     * The wrapped random access file, we will hold a cache around it.
     */
    private final RandomAccessFile raf;

    /**
     * The size of the buffer
     */
    private final int bufsize;

    /**
     * The buffer.
     */
    private final byte[] buf;


    /**
     * Current position in the file.
     */
    private long pos = 0;

    /**
     * When the buffer has been read, this tells us where in the file the buffer
     * starts at.
     */
    private long bufBlockStart = Long.MAX_VALUE;


    // Must be updated on write to the file
    private long actualFileLength = -1;

    boolean changeMadeToBuffer = false;

    // Must be updated as we write to the buffer.
    private long virtualFileLength = -1;

    public BufferedRandomAccessFile(File name, String mode) throws FileNotFoundException {
        this(name, mode, DEFAULT_BUFSIZE);
    }

    /**
     * 
     * @param file
     * @param mode how to open the random access file.
     * @param b size of the buffer
     * @throws FileNotFoundException
     */
    public BufferedRandomAccessFile(File file, String mode, int b) throws FileNotFoundException {
        this(new RandomAccessFile(file, mode), b);
    }

    public BufferedRandomAccessFile(RandomAccessFile raf) throws FileNotFoundException {
        this(raf, DEFAULT_BUFSIZE);
    }

    public BufferedRandomAccessFile(RandomAccessFile raf, int b) {
        this.raf = raf;
        try {
            this.actualFileLength = raf.length();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        this.virtualFileLength = actualFileLength;
        this.bufsize = b;
        this.buf = new byte[bufsize];
    }

    /**
     * Sets the position of the byte at which the next read/write should occur.
     * 
     * @param pos
     * @throws IOException
     */
    public void seek(long pos) throws IOException{
        this.pos = pos;
    }

    /**
     * Sets the length of the file.
     */
    public void setLength(long fileLength) throws IOException {
        this.raf.setLength(fileLength);
        if(fileLength < virtualFileLength) {
            virtualFileLength = fileLength;
        }
    }

    /**
     * Writes the entire buffer to disk, if needed.
     */
    private void writeBufferToDisk() throws IOException {
        if(!changeMadeToBuffer) return;
        int amountOfBufferToWrite = (int) Math.min((long) bufsize, virtualFileLength - bufBlockStart);
        if(amountOfBufferToWrite > 0) {
            raf.seek(bufBlockStart);
            raf.write(buf, 0, amountOfBufferToWrite);
            this.actualFileLength = virtualFileLength;
        }
        changeMadeToBuffer = false;
    }

    /**
     * Flush the buffer to disk and force a sync.
     */
    public void flush() throws IOException {
        writeBufferToDisk();
        this.raf.getChannel().force(false);
    }

    /**
     * Based on pos, ensures that the buffer is one that contains pos
     * 
     * After this call it will be safe to write to the buffer to update the byte at pos,
     * if this returns true reading of the byte at pos will be valid as a previous write
     * or set length has caused the file to be large enough to have a byte at pos.
     * 
     * @return true if the buffer contains data that may be read at pos. Data may be read
     * as long as a write, or a setLength call, has made the file length greater than the
     * current position.
     */
    private boolean readyBuffer() throws IOException {
        boolean isPosOutSideOfBuffer = pos < bufBlockStart || bufBlockStart + bufsize <= pos;

        if (isPosOutSideOfBuffer) {

            writeBufferToDisk();

            // The buffer is always positioned to start at a multiple of a bufsize offset.
            // e.g. for a buf size of 4 the starting positions of buffers can be at 0, 4, 8, 12..
            // Work out where the buffer block should start for the given position. 
            long bufferBlockStart = (pos / bufsize) * bufsize;

            assert bufferBlockStart >= 0;

            // If the file is large enough, read it into the buffer.
            // if the file is not large enough we have nothing to read into the buffer,
            // In both cases the buffer will be ready to have writes made to it.
            if(bufferBlockStart < actualFileLength) {
                raf.seek(bufferBlockStart);
                // raf.read(buf) may return fewer bytes than requested, so loop
                // until the buffer is full or end-of-file is reached.
                int n = 0;
                while (n < bufsize) {
                    int r = raf.read(buf, n, bufsize - n);
                    if (r < 0) break;
                    n += r;
                }
            }

            bufBlockStart = bufferBlockStart;
        }

        return pos < virtualFileLength;
    }

    /**
     * Reads a byte from the file, returning an integer of 0-255, or -1 if it has reached the end of the file.
     * 
     * @return
     * @throws IOException 
     */
    public int read() throws IOException {
        if(readyBuffer() == false) {
            return -1;
        }
        try {
            return (buf[(int)(pos - bufBlockStart)]) & 0x000000ff ; 
        } finally {
            pos++;
        }
    }

    /**
     * Write a single byte to the file.
     * 
     * @param b
     * @throws IOException
     */
    public void write(byte b) throws IOException {
        readyBuffer(); // ignore result we don't care.
        buf[(int)(pos - bufBlockStart)] = b;
        changeMadeToBuffer = true;
        pos++;
        if(pos > virtualFileLength) {
            virtualFileLength = pos;
        }
    }

    /**
     * Writes all given bytes to the random access file at the current position.
     * 
     */
    public void write(byte[] bytes) throws IOException {
        int written = 0;
        int bytesToWrite = bytes.length;

        // First fill as much of the current buffer block as possible.
        readyBuffer();
        int startPositionInBuffer = (int)(pos - bufBlockStart);
        int lengthToWriteToBuffer = Math.min(bytesToWrite - written, bufsize - startPositionInBuffer);
        assert startPositionInBuffer + lengthToWriteToBuffer <= bufsize;

        System.arraycopy(bytes, written,
                        buf, startPositionInBuffer,
                        lengthToWriteToBuffer);
        pos += lengthToWriteToBuffer;
        if(pos > virtualFileLength) {
            virtualFileLength = pos;
        }
        written += lengthToWriteToBuffer;
        this.changeMadeToBuffer = true;

        // Write whatever is left directly to the random access file.
        if(written < bytesToWrite) {
            writeBufferToDisk();
            int toWrite = bytesToWrite - written;
            raf.seek(pos); // make the write position explicit
            raf.write(bytes, written, toWrite);
            pos += toWrite;
            if(pos > virtualFileLength) {
                virtualFileLength = pos;
                actualFileLength = virtualFileLength;
            }
        }
    }

    /**
     * Reads up to bytes.length bytes into the given array.
     * 
     * @return the number of bytes read.
     */
    public int read(byte[] bytes) throws IOException {
        int read = 0;
        int bytesToRead = bytes.length;
        while(read < bytesToRead) {

            //First see if we need to fill the cache
            if(readyBuffer() == false) {
                //No more to read;
                return read;
            }

            //Now read as much as we can (or need from cache and place it
            //in the given byte[]
            int startPositionInBuffer = (int)(pos - bufBlockStart);
            int lengthToReadFromBuffer = Math.min(bytesToRead - read, bufsize - startPositionInBuffer);

            System.arraycopy(buf, startPositionInBuffer, bytes, read, lengthToReadFromBuffer);

            pos += lengthToReadFromBuffer;
            read += lengthToReadFromBuffer;
        }

        return read;
    }

    public void close() throws IOException {
        try {
            this.writeBufferToDisk();
        } finally {
            raf.close();
        }
    }

    /**
     * Gets the length of the file.
     * 
     * @return
     * @throws IOException
     */
    public long length() throws IOException{
        return virtualFileLength;
    }

}
Dictation answered 9/3, 2020 at 4:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.