Java: reading strings from a random access file with buffered input

I've never had close experiences with Java IO API before and I'm really frustrated now. I find it hard to believe how strange and complex it is and how hard it could be to do a simple task.

My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.

For example, in most script languages it would be a very simple 1-2-3-liner like that (in Ruby, but it will be essentially the same for Python, Perl, etc):

f = File.open("file.txt")
f.seek(pos1)
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end

It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:

  • RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() returns neither UTF-8 strings nor byte arrays: it produces strings with broken encoding. It also has no buffering, which probably means that every read*() call is translated into a single underlying OS read() and is therefore fairly slow.
  • BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it offers no way to determine even the number of bytes already read, let alone the current position in the file.

I've tried to use something like:

    FileInputStream fis = new FileInputStream(fileName);
    FileChannel fc = fis.getChannel();
    BufferedReader br = new BufferedReader(
            new InputStreamReader(
                    fis,
                    CHARSET_UTF8
            )
    );

... and then using fc.position() to get current file reading position and fc.position(newPosition) to set one, but it doesn't seem to work in my case: looks like it returns position of a buffer pre-filling done by BufferedReader, or something like that - these counters seem to be rounded up in 16K increments.
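The jumping counters can be reproduced with a small sketch. ReadAheadDemo and positionAfterFirstLine are hypothetical names, and Java 7 or later is assumed. The idea is that the reader chain pulls a whole block of bytes from the stream before returning the first line, so the channel position reflects the read-ahead and not the line that was consumed:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class ReadAheadDemo {
    /** Returns the channel position observed right after reading one line. */
    static long positionAfterFirstLine(String fileName) throws IOException {
        try (FileInputStream fis = new FileInputStream(fileName)) {
            FileChannel fc = fis.getChannel();
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(fis, StandardCharsets.UTF_8));
            br.readLine();          // consumes one line from the reader's point of view
            return fc.position();   // but the channel has already advanced by a whole block
        }
    }
}
```

On a file much larger than one line, the returned value should be well past the end of the first line, which matches the rounded positions described above.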

Do I really have to implement it all by myself, i.e. a file reading interface which would:

  • allow me to get/set position in a file
  • buffer file reading operations
  • allow reading UTF8 strings (or at least allow operations like "read everything till the next \n")

Is there a quicker way than implementing it all myself? Am I overlooking something?

Trismus answered 29/11, 2010 at 15:19 Comment(7)
RandomAccessFile is meant for binary data. Although it can store and retrieve UTF-8 strings with writeUTF/readUTF, as you've found, its readLine (and the DataInput interface's readLine in general) doesn't work on UTF-8.Rodomontade
Are you allowed to use openJDK 7 (beta) or a 3rd party lib such as Apache Commons IO?Nestornestorian
@Martijn: please post your OpenJDK 7 and Apache Commons IO solutions anyway. I'm curious, and probably other people are too.Eurhythmic
@Martijn Verburg: I can't use JDK 7, but any 3rd party libraries are welcome. Please answer, it's interesting :)Trismus
@Ken Bloom - I gave the Java 7 version a go and it's still pretty darn verbose and it actually failed at runtime with the latest openJDK build :(. The only advantage is that you could use multiple threads to read/write from the same file in parallel. I've posted it anyhow. I'll confess to not having looked up the commons file I/O stuff yet, I'd assume they had a simpler API than JDK 1.5/1.6, I'll take a look at that nextNestornestorian
@Ken Bloom, ah I see your Commons I/O solution - nice one.Nestornestorian
Some nice ideas on buffered RandomAccessFile are given here.Sd
E
6
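import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.commons.io.input.BoundedInputStream;

FileInputStream file = new FileInputStream(filename);
file.skip(pos1);  // start reading at pos1
// BoundedInputStream reports end of stream after pos2 - pos1 bytes
BufferedReader br = new BufferedReader(
   new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), "UTF-8")
);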

If you didn't care about pos2, then you wouldn't need Apache Commons IO.
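If a third-party library is not an option, the bounding part is small enough to write by hand. The following is only a sketch under assumptions: RangeLines, LimitedStream and openRange are hypothetical names, Java 7 or later is assumed, and the wrapper covers only the read methods that InputStreamReader needs.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class RangeLines {

    /** Hypothetical helper: reports end of stream after `limit` bytes. */
    static final class LimitedStream extends FilterInputStream {
        private long remaining;

        LimitedStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = super.read();
            if (b >= 0) remaining--;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) return -1;
            // Never ask the underlying stream for more than the bytes left in the range.
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }
    }

    /** Opens a buffered UTF-8 line reader over bytes [pos1, pos2) of the file. */
    static BufferedReader openRange(String fileName, long pos1, long pos2) throws IOException {
        FileInputStream file = new FileInputStream(fileName);
        // Moving the channel also moves the stream's read position.
        file.getChannel().position(pos1);
        return new BufferedReader(new InputStreamReader(
                new LimitedStream(file, pos2 - pos1), StandardCharsets.UTF_8));
    }
}
```

The caller reads with readLine() until it returns null and then closes the reader, which also closes the underlying file. Only the current buffer is held in memory, so the size of the range should not matter.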

Eurhythmic answered 29/11, 2010 at 15:57 Comment(1)
Thanks! It's a pity that the original Java APIs don't include such functionality, but at least we have a workaround like BoundedInputStream. I've also found out that Google's Guava includes a very similar LimitedInputStream class.Trismus
T
6

I wrote this code to read UTF-8 lines using a RandomAccessFile

//File: CyclicBuffer.java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class CyclicBuffer {
private static final int size = 3;
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);

public CyclicBuffer(FileChannel channel) {
    this.channel = channel;
}

private int read() throws IOException {
    return channel.read(buffer);
}

/**
 * Returns the byte read
 *
 * @return byte read -1 - end of file reached
 * @throws IOException
 */
public byte get() throws IOException {
    if (buffer.hasRemaining()) {
        return buffer.get();
    } else {
        buffer.clear();
        int eof = read();
        if (eof == -1) {
            return (byte) eof;
        }
        buffer.flip();
        return buffer.get();
    }
}
}
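//File: UTFRandomFileLineReader.java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
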
public class UTFRandomFileLineReader {
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;

public UTFRandomFileLineReader(FileChannel channel) {
    this.buffer = new CyclicBuffer(channel);
}

public String readLine() throws IOException {
    if (eof) {
        return null;
    }
    byte x = 0;
    temp.clear();

    while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
        if (temp.position() == temp.capacity()) {
            temp = addCapacity(temp);
        }
        temp.put(x);
    }
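    if (x == -1) {
        eof = true;
        // Nothing left after the last newline: signal end of file.
        if (temp.position() == 0) {
            return null;
        }
    }
    temp.flip();
    // An empty line decodes to "", so blank lines are not mistaken for end of file.
    return charset.decode(temp).toString();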
}

private ByteBuffer addCapacity(ByteBuffer temp) {
    ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
    temp.flip();
    t.put(temp);
    return t;
}

public static void main(String[] args) throws IOException {
    RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
            "r");
    UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
            .getChannel());
    int i = 1;
    while (true) {
        String s = reader.readLine();
        if (s == null)
            break;
        System.out.println("\n line  " + i++);
        s = s + "\n";
        for (byte b : s.getBytes(Charset.forName("utf-8"))) {
            System.out.printf("%x", b);
        }
        System.out.printf("\n");

    }
}
}
Thomsen answered 14/4, 2011 at 9:26 Comment(0)
N
1

For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started their tutorial here

Also note that this isn't using Java 7's new ARM syntax (which takes care of the exception handling for file-based resources), because it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.

/*
 * Paths uses the default file system; note that no exception is thrown at this
 * stage if the file is missing.
 */
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
    /*
     * In the released Java 7 API, FileChannel.open returns a FileChannel, which
     * is a SeekableByteChannel and can be positioned directly. Asynchronous
     * access would need an AsynchronousFileChannel instead.
     */
    fc = FileChannel.open(file, StandardOpenOption.READ);
    fc.position(startPosition);
    while (fc.read(readBuffer) != -1)
    {
        // flip() limits the buffer to the bytes just read before decoding.
        // Note: a multi-byte character split across two reads is decoded incorrectly.
        readBuffer.flip();
        System.out.println(Charset.forName(encoding).decode(readBuffer));
        // clear() makes the whole buffer available for the next read
        readBuffer.clear();
    }
}
finally
{
    if (fc != null)
    {
        fc.close();
    }
}
Nestornestorian answered 29/11, 2010 at 16:45 Comment(5)
How do you read a file one line at a time using NIO? Is such a thing even possible?Eurhythmic
I've fixed the sample code so the reading works - will figure out the best way to read a line (this is a useful exercise as I'm feeding back API 'issues' to the nio-dev mailing list)Nestornestorian
I see the answer here javakb.com/Uwe/Forum.aspx/java-programmer/7117/… -- use a java.util.Scanner to operate on the channelEurhythmic
That would work yes, I just realised I used the regular FileChannel example as opposed to the AsynchronousFileChannel example, so I've adjusted my comments above. It's all powerful stuff, but it still needs some higher level API abstractions to catch on I think.Nestornestorian
AFAICT, GreyCat still couldn't limit the reader to go no further than pos2Eurhythmic
E
0

Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.

Then create your BufferedReader using

new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes), "UTF-8"))

Then you can call readLine on the BufferedReader.

Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
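Put together, the steps above could look roughly like the sketch below. RangeInMemory and openRange are hypothetical names, Java 7 or later is assumed, and the range is assumed to be small enough to fit in a single byte array:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RangeInMemory {
    /** Loads bytes [pos1, pos2) into memory and wraps them in a UTF-8 line reader. */
    static BufferedReader openRange(String fileName, long pos1, long pos2) throws IOException {
        // Assumption: pos2 - pos1 fits in an int and in available memory.
        byte[] rawBytes = new byte[(int) (pos2 - pos1)];
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            // readFully throws EOFException if the file ends before pos2.
            raf.readFully(rawBytes);
        }
        return new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8));
    }
}
```

The file is closed as soon as the bytes are loaded, so the returned reader works purely on the in-memory copy.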

Eurhythmic answered 29/11, 2010 at 15:43 Comment(1)
It's not an option for me: I'm working with multiple gigabyte files on computers with limited memory.Trismus
S
0

I think the confusion is caused by the UTF-8 encoding and the possibility of double byte characters.

UTF-8 is a variable-width encoding, so the number of bytes does not tell you the number of characters. I'm assuming from your post that you are using single-byte characters. For example, 412 bytes would mean 411 characters. But if the string used double-byte characters, you would get only 206 characters.

The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide for direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
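The byte/character distinction is easy to see in a tiny sketch (BytesVsChars and utf8Length are hypothetical names; Java 7 or later is assumed for StandardCharsets):

```java
import java.nio.charset.StandardCharsets;

public class BytesVsChars {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String accented = "h\u00e9llo";   // U+00E9 takes two bytes in UTF-8
        // prints "5 chars, 5 bytes"
        System.out.println(ascii.length() + " chars, " + utf8Length(ascii) + " bytes");
        // prints "5 chars, 6 bytes"
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes");
    }
}
```

This is why a byte offset into a file cannot be passed to a reader as a character count unless the text is known to be pure ASCII.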

Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to pass by X characters and then start reading text. Alternatively, I prefer the overloaded read() method since it allows you to grab all the text at one time.

If you assume your "bytes" are individual characters, try something like this:

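FileReader fr = new FileReader( new File("x.txt") );
char[] buffer = new char[ pos2 - pos ];
// skip() advances past characters in the file; the second argument of
// read() is an offset into the buffer, not into the file
fr.skip( pos );
fr.read( buffer, 0, buffer.length );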
...
Stag answered 29/11, 2010 at 15:49 Comment(3)
Note that readers skip characters, not bytes. This prevents ambiguity when working with unknown character sets - is it single byte or double byte? In your case, I assumed that it's all single byte characters, so "pos" = characters.Stag
I work with full unicode set - with multiple bytes per character in UTF8 encoding - there's no problem with that. FileReader has essentially the same interface as BufferedReader, but you propose to read full range of bytes into memory at once - but I can't do it, I'm working with multi-gigabyte ranges on machines with fairly limited RAM.Trismus
You don't have to do it that way, it's just convenient. The longer way is to use skip() to get to the correct point, then use read() to get single characters off the stream. Since there is no buffering in that method you would have a lot of control over the memory footprint. (You can still get the benefits of buffering by wrapping the FileReader with a BufferedReader. BufferedReader can be initialized to a very specific size if you need to limit the memory footprint.)Stag
Z
0

I'm late to the party here, but I ran across this problem in my own project.

After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.

After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:

FileDescriptor fd = raFile.getFD();
FileReader     fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);

Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().

The one thing I'm not sure about is whether UTF8 strings are handled correctly.
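On the UTF-8 question: FileReader(FileDescriptor) decodes with the platform's default charset, so the result depends on the machine. A variation that names the charset explicitly might look like this sketch, where SeekThenRead and utf8Reader are hypothetical names and Java 7 or later is assumed:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class SeekThenRead {
    /** Wraps an already-positioned RandomAccessFile in a buffered UTF-8 reader. */
    static BufferedReader utf8Reader(RandomAccessFile raFile) throws IOException {
        // The stream is built on the same file descriptor, so it starts
        // reading at the position raFile was last moved to with seek().
        FileInputStream in = new FileInputStream(raFile.getFD());
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

Because the reader reads ahead in blocks, raFile.getFilePointer() should not be expected to match the end of the last line returned by readLine().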

Zootechnics answered 13/7, 2014 at 15:1 Comment(2)
Problem with this is that it still does not solve keeping track of bytes read. I have a constraint that I need to return blocks of n bytes (e.g. 64K bytes). UTF8 strings will be handled correctly afaict from the JDK source. Reading source of RandomAccessFile is really enlightening; Oracle should be ashamed at that implementation of readLine, especially since they have a working correct one in BufferedReader... worse still, in all their wisdom they made it final so we can't even fix itJasminjasmina
Good point: this doesn't support stopping reads at a certain position. Seems like it would be simple to wrap this code to track bytes read, though. Also, I sympathize with your frustration at the RandomAccessFile implementation!Zootechnics
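One way to track bytes read is to split lines at the byte level and decode each line afterwards. This works because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence. The class below is only a sketch: PositionedLineReader is a hypothetical name, Java 7 or later is assumed, and a trailing \r from Windows line endings is not stripped.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PositionedLineReader implements AutoCloseable {
    private final InputStream in;
    private long position;

    PositionedLineReader(String fileName, long start) throws IOException {
        FileInputStream file = new FileInputStream(fileName);
        file.getChannel().position(start);       // seek before any buffering happens
        this.in = new BufferedInputStream(file); // buffered, so read() is cheap
        this.position = start;
    }

    /** Byte offset in the file of the next unread byte. */
    long position() {
        return position;
    }

    /** Returns the next line without its newline, or null at end of file. */
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            position++;                 // count every byte consumed, newline included
            sawAny = true;
            if (b == '\n') break;
            line.write(b);
        }
        if (!sawAny) return null;
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

A caller could then loop with a condition such as `reader.position() < pos2` and stop when readLine() returns null, which is close to the script-language loop in the question.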
P
-1

The Java IO API is very flexible. Unfortunately, that flexibility sometimes makes it verbose. The main idea is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same applies to output streams.

The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.

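Fortunately, some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file, you just have to say

    InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
    // skip() does not depend on mark support; FileInputStream does not support
    // mark, so guarding on markSupported() would skip nothing
    in.skip(1024);
    int b = in.read();  // next byte, or -1 at end of file

It is not as complicated as you fear.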

Channels are something different. They are part of the so-called "new IO", or NIO. NIO can be non-blocking, which is its main advantage. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular IO and is not needed for most applications.

Purkey answered 29/11, 2010 at 15:30 Comment(1)
Well, I've read most of these "ideological differences" explanations in Javadocs, but, back to my original question: am I right that there's no simple way (like in 5-6-10 lines) to do exactly what I've demonstrated in script language?Trismus