Read lines of characters and get file position

I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.

At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.

Questions

Is there an easy way to do both, preferably using standard Java libraries?

If not, what is a reasonable workaround?

Attributes of an ideal solution

  • An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes.
  • An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library. Second best would be an Apache or Google library.
  • The solution must be scalable. Reading the entire file into memory is not a solution. Returning to a position should not require reading all prior characters in linear time.

Details

For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.

Less obviously, InputStreamReader can also read ahead, which interferes with getting the file position. From the InputStreamReader documentation:

To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.

The method RandomAccessFile.readLine() reads a single byte per character.

Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.

Leilaleilah answered 3/6, 2015 at 18:11 Comment(8)
Do you need full Unicode support? – Blastoff
As an FYI, I think most of the Java classes that have readLine() also trim off trailing whitespace/newline characters, so even if you were OK with ASCII-only support, your offset would still be off. – Blastoff
@Blastoff - Nothing more than what Java supports - 16-bit Unicode characters, which IIUC is called the Basic Multilingual Plane. Yes, I see from the class documentation that both BufferedReader.readLine() and RandomAccessFile.readLine() read and strip the line terminator from the return value. However, I think the stripping just affects the return value, not the file position. – Leilaleilah
@Andy_Thomas Correct; however, if you were trying to compute the file position from line.length(), the computation would be off due to the stripped-off terminators. – Blastoff
@Blastoff - I'd expect to get the file position from FileChannel.position() or RandomAccessFile.getFilePointer(). I think computing it with line.length() alone would be problematic for some encodings, such as UTF-8. – Leilaleilah
I suspect java.nio.charset.CharsetDecoder is what you need here. You can do all your buffering on the stream level, then implement your own Reader over it that counts the bytes processed and feeds them into a CharsetDecoder (a sketch of this idea follows these comments). – Chitchat
@Chitchat - Yeah, I was looking at that an hour ago, after drilling down from InputStreamReader to the private implementation class sun.nio.cs.StreamDecoder. I was hoping for something simpler, though. – Leilaleilah
@AndyThomas The fact that the Charset class itself even has to support legacy bugs suggests that there are no easy solutions here that work for every possible character encoding. Reading text files is a surprisingly hard problem. – Chitchat
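
A minimal sketch of the CharsetDecoder idea from the comment above, not from any library: all names here are made up for illustration. It wraps a FileChannel and decodes one character at a time, so the byte offset of the next undecoded character is always known exactly; position() called right after readLine() is then the offset at which the next line starts. Simplifications and assumptions: lines are split only on '\n', a BOM is not skipped, a truncated trailing multi-byte sequence at end of file is silently dropped, and the decoder is assumed never to emit half of a surrogate pair (true of the JDK's decoders).

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PositionTrackingLineReader implements AutoCloseable {
    private final FileChannel channel;
    private final CharsetDecoder decoder;
    private final ByteBuffer bytes = ByteBuffer.allocate(8192);
    private long bytesConsumed; // bytes decoded into characters handed back so far

    PositionTrackingLineReader(Path path, Charset charset) throws IOException {
        this.channel = FileChannel.open(path, StandardOpenOption.READ);
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        bytes.limit(0); // start with an empty, readable buffer
    }

    // Byte offset in the file at which the next unread character (i.e. the next line) starts.
    long position() {
        return bytesConsumed;
    }

    // Reads one line (terminated by '\n'), or returns null at end of file.
    String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        CharBuffer out = CharBuffer.allocate(1);
        boolean sawAny = false;
        while (true) {
            out.clear();
            int before = bytes.position();
            CoderResult result = decoder.decode(bytes, out, false);
            if (result.isError()) {
                result.throwException(); // malformed or unmappable input
            }
            if (out.position() == 0) {
                if (result.isOverflow()) {
                    // Next character is outside the BMP and needs a surrogate pair.
                    out = CharBuffer.allocate(2);
                    continue;
                }
                // Underflow: not enough bytes buffered to decode a character; refill.
                bytes.compact();
                int n = channel.read(bytes);
                bytes.flip();
                if (n == -1) {
                    return sawAny ? line.toString() : null;
                }
                continue;
            }
            bytesConsumed += bytes.position() - before;
            sawAny = true;
            if (out.position() == 1 && out.get(0) == '\n') {
                return line.toString(); // position() now marks the start of the next line
            }
            out.flip();
            line.append(out);
            if (out.capacity() > 1) {
                out = CharBuffer.allocate(1); // shrink back after a supplementary character
            }
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}

Resuming later would mean re-opening the channel, setting it to the saved offset with FileChannel.position(long), and resetting the decoder and byte count; that part is left out of the sketch.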

If you construct a BufferedReader from a FileReader and keep an instance of the FileReader accessible to your code, you should be able to get the position of the next line by calling:

fileReader.getChannel().position();

after a call to bufferedReader.readLine().

The BufferedReader could be constructed with an input buffer of size 1 if you're willing to trade performance for positional precision.
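
For what it's worth, a hedged sketch of that wiring, going through a FileInputStream (whose getChannel() exposes the position; FileReader itself does not, as a comment below notes). The file name is a placeholder, and the reported position is approximate at best, since the InputStreamReader may still read ahead:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ChannelPositionSketch {
    public static void main(String[] args) throws IOException {
        try (FileInputStream fis = new FileInputStream("input.txt"); // placeholder path
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fis, StandardCharsets.UTF_8), 1)) {
            String line = reader.readLine();
            // Approximate only: the InputStreamReader may have read ahead,
            // so this can point past the true start of the next line.
            long approxNextLineStart = fis.getChannel().position();
            System.out.println(line + " -> ~" + approxNextLineStart);
        }
    }
}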

Alternate solution

What would be wrong with keeping track of the bytes yourself?

long startingPoint = 0; // or the starting position if this file has been previously processed

String line;
while ((line = bufferedReader.readLine()) != null) {
    // NOTE: readLine() strips the line terminator; its byte length must be added to the tally separately.
    startingPoint += line.getBytes().length;
}

This would give you a byte count accurate to what you've already processed, regardless of underlying marking or buffering. You'd have to account for line endings in your tally, since they are stripped.

Droplight answered 8/6, 2015 at 18:32 Comment(9)
A BufferedReader buffers its input. A call to BufferedReader.readLine() can read more into the buffer from the underlying FileReader than just the next line -- leaving the position past the start of the subsequent line. – Leilaleilah
The buffer size can be as small as 1. It defeats the purpose of having a BufferedReader, I suppose, except that it provides the convenience of not having to do the line parsing/character encoding logic yourself. And as you point out, even the InputStreamReader may be doing read-ahead. – Droplight
Does a buffer size of 1 give you a reasonable enough position? You can always subtract one from the position reported by the file channel. The worst that would do is reprocess a line terminator. It's starting to feel pretty hacky now, though... – Droplight
That's an interesting idea. I like your insight about getting close rather than perfect. However, the underlying FileReader extends InputStreamReader, which has its own buffering. Imprecision could provide a position in the middle of the bytes for a character. (In addition, imprecision could be troublesome if blank lines were present and significant. They're not in my current use case.) – Leilaleilah
For the alternate solution, note that getBytes() does not necessarily provide bytes in the same encoding as the file. That said, there is a sibling that accepts a character encoding. One would need to have confidence that decoding and re-encoding bytes resulted in the same number of bytes. I don't happen to know if that's guaranteed. (As a separate issue, not all files are guaranteed to have the same line terminator string on every line -- making it difficult to account for their length if stripped.) – Leilaleilah
You can use the NIO Files.newBufferedReader() to ensure that the same charset is used in both cases, but I'm starting to get the feeling that any readLine()-based solution isn't going to work for you, and that you'll end up needing to read characters and account for line endings yourself. – Droplight
I suspect you're right, for solutions based on BufferedReader.readLine(). Hopefully there's an alternate mechanism neither of us knows yet. Maybe my discomfort with computing file position from encoded and decoded bytes is misplaced. Anyway, thank you for your insight and effort, and for reading the question. – Leilaleilah
I'm awarding this answer the bounty for insight, thoughtfulness and tone, as the end of the bounty period approaches. Also see @biziclop's comments on the question. I also have a partial workaround that covers only ASCII and UTF-8 (see my answer to come). A general, easy solution is still desirable. – Leilaleilah
@Droplight FileReader has no getChannel(). Besides, a FileInputStream cannot be passed to a BufferedReader directly, and FileInputStream.getChannel().position() doesn't advance the file pointer at all (meaning you get the same position value every time you call it). – Anett

This partial workaround addresses only files encoded with 7-bit ASCII or UTF-8. An answer with a general solution is still desirable (as is criticism of this workaround).

In UTF-8:

  • All single-byte characters can be distinguished from all bytes in multi-byte characters. All the bytes in a multi-byte character have a '1' in the high-order position. In particular, the bytes representing LF and CR cannot be part of a multi-byte character.
  • All single-byte characters are in 7-bit ASCII. So we can decode a file containing only 7-bit ASCII characters with a UTF-8 decoder.

Taken together, those two points mean we can read a line with something that reads bytes, rather than characters, then decode the line.

To avoid problems with buffering, we can use RandomAccessFile. That class provides methods to read a line, and get/set the file position.

Here's a sketch of code to read the next line as UTF-8 using RandomAccessFile.

protected static String
readNextLineAsUTF8( RandomAccessFile in ) throws IOException {
    String rv = null;
    // readLine() zero-extends each raw byte into a char, so re-encoding with
    // ISO-8859-1 recovers the original bytes exactly before decoding them as UTF-8.
    String lineBytes = in.readLine();
    if ( null != lineBytes ) {
        rv = new String( lineBytes.getBytes( StandardCharsets.ISO_8859_1 ),
            StandardCharsets.UTF_8 );
    }
    return rv;
}

Then the file position can be obtained from the RandomAccessFile immediately before calling that method. Given a RandomAccessFile referenced by in:

    long startPos = in.getFilePointer();
    String line = readNextLineAsUTF8( in );
Leilaleilah answered 15/6, 2015 at 16:21 Comment(0)

This case seems to be addressed by VTD-XML, a library designed to parse big XML files quickly:

The latest Java VTD-XML XimpleWare implementation, currently 2.13 (http://sourceforge.net/projects/vtd-xml/files/vtd-xml/), provides code that maintains a byte offset after each call to the getChar() method of its IReader implementations.

IReader implementations for various character encodings are available inside VTDGen.java and VTDGenHuge.java.

IReader implementations are provided for the following encodings:

ASCII; ISO_8859_1 ISO_8859_10 ISO_8859_11 ISO_8859_12 ISO_8859_13 ISO_8859_14 ISO_8859_15 ISO_8859_16 ISO_8859_2 ISO_8859_3 ISO_8859_4 ISO_8859_5 ISO_8859_6 ISO_8859_7 ISO_8859_8 ISO_8859_9 UTF_16BE UTF_16LE UTF8;
WIN_1250 WIN_1251 WIN_1252 WIN_1253 WIN_1254 WIN_1255 WIN_1256 WIN_1257 WIN_1258

Iatry answered 24/7, 2016 at 9:2 Comment(1)
Updating IReader with a getCharOffset() method, implementing it by adding a charCount member alongside the offset member of the VTDGen and VTDGenHuge classes, and incrementing it on each getChar() and skipChar() call of each IReader implementation might give you the solution. – Iatry

Initially, I found the approach suggested by Andy Thomas (https://mcmap.net/q/532664/-read-lines-of-characters-and-get-file-position) the most appropriate.

But unfortunately I couldn't succeed in converting the bytes (taken from RandomAccessFile.readLine) to a correct string in cases where the file line contains non-Latin characters.

So I reworked the approach, writing a function similar to RandomAccessFile.readLine itself that collects the line's data into a byte array rather than a String, and then constructs the desired String from that byte array. The following code (in Kotlin) completely satisfied my needs.

After calling the function, file.channel.position() will return the exact position of the next line (if any):

import java.io.ByteArrayOutputStream
import java.io.RandomAccessFile
import java.nio.charset.Charset

fun RandomAccessFile.readEncodedLine(charset: Charset = Charsets.UTF_8): String? {
    val lineBytes = ByteArrayOutputStream()
    var c = -1
    var eol = false

    while (!eol) {
        c = read()
        when (c) {
            -1, 10 -> eol = true // \n or end of file
            13     -> { // \r
                eol = true
                val cur = filePointer
                if (read() != '\n'.toInt()) {
                    seek(cur) // lone \r terminator: step back so the next byte is re-read
                }
            }
            else   -> lineBytes.write(c)
        }
    }

    return if (c == -1 && lineBytes.size() == 0)
        null
    else
        String(lineBytes.toByteArray(), charset)
}
Inhalant answered 29/5, 2018 at 14:33 Comment(0)

I would suggest java.io.LineNumberReader. You can set and get the line number and therefore continue at a certain line index.

Since it is a BufferedReader, it is also capable of handling UTF-8.
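
For context, a rough sketch of what resuming at a line index looks like with LineNumberReader (file name and target line are placeholders). Note that setLineNumber(int) only changes the counter it reports; actually reaching line N still means reading and discarding every prior line, which is the linear cost the comment below points out:

import java.io.IOException;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineNumberReaderSketch {
    public static void main(String[] args) throws IOException {
        int resumeAtLine = 1000; // placeholder 0-based line index
        try (LineNumberReader reader = new LineNumberReader(
                Files.newBufferedReader(Paths.get("input.txt"), StandardCharsets.UTF_8))) {
            // Skip lines one by one until the counter reaches the target index.
            while (reader.getLineNumber() < resumeAtLine && reader.readLine() != null) {
                // discarding prior lines; linear in the amount of text skipped
            }
            System.out.println(reader.readLine()); // the line at resumeAtLine, or null
        }
    }
}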

Salcedo answered 8/6, 2015 at 18:44 Comment(1)
I'm looking for a way to return to the position quickly, since I'm working with large files. Setting the file position is a constant-time operation. Skipping lines has a linear cost. – Leilaleilah

Solution A

  1. Use RandomAccessFile.readChar() or RandomAccessFile.readByte() in a loop.
  2. Check for your EOL characters, then process that line.

The problem with anything else is that you would have to absolutely make sure you never read past the EOL character.

readChar() returns a char, not a byte, so you do not have to worry about character width.

Reads a character from this file. This method reads two bytes from the file, starting at the current file pointer.

[...]

This method blocks until the two bytes are read, the end of the stream is detected, or an exception is thrown.
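
To illustrate, a minimal sketch of that loop, under the assumption that the file really is UTF-16BE (the only encoding that matches readChar()'s two-byte, big-endian reads). The helper name is made up, a BOM is not skipped, and only '\n' is treated as the terminator; after a call, getFilePointer() is the byte offset of the next line:

import java.io.IOException;
import java.io.RandomAccessFile;

public class ReadCharLineSketch {
    // Reads one line from a UTF-16BE-encoded file, or returns null at end of file.
    static String readUtf16BeLine(RandomAccessFile in) throws IOException {
        if (in.getFilePointer() >= in.length()) {
            return null;
        }
        StringBuilder line = new StringBuilder();
        while (in.getFilePointer() < in.length()) {
            char c = in.readChar(); // reads exactly two bytes, big-endian
            if (c == '\n') {
                break; // getFilePointer() now points at the start of the next line
            }
            if (c != '\r') {
                line.append(c);
            }
        }
        return line.toString();
    }
}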

By using a RandomAccessFile and not a Reader you are giving up Java's ability to decode the charset in the file for you. A BufferedReader would do so automatically.

There are several ways of overcoming this. One is to detect the encoding yourself and then use the correct read*() method. Another would be to use a BoundedInputStream.

There is one in this question: Java: reading strings from a random access file with buffered input

E.g. https://mcmap.net/q/539649/-java-reading-strings-from-a-random-access-file-with-buffered-input

Frightfully answered 8/6, 2015 at 19:14 Comment(8)
Is this equivalent to simply calling RandomAccessFile.readLine()? As noted above, the encoding of the characters is not necessarily single-byte. – Leilaleilah
Which part? readChar()? No, readChar() will always return a character, not a byte. With the 2nd solution, you can rewind, so the only limit is your read limit. – Frightfully
Sorry, I missed readChar(). Unfortunately, the documentation and source code for readChar() both show it reading exactly two bytes, presumably UTF-16. That wouldn't work if the file used a single-byte encoding or UTF-8. – Leilaleilah
Java chars are always 2 bytes wide, UTF-8, UTF-16 or otherwise. UTF-8 is a variable-width charset and can be 1 to 4 bytes in length. If you reach EOF, you can handle the exception and use the single byte as the char to be processed. – Frightfully
Java chars are two bytes. But the characters in the file are not necessarily in the same encoding as Java chars in memory. – Leilaleilah
I believe there is some confusion about how the read works. But if you want to test the encoding using read(), you can do that as well. Also, separately, you can use the mark()/reset() API. – Frightfully
Mark/reset doesn't help because I want to "re-open the file later and return to that position." The RandomAccessFile.readChar() method reads a single, well-defined representation of a character from a file. However, there are other legal and prevalent representations that need to be covered by an answer to this question. – Leilaleilah
I think this question is a moving target. But that's OK. I'll put my final thoughts down on this. It's really not too difficult, I don't think. – Frightfully

RandomAccessFile has a method, seek(long pos), which sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.
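
For example, a small sketch of the save-and-resume mechanics this enables (the file name is a placeholder; readLine() itself still has the single-byte-per-character limitation discussed in the question):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekSketch {
    public static void main(String[] args) throws IOException {
        long savedPosition;
        try (RandomAccessFile in = new RandomAccessFile("input.txt", "r")) { // placeholder path
            in.readLine();                       // read some lines...
            savedPosition = in.getFilePointer(); // byte offset where the next line starts
        }
        // Later: re-open and jump straight back in constant time.
        try (RandomAccessFile in = new RandomAccessFile("input.txt", "r")) {
            in.seek(savedPosition);
            System.out.println(in.readLine());
        }
    }
}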

Equestrienne answered 15/6, 2015 at 14:57 Comment(1)
Yes ... but the question is how to get the file position ... when reading lines of characters in any of a number of possible character encodings. – Leilaleilah
