I'm reading sequential lines of characters from a text file. The encoding of the characters in the file might not be single-byte.
At certain points, I'd like to get the file position at which the next line starts, so that I can re-open the file later and return to that position quickly.
Questions
Is there an easy way to do both, preferably using standard Java libraries?
If not, what is a reasonable workaround?
Attributes of an ideal solution
An ideal solution would handle multiple character encodings. This includes UTF-8, in which different characters may be represented by different numbers of bytes.
An ideal solution would rely mostly on a trusted, well-supported library. Most ideal would be the standard Java library. Second best would be an Apache or Google library.
The solution must be scalable. Reading the entire file into memory is not a solution. Returning to a position should not require reading all prior characters in linear time.
Details
For the first requirement, BufferedReader.readLine() is attractive. But buffering clearly interferes with getting a meaningful file position.
Less obviously, InputStreamReader can also read ahead, interfering with getting the file position. From the InputStreamReader documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
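The read-ahead can be observed directly. The following is a minimal sketch, not a definitive test: CountingInputStream and ReadAheadDemo are hypothetical names introduced here, and the exact number of bytes consumed depends on the JDK implementation, since the documentation only says that more bytes "may" be read.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: counts the bytes pulled from the wrapped stream. */
class CountingInputStream extends FilterInputStream {
    long count = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }
}

class ReadAheadDemo {
    /** Reads ONE character through an InputStreamReader and reports how many bytes left the source. */
    static long bytesConsumedForOneChar(byte[] data) {
        CountingInputStream counting = new CountingInputStream(new ByteArrayInputStream(data));
        try (InputStreamReader reader = new InputStreamReader(counting, StandardCharsets.UTF_8)) {
            reader.read(); // request a single character
            return counting.count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes consumed for one char: "
                + bytesConsumedForOneChar(data) + " of " + data.length);
    }
}
```

If the count printed is larger than the one byte needed for the first character, the position of the underlying stream no longer corresponds to the character position the caller has reached.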
The method RandomAccessFile.readLine() reads a single byte per character. From its documentation:
Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
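Because each byte maps to exactly one char in the range 0 to 255, the string returned by RandomAccessFile.readLine() can be turned back into the original bytes and decoded again. The sketch below shows that workaround together with getFilePointer(). RafLineDemo and its methods are hypothetical names, and the approach assumes an ASCII-compatible encoding such as UTF-8, where the bytes 0x0A and 0x0D only ever mean line terminators (it would not work for UTF-16). Note also that RandomAccessFile is unbuffered, so this may be slow on large files.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RafLineDemo {
    /** Recovers the raw bytes of a readLine() result and decodes them with the real charset. */
    static String redecode(String raw, Charset charset) {
        if (raw == null) return null; // end of file
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), charset);
    }

    /** Reads line 1, saves the offset of line 2, reopens the file, seeks, and reads line 2. */
    static String secondLineViaSavedOffset(Path file, Charset charset) {
        try {
            long offset;
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                raf.readLine();                 // consume the first line
                offset = raf.getFilePointer();  // byte offset where the next line starts
            }
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                raf.seek(offset);               // jump straight to the saved position
                return redecode(raf.readLine(), charset);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small UTF-8 sample file containing multi-byte characters. */
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("raf-lines", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, "h\u00e9llo\nw\u00f6rld\n".getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondLineViaSavedOffset(writeSample(), StandardCharsets.UTF_8));
    }
}
```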
readLine() also trims off trailing whitespace/newline characters, so even if you were OK with ASCII-only support, your offset would still be off. – Blastoff
BufferedReader.readLine() and RandomAccessFile.readLine() both read and strip the line terminator from the return value. However, I think the stripping just affects the return value, not the file position. – Leilaleilah
[...] line.length() the computation would be off due to the stripped-off terminators. – Blastoff
[...] FileChannel.position() or RandomAccessFile.getFilePointer(). I think computing it with line.length() alone would be problematic for some encodings, such as UTF-8. – Leilaleilah
java.nio.charset.CharsetDecoder is what you need here. You can do all your buffering on the stream level, then implement your own Reader over it that counts the bytes processed and feeds them into a CharsetDecoder. – Chitchat
[...] InputStreamReader to the private implementation class sun.nio.cs.StreamDecoder. I was hoping for something simpler, though. – Leilaleilah
[...] Charset class itself even has to support legacy bugs suggests that there are no easy solutions here that work for every possible character encoding. Reading text files is a surprisingly hard problem. – Chitchat
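A simplified variation on the CharsetDecoder idea from the comments might look like the sketch below. It is not the full custom Reader described there: it buffers at the stream level, counts every byte it consumes, splits lines on the byte 0x0A, and only then decodes each line with a CharsetDecoder. OffsetLineReader and OffsetLineReaderDemo are hypothetical names, and the sketch assumes an ASCII-compatible encoding (UTF-8, ISO-8859-x) with LF or CRLF terminators, so it is not a solution for every possible encoding.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical class: reads lines and tracks the byte offset at which the next line starts. */
class OffsetLineReader implements Closeable {
    private final InputStream in;
    private final CharsetDecoder decoder;
    private long position; // byte offset of the next unread byte

    OffsetLineReader(Path file, Charset charset, long startOffset) throws IOException {
        SeekableByteChannel channel = Files.newByteChannel(file); // opened for reading
        channel.position(startOffset);                            // constant-time seek
        in = new BufferedInputStream(Channels.newInputStream(channel));
        decoder = charset.newDecoder(); // reports malformed input by default
        position = startOffset;
    }

    /** Byte offset at which the next call to readLine() will start. */
    long position() {
        return position;
    }

    /** Returns the next line without its terminator, or null at end of file. */
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean readAnything = false;
        int b;
        while ((b = in.read()) != -1) {
            position++; // count every byte taken from the buffer, terminators included
            readAnything = true;
            if (b == '\n') break;
            line.write(b);
        }
        if (!readAnything) return null;
        byte[] bytes = line.toByteArray();
        int length = bytes.length;
        if (length > 0 && bytes[length - 1] == '\r') length--; // drop the CR of a CRLF pair
        return decoder.decode(ByteBuffer.wrap(bytes, 0, length)).toString();
    }

    @Override
    public void close() throws IOException {
        in.close(); // also closes the underlying channel
    }
}

class OffsetLineReaderDemo {
    /** Reads line 1, saves the position, reopens the file at that position, and reads line 2. */
    static String secondLineAfterReopen() {
        try {
            Path file = Files.createTempFile("offset-lines", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, "h\u00e9llo\r\nw\u00f6rld\n".getBytes(StandardCharsets.UTF_8));
            long saved;
            try (OffsetLineReader reader = new OffsetLineReader(file, StandardCharsets.UTF_8, 0)) {
                reader.readLine();
                saved = reader.position();
            }
            try (OffsetLineReader reader = new OffsetLineReader(file, StandardCharsets.UTF_8, saved)) {
                return reader.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Counting happens on the consumer side of the buffer, so the read-ahead done by BufferedInputStream does not affect the reported position.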