How do I accomplish random reads of a UTF-8 file?

Asked 8/2, 2011 at 15:35 Answered 9/2, 2011 at 14:32

Solved c# unicode utf-8 utf-16 utf8-decode

My understanding is that reads of a UTF-8 or UTF-16 encoded file can't necessarily be random because of the occasional surrogate byte (used in Eastern languages, for example).

How can I use .NET to skip to an approximate position within the file, and read the unicode text from a semi-random position?

Do I discard surrogate bytes and wait for a word break to continue reading? If so, what are the valid word breaks I should wait for until I start the decoding?

Bipetalous answered 8/2, 2011 at 15:35 Comment(0)

Easy, UTF-8 is self-synchronizing.
Simply jump to a random byte in the file and skip over all bytes whose leading bits are 10 (continuation bytes). The first byte that does not start with 10 is the starting byte of a proper UTF-8 character, and you can read from there onward with a regular UTF-8 decoder. In a well-formed file you never have to skip more than three bytes.
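A minimal sketch of that skip-ahead, written in Python because the byte test is the same in any language (in .NET the equivalent would be a `FileStream` with `Seek` and `ReadByte`). The function name `seek_to_char_start` is made up for illustration, and the file is assumed to be well-formed UTF-8:

```python
import io
import os

def seek_to_char_start(stream, offset):
    """Move a binary stream to the first UTF-8 character start at or after offset."""
    stream.seek(offset)
    while True:
        b = stream.read(1)
        if not b:                      # reached end of file
            return stream.tell()
        if (b[0] & 0xC0) != 0x80:      # not 10xxxxxx, so ASCII or a lead byte
            stream.seek(-1, os.SEEK_CUR)
            return stream.tell()

# Byte layout: a = 0, é = 1..2, 日 = 3..5, b = 6
data = "aé日b".encode("utf-8")
stream = io.BytesIO(data)
pos = seek_to_char_start(stream, 4)    # byte 4 is in the middle of 日
text = stream.read().decode("utf-8")   # decoding from here on is safe
```

Here the jump lands inside the three-byte character, two continuation bytes are skipped, and decoding resumes at `b`.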

Coprolite answered 8/2, 2011 at 16:55 Comment(1)

UTF-16 is self-synchronizing as well unless you jump to an odd byte position. The Unicode encodings were specifically designed to be self-synchronizing, and there are strong guarantees that you only have to skip at most a small number of code units. – Girdle 9/2, 2011 at 14:28

Assuming that you're looking to extract a pseudo-random character from a UTF-8 file, I personally would lean away from trying to work out how to jump into a random place and then scroll forwards to a guaranteed 'start of character' position, which my feeling is would be a tricky proposition. (Edit: this is wrong; see the note at the end of this answer.) How about something like:

1. Establish the length of the file in bytes.
2. Heuristically guess the number of characters, for example by scaling by a constant established from some suitable corpus, or by examining the first n bytes and seeing how many characters they describe, in order to get a scaling constant that might be more representative of this file.
3. Pick a pseudo-random number in 1..<guessed number of characters in file>.
4. If the file is very big (which I'm guessing it must be, else you wouldn't be asking this), use a buffered read to read the file's bytes, decoding them from UTF-8, until you reach the desired character. If you fall off the end of the file, use the last character.
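The estimation step could look roughly like this. It is only a sketch: `estimate_char_count` and `sample_size` are hypothetical names, and it assumes the first bytes of the file are representative of the rest. Characters are counted by counting the bytes that are not continuation bytes, so a sample that ends mid-character still works:

```python
import io
import random

def estimate_char_count(stream, total_bytes, sample_size=4096):
    """Guess the character count by scaling the chars-per-byte ratio of a leading sample."""
    stream.seek(0)
    sample = stream.read(sample_size)
    if not sample:
        return 0
    # Every byte except a 10xxxxxx continuation byte starts a character.
    chars = sum(1 for b in sample if (b & 0xC0) != 0x80)
    return round(total_bytes * chars / len(sample))

# 12 characters in 14 bytes, repeated 100 times
data = ("héllo wörld " * 100).encode("utf-8")
estimate = estimate_char_count(io.BytesIO(data), len(data), sample_size=140)
target = random.randint(1, estimate)   # 1-based index of the character to fetch
```

For a real file, `total_bytes` would come from the file size (for instance `os.path.getsize`).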

A buffered read here needs two buffers that take turns being 'first', so that context isn't lost when a character's bytes are split across two reads, e.g.:

Read Buffer A : bytes 1000-1999
Read Buffer B : bytes 2000-2999

If a character occupies bytes 1998-2001, using a single buffer would lose context.

Read Buffer A : bytes 3000-3999

Now in effect buffer A follows buffer B when we convert the byte stream into characters.

As noted by @jleedev below, and as seen in the other answer, it is actually easy and safe to 'scroll forward' to a guaranteed character start. But the character count estimation stuff above might still prove useful.

Glib answered 8/2, 2011 at 16:34 Comment(1)

UTF-8 is specifically designed so that if you jump around you can easily find the beginning of a character. – Pubescent 8/2, 2011 at 16:45

For UTF-16, you always have to jump to an even byte position. Then you can check whether a trailing surrogate follows. If so, skip it, otherwise you are at the start of a well-formed UTF-16 code unit sequence (always assuming that the file is well-formed, of course).
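A rough sketch of that check, again in Python for illustration. The name `seek_to_utf16_start` and the `little_endian` flag are assumptions of this example; it also assumes the file is well-formed and that even offsets from the start of the file are code unit boundaries (a 2-byte BOM does not disturb that alignment):

```python
import io
import struct

def seek_to_utf16_start(stream, offset, little_endian=True):
    """Move a binary stream to a UTF-16 character start at or after offset."""
    offset += offset % 2               # round up to an even byte position
    stream.seek(offset)
    unit_bytes = stream.read(2)
    if len(unit_bytes) == 2:
        (unit,) = struct.unpack("<H" if little_endian else ">H", unit_bytes)
        if 0xDC00 <= unit <= 0xDFFF:   # trailing (low) surrogate: skip it
            return stream.tell()
    stream.seek(offset)                # already at a character start (or at EOF)
    return offset

# Byte layout: a = 0..1, high surrogate = 2..3, low surrogate = 4..5, b = 6..7
data = "a😀b".encode("utf-16-le")
stream = io.BytesIO(data)
pos = seek_to_utf16_start(stream, 3)   # rounds up to 4, which is a trailing surrogate
text = stream.read().decode("utf-16-le")
```

At most one code unit is ever skipped, because a trailing surrogate is always followed by the start of a new character in well-formed UTF-16.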

The Unicode encodings UTF-8 and UTF-16 were specifically designed to be self-synchronizing, and there are strong guarantees that you only have to skip at most a small number of code units.

Girdle answered 9/2, 2011 at 14:32 Comment(0)
