Assuming that you're looking to extract a pseudo-random character from a UTF-8 file, I personally would lean away from trying to jump into a random place and then scroll forwards to a guaranteed 'start of character' position (my feeling was that this would be a tricky proposition; edit: this is wrong, see the note at the end). How about something like:
- Establish the length of the file in bytes
- Heuristically guess the number of characters, for example by scaling by a constant established from some suitable corpus, or by examining the first n bytes and seeing how many characters they describe, to get a scaling constant that might be more representative of this file
- Pick a pseudo-random number in 1..<guessed number of characters in file>
- If the file is very big (which I'm guessing it must be, else you wouldn't be asking this), use a buffered read to:
  - Read the file's bytes, decoding them from UTF-8, until you reach the desired character. If you fall off the end of the file, use the last character.
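
The estimation and pick steps above could be sketched roughly as follows. This is only an illustration: the function names (`count_chars`, `estimate_char_count`, `pick_target_index`) and the 4096-byte sample size are my own assumptions, and it assumes the file really is valid UTF-8. It counts characters in the sample by counting bytes that are not continuation bytes (those of the form 10xxxxxx), and it returns a 0-based index rather than the 1-based range described in the list.

```python
import os
import random


def count_chars(data: bytes) -> int:
    """Count UTF-8 characters: every byte that is not 10xxxxxx starts one."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def estimate_char_count(path: str, sample_size: int = 4096) -> int:
    """Guess the number of characters by scaling a sample from the file start.

    sample_size is an arbitrary choice made for this sketch.
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return 0
    # characters-per-byte ratio of the sample, applied to the whole file
    return max(1, round(file_size * count_chars(sample) / len(sample)))


def pick_target_index(path: str, rng=random) -> int:
    """Pick a 0-based character index within the guessed character count."""
    estimate = estimate_char_count(path)
    if estimate == 0:
        raise ValueError("file is empty")
    return rng.randrange(estimate)
```

Because the count is only a guess, the chosen index can overshoot the real end of the file, which is why the last step falls back to the last character.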
A buffered read here will need to use two buffers, which alternate in the role of 'first', to avoid losing context when a character's bytes are split across two reads, e.g.:
Read Buffer A : bytes 1000-1999
Read Buffer B : bytes 2000-2999
If a character occupies bytes 1998-2001, using a single buffer would lose context.
Read Buffer A : bytes 3000-3999
Now in effect buffer A follows buffer B when we convert the byte stream into characters.
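
As a rough sketch of the same idea, the version below does not manage two buffers by hand. Instead it swaps in Python's incremental UTF-8 decoder from the standard `codecs` module, which keeps any trailing partial character between calls, so a character split across two reads is still decoded correctly. The function name `char_at`, the 0-based index, and the chunk size are assumptions made for this example.

```python
import codecs


def char_at(path: str, target_index: int, chunk_size: int = 64 * 1024) -> str:
    """Return the character at 0-based target_index.

    If the file has fewer characters than that, return its last character.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    seen = 0   # characters decoded so far
    last = ""  # most recent character, used if we fall off the end
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # The decoder holds back incomplete trailing bytes until the next
            # call; final=True on the empty last read validates any leftovers.
            text = decoder.decode(chunk, final=not chunk)
            if seen + len(text) > target_index:
                return text[target_index - seen]
            if text:
                last = text[-1]
            seen += len(text)
            if not chunk:
                break
    if not last:
        raise ValueError("file contains no characters")
    return last
```

Only one chunk is held in memory at a time, so this should work for very large files, although reaching a character near the end still means decoding everything before it.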
As noted by @jleedev below, and as seen in the other answer, it is actually easy and safe to 'scroll forward' to a guaranteed character start. But the character count estimation stuff above might still prove useful.
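
For completeness, here is a minimal sketch of that 'scroll forward' approach as I understand it; it is not taken from the other answer, and the names `char_at_or_after` and `random_char_by_seek` are made up for illustration. It relies on the fact that UTF-8 continuation bytes always have the form 10xxxxxx, so any other byte starts a character. Wrapping to the start of the file when the offset lands inside the final character is my own choice, and valid UTF-8 input is assumed.

```python
import os
import random


def utf8_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def char_at_or_after(path: str, offset: int) -> str:
    """Return the first character starting at or after the given byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        lead = f.read(1)
        # Skip continuation bytes (10xxxxxx) until a character start is found.
        while lead and (lead[0] & 0xC0) == 0x80:
            lead = f.read(1)
        if not lead:
            # The offset fell inside the final character: wrap to the start
            # (an arbitrary choice made for this sketch).
            f.seek(0)
            lead = f.read(1)
        if not lead:
            raise ValueError("file is empty")
        rest = f.read(utf8_length(lead[0]) - 1)
        return (lead + rest).decode("utf-8")


def random_char_by_seek(path: str, rng=random) -> str:
    """Jump to a random byte offset and return the next whole character."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("file is empty")
    return char_at_or_after(path, rng.randrange(size))
```

Note that this picks a uniformly random byte rather than a uniformly random character, so a character that follows a multi-byte character is more likely to be chosen than one that follows a single-byte character.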