Efficiently choosing a random line from a text file with uniform probability in C?

This is essentially a more constrained version of this question.

Suppose we have a very large text file, containing a large number of lines.

We need to choose a line at random from the file, with uniform probability, but there are constraints:

  • Because this is a soft realtime application, we cannot iterate over the entire file. The choice should take a constant-ish amount of time.
  • Because of memory constraints, the file cannot be cached.
  • Because the file is permitted to change at runtime, the length of the file cannot be assumed to be a constant.

My first thought is to use an lstat() call to get the total filesize in bytes. fseek() can then be used to directly access a random byte offset, getting something like O(1) access into a random part of the file.

The problem is that we can't then do something like read to the next newline and call it a day, because that would produce a distribution biased toward long lines.

My first thought for solving this issue is to read forward until the first n newlines have been seen (wrapping back to the file's beginning if required), and then choose a line with uniform probability from this smaller set. It is safe to assume the file's contents are randomly ordered, so this sub-sample should be uniform with respect to length, and, since its starting point was selected uniformly from all possible points, it should represent a uniform choice from the file as a whole. So, in pseudo-C, our algorithm looks something like:

 lstat(filepath, &filestat);
 fseek(file, (long)(filestat.st_size*drand48()), SEEK_SET);
 char sample[n][BUFSIZ];
 for(int i=0;i<n;i++)
     fgets(sample[i], BUFSIZ, file); //plus some stuff to deal with file wrap around...
 return sample[(int)(n*drand48())];

This doesn't seem like an especially elegant solution, and I'm not completely confident it will be uniform, so I'm wondering if there's a better way to do it. Any thoughts?

EDIT: On further consideration, I'm now pretty sure that my method is not uniform, since the starting point is more likely to land inside a longer line, which biases the sample. Tricky!

Braces answered 20/11, 2012 at 17:0 Comment(12)
What about doing 2 fseek()s to 2 random places and using the lines that immediately follow each of them?Kishke
@Nick Wouldn't that also be biased toward picking longer lines? Longer lines have more bytes, so isn't that distribution skewed to begin with? Making multiple draws from a skewed distribution doesn't address the skew.Braces
Do you have an estimate of how long the lines are in comparison to the whole filesize?Runoff
@SergeyL. Only that they will tend to be small fractions of the total size in most cases, but no firm promises. It is probably safe to assume some upper bound (e.g. BUFSIZ as I have done in the example) though. To clarify: I don't know the distribution of sizes, only the maximum length.Braces
I am asking more in terms of: are there hundreds or tens of thousands of lines in the file?Runoff
@SergeyL. Ah, I see now. It can vary, but there are definitely use cases with tens of thousands. Probably not hundreds of thousands however.Braces
In this case I would do it exactly as you suggested. Just instead of reading it in line by line use mmap to map the file into memory. This will save you memory and avoid swapping. Non-dirty unused pages will be just removed since they are file-backed.Runoff
Is it feasible to build an index of the line beginnings? You said the file might change at run time. If the changes are limited to appending additional lines, then updating the index would be a trivial operation.Turd
@AdrianMcCarthy A good thought, but unfortunately not only can the file's entire contents be overwritten, but the targeted file itself can change at runtime!Braces
I think your requirements are contradictory — or close to it. You want uniform choice of lines when you don't know how many lines there are and without reading the whole file and when the file can change at any time (presumably including 'while you are processing it'). To get the lines with a uniform probability, the mechanism in the x-ref'd question will work (but things might be screwy if the file changes while you're reading it). If you can't read the whole file while it is stable, you'll never be able to be accurate. You can approximate away; lots of ways of doing that. But accurately?Decker
To me it sounds like you should reconsider your data storage strategy. Having a large file that changes very frequently and needing a random access pattern just don't match up. The problem you are trying to solve is a symptom of a problem in the step before (which we are not seeing here). And I'm fully aware that you might have a legacy app that you can't change, but have you tried talking to whoever is generating the data?Genoa
@Genoa So I glossed over some of the details, but basically, in addition to being appended to or overwritten, the file itself can change at runtime. That is, the user can specify something like "After some arbitrary number of seconds running, choose lines from this other file instead." The present solution is to use the method from the x-ref'd question, but in practice this is much too slow. As in, the application's runtime in practice is doubled by this function.Braces
C
2

Select a random character from the file (via rand and seek as you noted). Now, instead of finding the associated newline, since that is biased as you noted, I would apply the following algorithm:


Is the character a newline character?
   yes - use the preceding line
   no  - try again

I can't see how this could give anything but a uniform distribution of lines. The efficiency depends on the average length of a line. If your file has relatively short lines, this could be workable, though if the file really can't be precached even by the OS, you might pay a heavy price in physical disk seeks.
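
A minimal sketch of this approach in C, for illustration (the /usr/share/dict/words path is borrowed from the asker's test setup, pick_line is just an illustrative name, and the backward scan to the start of the selected line is the naive byte-at-a-time version):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <sys/stat.h>

    /* Keep picking uniformly random byte offsets until one lands exactly on
       a '\n'; every line contributes exactly one newline, so the line ending
       at that newline is a uniform choice. */
    static int pick_line(FILE *f, long size, char *out, int outlen)
    {
        for (;;) {
            long pos = (long)(size * drand48());
            fseek(f, pos, SEEK_SET);
            if (fgetc(f) != '\n')
                continue;                 /* not a newline: reject, try again */

            long start = pos;             /* scan back to start of this line  */
            while (start > 0) {
                fseek(f, start - 1, SEEK_SET);
                if (fgetc(f) == '\n')
                    break;
                start--;
            }
            fseek(f, start, SEEK_SET);
            return fgets(out, outlen, f) ? 0 : -1;
        }
    }

    int main(void)
    {
        srand48(time(NULL));
        struct stat st;
        if (stat("/usr/share/dict/words", &st) != 0) return 1;
        FILE *f = fopen("/usr/share/dict/words", "r");
        if (!f) return 1;

        char line[BUFSIZ];
        if (pick_line(f, (long)st.st_size, line, sizeof line) == 0)
            fputs(line, stdout);
        fclose(f);
        return 0;
    }

Note that if the file lacks a trailing newline, the last line can never be selected this way (see the comment below).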

Chess answered 21/11, 2012 at 20:11 Comment(2)
Makes sense to me. I think this ends up with identical performance to my answer though, since on average you need to draw total_bytes/#newlines characters, which is just the average line length in the system. It might be a little faster, though, because you avoid reading in a whole line until after you've selected it, so the constant factors in its runtime should be smaller. Thanks!Braces
Seems like the best solution. You should be careful of files without trailing newlines; on those, you will never select the last line. For this reason, and because reading forwards in the file is generally easier than reading backwards, you might want to treat newlines as beginning of lines rather than ends. If you choose n, check if n is 0 or if the character at position n-1 in the file is a newline; if so, read from character n until the end of its line. If you do this, you will want to check first if the last character of the file is a newline, and subtract 1 from your range if so.Comrade
B
2

A solution was found that works surprisingly well. Documenting it here for myself and others.

This example code does around 80,000 draws per second in practice, with a mean line length that matches that of the file to 4 significant digits on most runs. In contrast, I get around 250 draws per second using the method from the cross-referenced question.

Essentially what it does is sample a random place in the file, then accept the line found there with probability inversely proportional to its length, otherwise discarding it and drawing again. This cancels out the bias toward longer lines. Since the chance of accepting any given draw is (number of lines)/(total bytes), i.e. one over the mean line length, the method on average makes a number of draws equal to the mean line length before accepting one.

Some notable drawbacks:

  • Files with longer lines will produce more rejections per draw, making this much slower.
  • Files with longer lines require a larger constant than 50 in the rdraw function, which in practice appears to mean much longer seek times if line lengths exhibit high variance. For instance, setting it to BUFSIZ on one file I tested with reduced draw speeds to around 10,000 draws per second. Still much faster than counting lines in the file, though.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <sys/stat.h>

    /* Seek a little before a uniformly random byte offset, then read forward
       in chunks until the chosen offset has been passed; storage then holds
       the line (or chunk of a long line) containing that offset.
       Returns the length of that candidate line. */
    int rdraw(FILE* where, char *storage, size_t bytes){
        int offset = (int)(bytes*drand48());
        int initial_seek = offset>50?offset-50:0;
        fseek(where, initial_seek, SEEK_SET);
        int chars_read = 0;
        while(chars_read + initial_seek < offset){
            fgets(storage, 50, where);
            chars_read += strlen(storage);
        }
        return strlen(storage);
    }

    int main(){
        srand48(time(NULL));
        struct stat blah;
        stat("/usr/share/dict/words", &blah);
        FILE *where = fopen("/usr/share/dict/words", "r");
        off_t bytes = blah.st_size;
        char b[BUFSIZ+1];

        int i;
        for(i=0; i<1000000; i++){
            /* Rejection step: accept the candidate with probability 1/length,
               otherwise draw again; this cancels the bias toward long lines. */
            while(drand48() > 1.0/(rdraw(where, b, bytes)));
        }
        return 0;
    }
    
Braces answered 21/11, 2012 at 15:0 Comment(0)
W
1

If the file only changes at the end (i.e., lines are only appended), you can create an algorithm with uniform probability:

Preparation: Create an index file that contains the offset of every n-th line. Use a fixed-width format so that the position within the index can be used to determine which record you have.

  1. Open the index file and read the last record. Use ftell to determine the record number.

  2. Open the big file and fseek to the offset obtained in step 1.

  3. Read the big file to the end, counting the number of newlines. You now have the total number of lines in the big file.

  4. Generate a random number up to the number of lines obtained in step 3.

  5. fseek to, and read, the appropriate record in the index file.

  6. fseek to the appropriate offset in the large file, then skip forward the remainder number of lines.

  7. Read the line! (A code sketch of steps 4-7 follows the example below.)

Example

Let's assume we chose n=100 and that the large file contains 367 lines.

Index file:

00000000,00004753,00009420,00016303
  1. The index file has 4 records, so the large file contains at least 300 lines (100 * (4-1)). The last offset is 16303.

  2. Open the large file and fseek to 16303.

  3. Count the remaining number of lines (67).

  4. Generate a random number in the range [0-366]. Let's say we got 112.

  5. 112/100 = 1 with 12 as remainder. Read the index file record at index 1 (the second record). We get the result 4753.

  6. fseek to 4753 in the large file and then skip 11 (12-1) lines.

  7. Read the 12th line.

Voila!
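
For concreteness, here is a rough sketch of steps 4-7 in C, assuming the fixed-width index format shown above (8-digit offsets separated by commas, so each record occupies 9 bytes), a 0-based line number, and that total_lines has already been obtained in steps 1-3; read_random_line and the constants are illustrative only:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 100            /* one index record per N lines */
    #define REC_WIDTH 9      /* 8-digit offset plus one separator character */

    /* Steps 4-7: pick a uniform line number below total_lines, look up the
       nearest indexed offset, then skip the remaining lines sequentially. */
    static int read_random_line(FILE *big, FILE *idx, long total_lines,
                                char *out, int outlen)
    {
        long line = (long)(total_lines * drand48());      /* step 4 */

        char rec[REC_WIDTH + 1];
        fseek(idx, (line / N) * REC_WIDTH, SEEK_SET);     /* step 5 */
        if (fread(rec, 1, 8, idx) != 8) return -1;
        rec[8] = '\0';
        long offset = atol(rec);

        fseek(big, offset, SEEK_SET);                     /* step 6 */
        for (long skip = line % N; skip > 0; skip--)
            if (!fgets(out, outlen, big)) return -1;

        return fgets(out, outlen, big) ? 0 : -1;          /* step 7 */
    }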

Edit:

I saw the comment about the target file changing. If the target file changes only rarely, then this may still be a viable approach. You would need to create a new index file before switching target files. You may also want to update the index file when the target file has grown by more than n rows.

Weaponeer answered 20/11, 2012 at 17:37 Comment(3)
Nice, and quite elegant, but unfortunately, in my application, the file's contents can change arbitrarily. Thanks for the suggestion though. I'll have to look at how often arbitrary updates happen in practice (since appends are more common in my case). It might be that this amortizes out nicely even if I have to rebuild the index once in a while.Braces
If you do some statistical analysis on your target files, you can calculate how the row length variations affect the probability in the approach you suggested. Perhaps the probability skewing is acceptable with a sample size of 20-100 lines?Span
Yes, that seems like a possibility too. I'm also looking into rejection sampling to adjust for the bias in the distribution, but it seems like it could be expensive.Braces
