Randomly Pick Lines From a File Without Slurping It With Unix

I have a file with 10^7 lines, and I want to randomly select 1/100 of its lines. This is the AWK code I have, but it slurps the entire file content beforehand. My PC's memory cannot handle such a slurp. Is there another approach to do it?

awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {  
  for ( i=1;i<=c ;i++ )  { 
    num=int(rand() * c)
    if ( a[num] ) {
        print a[num]
        delete a[num]
        d++
    }
    if ( d == c/100 ) break
  }
 }' file
Chirp answered 28/3, 2009 at 5:55 Comment(4)
FYI 1% of 10,000,000 is too big -- you only need about 1000 to have a +/-3% margin of error and 10,000 for a +/-1% MOE. – Effeminacy
@Steven: thanks so much for this. How did you derive the MOE figure above? Any reference? I truly want to learn more since my stat background is weak. BTW, your opinion seems to be different from cadrian's below (i.e. 1% is not enough). – Chirp
americanresearchgroup.com/moe.html – Effeminacy
@StevenHuwig The link seems to be offline. Cache: web.archive.org/web/20131104235055/http://… Wikipedia: en.wikipedia.org/wiki/Margin_of_error – Attend
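
As a quick back-of-the-envelope check of those figures, assuming simple random sampling and the usual 95% normal approximation for a proportion near 50% (MOE ≈ 0.98/sqrt(n)):

awk 'BEGIN {
    for (n = 1000; n <= 10000; n *= 10)
        printf "n = %5d  ->  MOE ~ %.1f%%\n", n, 98 / sqrt(n)   # 98 = 100 * 1.96 * 0.5
}'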

If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?

In that second case, just apply a 1% random test at each line:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'

If you'd like the header line plus a random sample of lines after, use:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'
Egis answered 28/3, 2009 at 6:4 Comment(10)
The original can technically try to choose an element beyond the end of the array, since it's using delete to avoid duplicates. This one doesn't have that problem. – Amabil
@Steven: true, but statistically with such a huge file I don't think it will. I tried with "man awk", which is 2300 lines long, and I always got an extract of ~20 lines (between 17 and 22). – Egis
@Steven: Theoretically yes, practically no. The probability of printing no lines, for the OP's 10 million line file (given by the Poisson distribution), is about 10^-43430. Equivalent to cracking a 144 kilobyte encryption key by guessing on the first try. – Mina
Also, for the OP's case (10^7 lines) there's a 99.8% chance that the chosen number of elements will be within 1% of 10,000. – Mina
Good point, but still, a correct algorithm needs to be correct, not "extremely likely to be correct." IMHO anyway. – Effeminacy
Note that this algorithm will have a bias towards longer lines. If you want to correct against this, you must weight your sampling (e.g. sample lines that are 2x longer 1/2 as often). Precisely how to do this is left as an exercise for the reader (hint: build an empirical distribution of line lengths). – Redblooded
@Steven: No. An algorithm needs to be correct enough. In this case it is correct enough. – Sitsang
@Steven: You should perhaps study up on the definition of algorithm. Also be sure never to use UUIDs because they could "theoretically" collide. – Sitsang
@Sitsang The problem as originally stated (four years ago, don't you have anything better to do?) asked for 1% of lines, not 1% of lines within some arbitrary tolerance. The method outlined in this answer is unsound for producing 1% of the lines of the input. UUIDs have nothing to do with the question. – Effeminacy
Does the regex mentioned here match every line? If so, are you just doing it because it's a clever way to avoid doing a for loop over every line? – Carminacarminative

You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with perl (and without loading the entire file into memory):

cat your_file.txt | perl -n -e 'print if (rand() < .01)'

(simpler form, from comments):

perl -ne 'print if (rand() < .01)' your_file.txt 
Fulmination answered 28/3, 2009 at 6:2 Comment(6)
Theoretically that could print no values. – Effeminacy
See my comments on cadrian's answer. – Mina
@Steven, read the original post. His file had 10^7 lines. The odds of him not getting output are .99^10000000. Saying that isn't acceptable is ridiculous. If you're worried about that level of error, you shouldn't be using a computer. You're more likely to get incorrect output due to cosmic rays. – Fulmination
Okay, now that's legit. :-) – Fulmination
This version (second example) is crazy fast. I just ran through a 15GB file on an SSD in no time flat. Got a 150MB file back. Nice. – Winebibber
And in contrast, the awk version above is crazy slow. To the next random person who finds this answer from the googles: try this answer first! – Stigmasterol

I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)

Original thread: Submitted for your review -- random sampling in awk.

# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()

}

NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next

}

NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n) {
        READ_NEXT_RECORD(M)
    }

}

END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" \
            > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]

}

function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR  
} 
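
For example, a hypothetical invocation (bigfile and sample.txt are just stand-in names), either asking for a fixed sample size or computing n up front as roughly 1% of the line count:

gawk -f sample.awk -v n=1000 bigfile > sample.txt
gawk -f sample.awk -v n="$(( $(wc -l < bigfile) / 100 ))" bigfile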
Effeminacy answered 28/3, 2009 at 7:46 Comment(5)
This is "randomly choose n lines" instead of "randomly choose 1/100 lines", so a little precalculation is needed. Still cool, though.Vibrator
Yep, though I commented on the question that "randomly choose n lines" is better -- sample size isn't in direct proportion to population size.Effeminacy
here's Reservoir Sampling algorithm implementation in Python (without preserving order). Note: try/except here doesn't buy you anything: the probability that the replacement occurs for i-th item is k / i where i >> k (note: the probability that i-th item is chosen is k / n where n is the total number of items).Underside
This is a great script, all I did was add #!gawk -f as the first line to make it work on my system (and gawk brew install gawk cause I have a mac) I just have one question: Is this truly reservoir sampling?Seminary
@JeremyIglehart Yes, the reservoir is called pool here. It's not exactly the same as envisioned in the original materials, because of the dynamic associative arrays in awk vs more static arrays in other languages, but since the important characteristic of reservoir sampling is being "on-line" rather than it having a particular algorithmic complexity, I think it still counts. My added requirement of preserving order is just running two algorithms at once over the same collection, which doesn't change the reservoir sampling part into not being a reservoir sampling part.Effeminacy

This should work on most any GNU/Linux machine.

$ shuf -n $(( $(wc -l < $file) / 100)) $file

I'd be surprised if memory management was done inappropriately by the GNU shuf command.

Hahnert answered 29/3, 2009 at 5:36 Comment(3)
If you look at the source you'll see that shuf does indeed read the entire file into memory first, unfortunately. – Brittaniebrittany
wc -l < $file doesn't output the file name, so you can omit the cut command. – Pannonia
This is nice, but very inefficient, since we only need a random subset, not the whole file in a random order. – Outpatient

I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the "for each line: print it if rand() < 0.01" approach, so it might be useful if you intend to do tasks like the above many (thousands, millions of) times. It is known as reservoir sampling, and this page has a pretty good explanation of a version of it that is applicable to your situation.

Lorenzen answered 21/9, 2012 at 18:20 Comment(1)
Here's a Python implementation of the reservoir sampling algorithm (k == 1). – Underside

The problem of how to uniformly sample N elements out of a large population (of unknown size) is known as Reservoir Sampling. (If you like algorithms problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.)

A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.

Hillery answered 22/11, 2013 at 23:2 Comment(0)

In this case, reservoir sampling to get exactly k values is trivial enough with awk that I'm surprised no answer has suggested it yet. I had to solve the same problem, and I wrote the following awk program for sampling:

#!/usr/bin/env awk -f
BEGIN{
    srand();
    if(k=="") k=10              # sample size; override with -v k=...
}

NR <= k {                       # fill the reservoir with the first k lines
    reservoir[NR-1] = $0;
    next;
}

{ i = int(NR * rand()) }        # pick a random slot in 0..NR-1

i < k { reservoir[i] = $0 }     # i.e. replace a kept line with probability k/NR

END {
    for (i in reservoir) {
        print reservoir[i];
    }
}

If saved as sample_lines and made executable, it can be run like: ./sample_lines -v k=5 input_file. If k is not given, then 10 will be used by default.

Figuring out what k should be has to be done separately, for example by setting -v "k=$(dc -e "$(cat input_file | wc -l) 100 / n")"

Capillaceous answered 19/2, 2018 at 15:52 Comment(0)
L
3

You could do it in two passes:

  • Run through the file once, just to count how many lines there are
  • Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
  • Run through the file once more and pick out the lines at the selected positions

Example in Python:

fn = '/usr/share/dict/words'

from random import randint
from sys import stdout

# first pass: count the lines
count = 0
with open(fn) as f:
    for line in f:
        count += 1

# pick 1% of the line numbers, without duplicates
selected = set()
while len(selected) < count//100:
    selected.add(randint(0, count-1))

# second pass: print the lines at the selected positions
index = 0
with open(fn) as f:
    for line in f:
        if index in selected:
            stdout.write(line)
        index += 1
Lousewort answered 28/3, 2009 at 6:23 Comment(0)

Instead of waiting until the end to randomly pick your 1% of lines, do it every 100 lines inside the !/^$/ block. That way, you only hold 100 lines at a time.
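
A minimal sketch of that idea, holding at most 100 non-empty lines at a time and printing one of them at random per block (file is a stand-in for your input):

awk 'BEGIN { srand() }
!/^$/ {
    buf[n++] = $0                       # buffer non-empty lines
    if (n == 100) {                     # once 100 lines have accumulated...
        print buf[int(rand() * n)]      # ...print one of them at random
        n = 0
    }
}
END { if (n) print buf[int(rand() * n)] }   # same for a final, partial block
' file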

Jape answered 28/3, 2009 at 6:3 Comment(2)
This leads to a different distribution of random lines. E.g., you'll never have two from the same 100-line set. – Tarryn
It would also affect the order. You'd never have a line from the third set before a line from the second set. Important considerations, for sure. – Jape

If the aim is just to avoid memory exhaustion, and the file is a regular file, there is no need to implement reservoir sampling. You can make two passes over the file: one to get the number of lines (as with wc -l), and one to select the sample:

file=/some/file
awk -v percent=0.01 -v n="$(wc -l < "$file")" '
  BEGIN {srand(); p = int(n * percent)}     # p = lines still to print, n = lines still to read
  rand() * n-- < p {p--; print}' < "$file"  # take each line with probability p/n
Transcurrent answered 11/6, 2017 at 8:6 Comment(0)

Here's my version. In the code below, 'c' is the number of lines to select from the input. Making c a parameter is left as an exercise for the reader, as is the reason the line starting with c/NR works to reliably select exactly c lines (assuming the input has at least c lines).

#!/bin/sh

gawk '
BEGIN   { srand(); c = 5 }
c/NR >= rand() { lines[x++ % c] = $0 }
END { for (i in lines)  print lines[i] }

' "$@"
Gokey answered 15/6, 2022 at 2:9 Comment(0)
