What's an easy way to read a random line from a file?
L

13

270

What's an easy way to read a random line from a file in a shell script?

Loosetongued answered 15/1, 2009 at 19:1 Comment(3)
Is each line padded to a fixed length?Florencia
no, each line has variable number of charactersLoosetongued
large file: #29103089Busterbustle
K
395

You can use shuf:

shuf -n 1 $FILE

There is also a utility called rl that does exactly what you want. On Debian it's in the randomize-lines package, though it's not available in all distros. Its home page actually recommends using shuf instead (which didn't exist when rl was created, I believe). shuf is part of GNU coreutils; rl is not.

rl -c 1 $FILE
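
If you need more than one line, note that shuf -n k draws k distinct lines (a sample without replacement); for independent draws with replacement, call shuf -n 1 in a loop. A minimal sketch, assuming GNU coreutils shuf (gshuf on macOS via Homebrew's coreutils) and a throwaway /tmp/coin.txt:

```shell
# Sketch: one draw vs. three independent draws (with replacement)
printf 'heads\ntails\n' > /tmp/coin.txt
shuf -n 1 /tmp/coin.txt                            # one random line
for i in 1 2 3; do shuf -n 1 /tmp/coin.txt; done   # three draws, repeats possible
```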
Kurtzig answered 15/1, 2009 at 19:30 Comment(17)
i really like that shuf approach!Cobos
Thanks for the shuf tip, it's built-in in Fedora.Unbiased
Does this rl have any advantages? shuf seems to work perfectly!Tannatannage
shuf is great as a drop-in replacement for head command, good to knowCoquelicot
And also, sort -R is definitely going to make one wait a lot if dealing with considerably huge files (80M lines), whereas shuf -n acts almost instantaneously.Aconcagua
You can get shuf on OS X by installing coreutils from Homebrew. Might be called gshuf instead of shuf.Harmon
Similarly, you can use randomize-lines on OS X by brew install randomize-lines; rl -c 1 $FILESelfeducated
@Rubens: the same questionAretino
@J.F.Sebastian: the same answerAconcagua
@ThomasAhle, the Debian package summary for rl's randomize-lines package states "Users are recommended to use the shuf command instead which should be available by default. This package may be considered deprecated." Therefore, shuf appears preferable.Ryon
Note that shuf is part of GNU Coreutils and therefore won't necessarily be available (by default) on *BSD systems (or Mac?). @Tracker1's perl one-liner below is more portable (and by my tests, is slightly faster).Ryon
Why is this answer in the bottom though it has the most upvotes?Nicks
@Nicks are you sorting by age?Twodimensional
This is a cool command! Yet another wheel I've reinvented not knowing it already exists in my flavor of Unix! Thank you!Feucht
though this is not suitable for huge files... I'm getting a 'shuf: read error: Cannot allocate memory' on a 70GB fileHelse
This is a great answer. I would just like to point out that in case more than 1 line is needed, shuf and rl make permutations of lines, not random draws. I.e. if you want to draw k random lines, you will want to run shuf -n 1 k times. This will draw from N^k possibilities instead of N!/(N-k)! possibilities, where N is the total number of lines. E.g., get 7 random lines from wordlist.txt: for n in {1..7}; do shuf -n1 wordlist.txt; doneKaolin
you can use process substitution if you don't want to give shuf a file: shuf -n 1 <(echo -e "heads\ntails") will randomly pick "heads" or "tails". Or just pipe to it: echo -e "heads\ntails" | shuf -n 1Simplism
H
74

Another alternative:

head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1
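
Since $RANDOM only yields 15 bits (0..32767), lines past line 32767 can never be chosen this way. A hedged sketch of one workaround, combining two draws into a 30-bit value (the /tmp/bigfile.txt name is just illustrative):

```shell
# $RANDOM is 15 bits; shifting one draw and OR-ing in a second gives 30 bits,
# enough for files of up to ~1 billion lines
seq 1 100000 > /tmp/bigfile.txt
R=$(( (RANDOM << 15) | RANDOM ))
head -n $(( R % $(wc -l < /tmp/bigfile.txt) + 1 )) /tmp/bigfile.txt | tail -n 1
```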
Hardman answered 16/1, 2009 at 8:54 Comment(6)
${RANDOM} only generates numbers less than 32768, so don't use this for large files (for example the English dictionary).Exiguous
This does not give you the precise same probability for every line, due to the modulo operation. This does barely matter if the file length is << 32768 (and not at all if it divides that number), but maybe worth noting.Organzine
You can extend this to 30-bit random numbers by using (${RANDOM} << 15) + ${RANDOM}. This significantly reduces the bias and allows it to work for files containing up to 1 billion lines.Underwater
@nneonneo: Very cool trick, though according to this link it should be OR'ing the ${RANDOM}'s instead of PLUS'ing https://mcmap.net/q/95272/-random-number-from-a-range-in-a-bash-scriptWitte
+ and | are the same since ${RANDOM} is 0..32767 by definition.Underwater
There's a heavy performance penalty to this, since it needs to count lines to be sure it's reading to the right point.Callicrates
S
74
sort --random-sort $FILE | head -n 1

(I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)

Scene answered 10/11, 2010 at 12:28 Comment(8)
+1 I like it, but you may need a very recent sort, didn't work on any of my systems (CentOS 5.5, Mac OS 10.7.2). Also, useless use of cat, could be reduced to sort --random-sort < $FILE | head -n 1Affluent
sort -R <<< $'1\n1\n2' | head -1 is as likely to return 1 and 2, because sort -R sorts duplicate lines together. The same applies to sort -Ru, because it removes duplicate lines.Yeseniayeshiva
This is relatively slow, since the whole file needs to get shuffled by sort before piping it to head. shuf selects random lines from the file, instead and is much faster for me.Neckpiece
@SteveKehlet while we're at it, sort --random-sort $FILE | head would be best, as it allows it to access the file directly, possibly enabling efficient parallel sortingInternationalist
@Internationalist Good improvement!Affluent
The --random-sort and -R options are specific to GNU sort (so they won't work with BSD or Mac OS sort). GNU sort learned those flags in 2005 so you need GNU coreutils 6.0 or newer (eg CentOS 6).Ingamar
from Wikipedia: "this is not a full random shuffle because it will sort identical lines together"Wessels
@Bengt: nothing is written until shuf reads the whole file into memory. sort may work even if the file does not fit in memory.Aretino
I
31

This is simple.

cat file.txt | shuf -n 1

Granted, this is just a tad slower than "shuf -n 1 file.txt" on its own.

Irretrievable answered 23/5, 2016 at 7:1 Comment(1)
Best answer. I didn't know about this command. Note that -n 1 specifies 1 line, and you can change it to more than 1. shuf can be used for other things too; I just piped ps aux and grep with it to randomly kill processes partially matching a name.Absolute
F
20

perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book:

perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file

This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
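
The same one-pass idea generalizes to selecting k lines (Knuth's Algorithm R). A sketch in awk, where k=3 and the seq input are just illustrative stand-ins:

```shell
# Reservoir sampling: keep the first k lines, then replace a random
# reservoir slot with probability k/NR for each later line
seq 1 1000 | awk -v k=3 'BEGIN { srand() }
  NR <= k { res[NR] = $0; next }        # fill the reservoir
  { j = int(rand() * NR) + 1            # j uniform in 1..NR
    if (j <= k) res[j] = $0 }           # replace with probability k/NR
  END { for (i = 1; i <= k; i++) print res[i] }'
```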

Florencia answered 15/1, 2009 at 19:6 Comment(6)
Just for the purposes of inclusion (in case the referred site goes down), here's the code that Tracker1 pointed to: "cat filename | perl -e 'while (<>) { push(@_,$_); } print @_[rand()*@_];';"Urbai
This is a useless use of cat. Here's a slight modification of the code found in perlfaq5 (and courtesy of the Camel book): perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' filenameQuantitative
err... the linked site, that isGathering
I just benchmarked an N-lines version of this code against shuf. The perl code is very slightly faster (8% faster by user time, 24% faster by system time), though anecdotally I've found the perl code "seems" less random (I wrote a jukebox using it).Ryon
More food for thought: shuf stores the whole input file in memory, which is a horrible idea, while this code only stores one line, so the limit of this code is a line count of INT_MAX (2^31 or 2^63 depending on your arch), assuming any of its selected potential lines fits in memory.Ryon
here's the awk equivalent. either of these answers (perl or awk) are better than the accepted for - portability, speed, and ability to manage huge files easily. awk 'BEGIN{srand()}{rand()*NR<1&&l=$0}END{print l}' file or some_input | awk 'BEGIN{srand()}{rand()*NR<1&&l=$0}END{print l}'Lindsey
A
11

Using a bash script:

#!/bin/bash
# replace with file to read
FILE=tmp.txt
# count number of lines
NUM=$(wc -l < ${FILE})
# generate random number in range 1-NUM
let "X = ${RANDOM} % ${NUM} + 1"
# extract X-th line
sed -n ${X}p ${FILE}
Adlai answered 15/1, 2009 at 19:12 Comment(10)
Random can be 0, sed needs 1 for the first line. sed -n 0p returns error.Shani
mhm - how about $1 for "tmp.txt" and $2 for NUM ?Gwenn
but even with the bug worth a point, as it does not need perl or python and is as efficient as you can get (reading the file exactly twice but not into memory - so it would work even with huge files).Gwenn
@asalamon74: thanks @blabla999: if we make a function out of it, ok for $1, but why not computing NUM?Adlai
Changing the sed line to: head -${X} ${FILE} | tail -1 should do itNitrobenzene
useless use of cat detected, wc happily takes files directlySensuous
@Hasturkun: beware - the output of wc depends on whether it reads stdin or a file name off its command line. Granted, 'wc -l < $FILE' would be OK; using 'wc -l $FILE' (no redirection) would be a bug.Fisticuffs
@Sensuous & J.Leffler: the cat was meant to avoid wc printing the file name. Fixed with the 'wc -l < $FILE' suggestion, thanksAdlai
The variable names should be quoted, especially $FILE. The curly braces are superfluous here. I recommend using lowercase or mixed-case variable names to avoid potential name collisions with shell or environment variables.Ramshackle
If a file has 32769 or more lines, the last ones are never selected. wc - l shouldn't have a space.Yeseniayeshiva
S
4

Single bash line:

sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt

Slight problem: duplicate filename.
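
The cut can be dropped by redirecting the file into wc, which then prints only the count. A sketch with a throwaway /tmp/test_lines.txt (still limited to the first 32767 lines by $RANDOM's range):

```shell
printf 'a\nb\nc\n' > /tmp/test_lines.txt
sed -n "$((RANDOM % $(wc -l < /tmp/test_lines.txt) + 1))p" /tmp/test_lines.txt
```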

Shani answered 15/1, 2009 at 19:17 Comment(2)
slighter problem. performing this on /usr/share/dict/words tends to favor words starting with "A". Playing with it, I'm at about 90% "A" words to 10% "B" words. None starting with numbers yet, which make up the head of the file.Phosphatase
wc -l < test.txt avoids having to pipe to cut.Minoan
H
3

Here's a simple Python script that will do the job:

import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))])

Usage:

python randline.py file_to_get_random_line_from
Hairy answered 15/1, 2009 at 19:7 Comment(6)
This doesn't quite work. It stops after a single line. To make it work, I did this: import random, sys lines = open(sys.argv[1]).readlines() for i in range(len(lines)): rand = random.randint(0, len(lines)-1) print lines.pop(rand),Channa
Stupid comment system with crappy formatting. Didn't formatting in comments work once upon a time?Channa
randint is inclusive therefore len(lines) may lead to IndexError. You could use print(random.choice(list(open(sys.argv[1])))). There is also memory efficient reservoir sampling algorithm.Aretino
Quite space hungry; consider a 3TB file.Masquerade
@MichaelCampbell: reservoir sampling algorithm that I've mentioned above may work with 3TB file (if line size is limited).Aretino
Using py is nice. -l assigns incoming lines to a list, l. py auto-imports stdlib modules. so you can do cat $FILE | py -l "random.choice(l)". Try it: python -m this | py -l "random.choice(l)" ... erm actually just py this | py -l "random.choice(l)" ;)Pulque
E
2

Another way using 'awk'

awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name
Earthlight answered 4/9, 2013 at 6:43 Comment(2)
That uses awk and bash ($RANDOM is a bashism). Here is a pure awk (mawk) method using the same logic as @Tracker1's cited perlfaq5 code above: awk 'rand() * NR < 1 { line = $0 } END { print line }' file.name (wow, it's even shorter than the perl code!)Ryon
That code must read the file (wc) in order to get a line count, then must read (part of) the file again (awk) to get the content of the given random line number. I/O will be far more expensive than getting a random number. My code reads the file once only. The issue with awk's rand() is that it seeds based on seconds, so you'll get duplicates if you run it consecutively too fast.Ryon
M
1

A solution that also works on MacOSX, and should also work on Linux(?):

N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file 

Where:

  • N is the number of random lines you want

  • NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2 --> save line numbers written in file1 and then print corresponding line in file2

  • jot -r $N 1 $(wc -l < $file) --> draw N numbers randomly (-r) in range (1, number_of_line_in_file) with jot. The process substitution <() will make it look like a file for the interpreter, so file1 in previous example.
Moldavia answered 17/8, 2015 at 9:10 Comment(0)
C
0

Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:

sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME

(This works even if FILENAME is empty, in which case no line is emitted.)

One possible advantage of this approach is that it only calls rand() once.

As pointed out by @AdamKatz in the comments, another possibility would be to call rand() for each line:

awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME

(A simple proof of correctness can be given based on induction.)

Caveat about rand()

"In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk."

-- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html

Camenae answered 14/12, 2015 at 21:43 Comment(1)
See the comment I posted a year before this answer, which has a simpler awk solution that doesn't require sed. Also note my caveat about awk's random number generator, which seeds at whole seconds.Ryon
L
0
#!/bin/bash

IFS=$'\n' wordsArray=($(<$1))

numWords=${#wordsArray[@]}
sizeOfNumWords=${#numWords}

while true
do
    ranNumStr=""
    for ((i=0; i<sizeOfNumWords; i++))
    do
        # one random decimal digit per digit of the line count
        ranNumStr="$ranNumStr$((RANDOM % 10))"
    done
    # 10# strips leading zeros; -lt keeps the index in range 0..numWords-1
    if [ $((10#$ranNumStr)) -lt $numWords ]
    then
        break
    fi
done

echo "${wordsArray[$((10#$ranNumStr))]}"
Lucey answered 15/6, 2017 at 13:0 Comment(1)
Since $RANDOM generates numbers less than the number of words in /usr/share/dict/words, which has 235886 (on my Mac anyway), I just generate 6 separate random numbers between 0 and 9 and string them together. Then I make sure that number is less than 235886. Then remove leading zeros to index the words that I stored in the array. Since each word is its own line this could easily be used for any file to randomly pick a line.Lucey
M
0

Here is what I discovered, since my Mac OS doesn't have all the easy answers. I used the jot command to generate a number, since the $RANDOM variable solutions seemed not to be very random in my testing. When testing my solution, I saw a wide variance in the output of the solutions provided.

RANDOM1=`jot -r 1 1 235886`
# range of jot (1 235886) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1

The echo of the variable is to get a visual of the generated random number.

Mildred answered 23/8, 2017 at 7:41 Comment(0)
