How to get a random line of a text file in Java?
Asked Answered
H

7

28

Say there is a file too big to be put to memory. How can I get a random line from it? Thanks.

Update: I want to the probabilities of getting each line to be equal.

Hylozoism answered 7/2, 2010 at 19:35 Comment(0)
P
27

Here's a solution. Take a look at the choose() method which does the real thing (the main() method repeatedly exercises choose(), to show that the distribution is indeed quite uniform).

The idea is simple: when you read the first line it has a 100% chance of being chosen as the result. When you read the 2nd line it has a 50% chance of replacing the first line as the result. When you read the 3rd line it has a 33% chance of becoming the result. The fourth line has a 25%, and so on....

import java.io.*;
import java.util.*;

public class B {

  public static void main(String[] args) throws FileNotFoundException {
     Map<String,Integer> map = new HashMap<String,Integer>();
     for(int i = 0; i < 1000; ++i)
     {
        String s = choose(new File("g:/temp/a.txt"));
        if(!map.containsKey(s))
           map.put(s, 0);
        map.put(s, map.get(s) + 1);
     }

     System.out.println(map);
  }

  public static String choose(File f) throws FileNotFoundException
  {
     String result = null;
     Random rand = new Random();
     int n = 0;
     for(Scanner sc = new Scanner(f); sc.hasNext(); )
     {
        ++n;
        String line = sc.nextLine();
        if(rand.nextInt(n) == 0)
           result = line;         
     }

     return result;      
  }
}
Platt answered 7/2, 2010 at 19:52 Comment(3)
An implementation of reservoir samplingHaematocryal
Amazing. Never heard about reservoir sampling. What about if my file is MBs? Are there any performace issues? If yes, are there alternatives to avoid a full file scan?Bead
Am I right and assuming this is for a fixed n=1, where n is number of 'samples'? Is there a way to get choose to choose more than one at once? as it stands you 'loop over the tape' more than once, or at least attempt to which seems inefficient.Enlil
L
29

Reading the entire file if you want only one line seems a bit excessive. The following should be more efficient:

  1. Use RandomAccessFile to seek to a random byte position in the file.
  2. Seek left and right to the next line terminator. Let L the line between them.
  3. With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.

This is a variant of rejection sampling.

Line lengths include the line terminator character(s), hence MIN_LINE_LENGTH >= 1. (All the better if you know a tighter bound on line length).

It is worth noting that the runtime of this algorithm does not depend on file size, only on line length, i.e. it scales much better than reading the entire file.

Lenhart answered 7/2, 2010 at 21:13 Comment(2)
Excellent! If the file will be be sampled repeatedly, use a single pass to collect a List<Integer> of offsets, which may then be randomized via Collections.shuffle().Unburden
This should be the best answer.Huba
P
27

Here's a solution. Take a look at the choose() method which does the real thing (the main() method repeatedly exercises choose(), to show that the distribution is indeed quite uniform).

The idea is simple: when you read the first line it has a 100% chance of being chosen as the result. When you read the 2nd line it has a 50% chance of replacing the first line as the result. When you read the 3rd line it has a 33% chance of becoming the result. The fourth line has a 25%, and so on....

import java.io.*;
import java.util.*;

public class B {

  public static void main(String[] args) throws FileNotFoundException {
     Map<String,Integer> map = new HashMap<String,Integer>();
     for(int i = 0; i < 1000; ++i)
     {
        String s = choose(new File("g:/temp/a.txt"));
        if(!map.containsKey(s))
           map.put(s, 0);
        map.put(s, map.get(s) + 1);
     }

     System.out.println(map);
  }

  public static String choose(File f) throws FileNotFoundException
  {
     String result = null;
     Random rand = new Random();
     int n = 0;
     for(Scanner sc = new Scanner(f); sc.hasNext(); )
     {
        ++n;
        String line = sc.nextLine();
        if(rand.nextInt(n) == 0)
           result = line;         
     }

     return result;      
  }
}
Platt answered 7/2, 2010 at 19:52 Comment(3)
An implementation of reservoir samplingHaematocryal
Amazing. Never heard about reservoir sampling. What about if my file is MBs? Are there any performace issues? If yes, are there alternatives to avoid a full file scan?Bead
Am I right and assuming this is for a fixed n=1, where n is number of 'samples'? Is there a way to get choose to choose more than one at once? as it stands you 'loop over the tape' more than once, or at least attempt to which seems inefficient.Enlil
H
10

Either you

  1. read the file twice - once to count the number of lines, the second time to extract a random line, or

  2. use reservoir sampling

Haematocryal answered 7/2, 2010 at 19:39 Comment(0)
E
6

Looking over Itay's answer, it looks as though it reads the file a thousand times over after sampling one line of the code, whereas true reservoir sampling should only go over the 'tape' once. I've devised some code to go over code once with real reservoir sampling, based on this and the various descriptions on the web.

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

public class reservoirSampling {

    public static void main(String[] args) throws FileNotFoundException, IOException{
        Sampler mySampler = new Sampler();
        List<String> myList = mySampler.sampler(10);
        for(int index = 0;index<myList.size();index++){
            System.out.println(myList.get(index));
        }
    }
}

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;

public class Sampler {

    public Sampler(){}
    public List<String> sampler (int reservoirSize) throws FileNotFoundException, IOException
    {
        String currentLine=null;
        //reservoirList is where our selected lines stored
        List <String> reservoirList= new ArrayList<String>(reservoirSize); 
        // we will use this counter to count the current line number while iterating
        int count=0; 

        Random ra = new Random();
        int randomNumber = 0;
        Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
        while (sc.hasNext())
        {
            currentLine = sc.next();
            count ++;
            if (count<=reservoirSize)
            {
                reservoirList.add(currentLine);
            }
            else if ((randomNumber = (int) ra.nextInt(count))<reservoirSize)
            {
                reservoirList.set(randomNumber, currentLine);
            }
        }
        return reservoirList;
    }
}

The basic premise is that you fill up the reservoir, and then go back to it and fill in random lines with a 1/ReservoirSize chance. I hope this provides more efficient code. Please let me know if this doesn't work for you, as I've literally knocked it up in half an hour.

Enlil answered 1/10, 2012 at 19:15 Comment(1)
I've put this up for review.Enlil
Q
2

Use RandomAccessFile:

  1. Construct a RandomAccessFile, file
  2. Get the length of that file, filelen, by calling file.length()
  3. Generate a random number, pos, between 0 and filelen
  4. Call file.seek(pos) to seek to the random position
  5. Call file.readLine() to get to the end of the current line
  6. Read the next line by calling file.readLine() again

Using this method, I've been sampling lines from the Brown Corpus at random, and can easily retrieve a 1000 random samples from randomly chosen files in a few seconds. If I tried to do the same by reading through each file line-by-line it would take me much longer.

The same principle can be used for selecting random elements from a list. Rather than reading through the list and stopping at a random place, if you generate a random number between 0 and the length of the list, then you can index directly into the list.

Quantize answered 7/12, 2018 at 16:50 Comment(0)
S
2

Reading a random line from a file in java:

public String getRandomLineFromTheFile(String filePathWithFileName) throws Exception {
        File file = new File(filePathWithFileName); 
        final RandomAccessFile f = new RandomAccessFile(file, "r");
        final long randomLocation = (long) (Math.random() * f.length());
        f.seek(randomLocation);
        f.readLine(); //will move file pointer to the next line
        String randomLine = f.readLine();
        f.close();
        return randomLine;
    }

See the RandomAccessFile documentation.

Sen answered 8/3, 2021 at 10:31 Comment(0)
G
-1

Use a BufferedReader and read line wise. Use the java.util.Random object to stop randomly ;)

Glassblowing answered 7/2, 2010 at 19:36 Comment(4)
How do I ensure the file isn't over when I want to stop? I.e. how do I know the number of lines if a file?Hylozoism
Also, I want the probalilities of getting each single line to be equal.Hylozoism
@Dinuk, so if the file is smaller than the others, I'll get the last line too often, if the file is larger - I'll get it too seldomHylozoism
THen you have to read the file twice or if all lines have an equal length, you can calculate the number of lines from the file sizeGlassblowing

© 2022 - 2024 — McMap. All rights reserved.