Say there is a file too big to fit in memory. How can I get a random line from it? Thanks.
Update: I want the probability of getting each line to be equal.
Here's a solution. Take a look at the choose() method, which does the real work (the main() method repeatedly exercises choose(), to show that the distribution is indeed quite uniform).
The idea is simple: when you read the first line it has a 100% chance of being chosen as the result. When you read the 2nd line it has a 50% chance of replacing the first line as the result. When you read the 3rd line it has a 33% chance of becoming the result. The fourth line has a 25% chance, and so on. By induction, after n lines have been read, each of them is the current result with probability 1/n, so every line in the file is equally likely to be returned.
import java.io.*;
import java.util.*;

public class B {
    public static void main(String[] args) throws FileNotFoundException {
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (int i = 0; i < 1000; ++i) {
            String s = choose(new File("g:/temp/a.txt"));
            if (!map.containsKey(s))
                map.put(s, 0);
            map.put(s, map.get(s) + 1);
        }
        System.out.println(map);
    }

    public static String choose(File f) throws FileNotFoundException {
        String result = null;
        Random rand = new Random();
        int n = 0;
        for (Scanner sc = new Scanner(f); sc.hasNextLine(); ) {
            ++n;
            String line = sc.nextLine();
            // the n-th line replaces the current result with probability 1/n
            if (rand.nextInt(n) == 0)
                result = line;
        }
        return result;
    }
}
Reading the entire file if you want only one line seems a bit excessive. The following should be more efficient:
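Here is a minimal sketch of what such a seek-and-accept loop might look like (the class name RandomLine and the constant value are illustrative assumptions; it also assumes a non-empty file with '\n' line terminators):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class RandomLine {
    // assumed lower bound on line length in bytes, including the terminator
    private static final int MIN_LINE_LENGTH = 1;

    public static String getRandomLine(File f) throws IOException {
        Random rand = new Random();
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            while (true) {
                // a uniformly random byte lands in a line with probability
                // proportional to that line's length
                long pos = (long) (rand.nextDouble() * raf.length());
                // scan backwards to the start of the line containing pos
                while (pos > 0) {
                    raf.seek(pos - 1);
                    if (raf.read() == '\n')
                        break;
                    pos--;
                }
                raf.seek(pos);
                String line = raf.readLine();
                long lineLength = raf.getFilePointer() - pos; // includes the terminator
                // accept with probability MIN_LINE_LENGTH / lineLength, which
                // cancels the length bias and makes every line equally likely
                if (rand.nextInt((int) lineLength) < MIN_LINE_LENGTH)
                    return line;
            }
        }
    }
}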
This is a variant of rejection sampling.
Line lengths include the line terminator character(s), hence MIN_LINE_LENGTH >= 1. (All the better if you know a tighter bound on line length).
It is worth noting that the runtime of this algorithm does not depend on file size, only on line length, i.e. it scales much better than reading the entire file.
A commenter suggests an alternative: record each line's start position in a List<Integer> of offsets, which may then be randomized via Collections.shuffle(). – Unburden
Either you read the file twice - once to count the number of lines, the second time to extract a random line - or you use reservoir sampling to pick a uniformly random line in a single pass.
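A minimal sketch of the two-pass variant, assuming a non-empty file (the class and method names are illustrative):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Random;
import java.util.Scanner;

public class TwoPassChooser {
    public static String chooseTwoPass(File f) throws FileNotFoundException {
        // pass 1: count the lines
        int lineCount = 0;
        try (Scanner sc = new Scanner(f)) {
            while (sc.hasNextLine()) {
                sc.nextLine();
                lineCount++;
            }
        }
        // pick a uniform index, then pass 2: skip to it
        int target = new Random().nextInt(lineCount);
        try (Scanner sc = new Scanner(f)) {
            for (int i = 0; i < target; i++) {
                sc.nextLine();
            }
            return sc.nextLine();
        }
    }
}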
Looking over Itay's answer, it looks as though it reads the entire file once for each sample (a thousand times in the example), whereas true reservoir sampling should go over the 'tape' only once. I've devised some code that goes over the file once with real reservoir sampling, based on this and various descriptions on the web.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

public class ReservoirSampling {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        Sampler mySampler = new Sampler();
        List<String> myList = mySampler.sampler(10);
        for (int index = 0; index < myList.size(); index++) {
            System.out.println(myList.get(index));
        }
    }
}
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;

public class Sampler {
    public Sampler() {}

    public List<String> sampler(int reservoirSize) throws FileNotFoundException, IOException {
        String currentLine = null;
        // reservoirList is where our selected lines are stored
        List<String> reservoirList = new ArrayList<String>(reservoirSize);
        // we will use this counter to track the current line number while iterating
        int count = 0;
        Random ra = new Random();
        int randomNumber = 0;
        Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
        while (sc.hasNext()) {
            currentLine = sc.next();
            count++;
            if (count <= reservoirSize) {
                // fill the reservoir with the first reservoirSize lines
                reservoirList.add(currentLine);
            } else if ((randomNumber = ra.nextInt(count)) < reservoirSize) {
                // replace a random reservoir entry with probability reservoirSize/count
                reservoirList.set(randomNumber, currentLine);
            }
        }
        sc.close();
        return reservoirList;
    }
}
The basic premise is that you fill up the reservoir with the first reservoirSize lines; each subsequent line then replaces a randomly chosen reservoir entry with probability reservoirSize/count, so every line ends up in the sample with equal probability. I hope this provides more efficient code. Please let me know if this doesn't work for you, as I've literally knocked it up in half an hour.
Use RandomAccessFile:
Using this method, I've been sampling lines from the Brown Corpus at random, and can easily retrieve 1000 random samples from randomly chosen files in a few seconds. If I tried to do the same by reading through each file line by line, it would take much longer.
The same principle can be used for selecting random elements from a list: rather than reading through the list and stopping at a random place, generate a random number between 0 and the length of the list and index directly into it.
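For example, a minimal sketch of that direct indexing (the list contents here are placeholders):

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class RandomPick {
    public static void main(String[] args) {
        // placeholder list standing in for lines already read into memory
        List<String> lines = Arrays.asList("alpha", "beta", "gamma");
        Random rand = new Random();
        // index directly instead of iterating and stopping at a random place
        String pick = lines.get(rand.nextInt(lines.size()));
        System.out.println(pick);
    }
}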
Reading a random line from a file in Java:

public String getRandomLineFromTheFile(String filePathWithFileName) throws Exception {
    File file = new File(filePathWithFileName);
    final RandomAccessFile f = new RandomAccessFile(file, "r");
    final long randomLocation = (long) (Math.random() * f.length());
    f.seek(randomLocation);
    f.readLine(); // skip the rest of the current line; the pointer now sits at the start of the next one
    String randomLine = f.readLine();
    if (randomLine == null) {
        // the seek landed inside the last line; wrap around to the first line
        f.seek(0);
        randomLine = f.readLine();
    }
    f.close();
    // note: a line's chance of selection is proportional to the length of the
    // line preceding it, so this is fast but not perfectly uniform
    return randomLine;
}
See the RandomAccessFile documentation.
Use a BufferedReader and read line by line. Use the java.util.Random object to stop randomly ;)
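Stopping with equal probability for every line still needs the 1/n replacement trick shown above; here is a minimal sketch combining it with a BufferedReader (the class name BufferedChooser is illustrative):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

public class BufferedChooser {
    public static String choose(File f) throws IOException {
        Random rand = new Random();
        String result = null;
        int n = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            for (String line; (line = in.readLine()) != null; ) {
                ++n;
                // keep the n-th line with probability 1/n
                if (rand.nextInt(n) == 0) {
                    result = line;
                }
            }
        }
        return result;
    }
}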