Read a small random sample from a big CSV file into a Pandas data frame

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

Elmiraelmo answered 7/3, 2014 at 19:0 Comment(4)
You can read a specific number of rows and skip rows using the nrows and skiprows parameters, but I don't know how you could read a random sample of lines using read_csvJimmie
See related: #10820411 although the problem here is appending to your dataframe 10,000 times. Even if you built a list or dict for the temporary storage this would be slow and wasteful IMOJimmie
Here is how to do it with an HDF5 file; it is straightforward to convert your CSV to HDF5 and then use this recipe: #21040272Yeung
skiprows=lambda i: i % k for every kth rowRounder

Assuming no header in the CSV file:

import pandas
import random

n = 1000000  # number of records in file
s = 10000    # desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n), n - s))
df = pandas.read_csv(filename, skiprows=skip)

It would be better if read_csv had a keeprows parameter, or if skiprows took a callback function instead of a list.

With header and unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
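As a hedged variant (not part of the original answer): newer pandas versions accept a callable for skiprows, as the answer below notes, so you can build the much smaller set of row numbers to keep instead of materialising the (n - s)-element skip list:

import pandas as pd
import random

filename = "data.txt"
n = 1000000  # number of records in the file (excluding the header)
s = 10000    # desired sample size

# keep a set of s row numbers (row 0 is the header) and skip everything else
keep = set(random.sample(range(1, n + 1), s))
df = pd.read_csv(filename, skiprows=lambda i: i > 0 and i not in keep)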
Vesicate answered 7/3, 2014 at 19:29 Comment(4)
Just make sure you read the header first, or chop it off altogether; nasty things happen when you don't.Argonaut
Edited to include the case where the file has a header.Vesicate
Now skiprows does accept a callable, see my answer below: https://mcmap.net/q/194883/-read-a-small-random-sample-from-a-big-csv-file-into-a-pandas-data-frameEdirne
Use range() for Python 3.xJustificatory

@dlm's answer is great, but since v0.20.0 skiprows does accept a callable. The callable receives the row number as its argument.

Note also that their answer for unknown file length relies on iterating through the file twice -- once to get the length, and then another time to read the csv. I have three solutions here which only rely on iterating through the file once, though they all have tradeoffs.

Solution 1: Approximate Percentage

If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:

import pandas as pd
import random
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)

As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired usecase.
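If you roughly know how many lines the file has and want to cap the result at an exact size, one hedged refinement (not from the original answer) is to oversample slightly and then trim with DataFrame.sample:

import pandas as pd
import random

s = 10000   # exact sample size wanted
p = 0.012   # a bit more than s divided by the approximate line count, so we almost surely get >= s rows

df = pd.read_csv(
    filename,
    header=0,
    skiprows=lambda i: i > 0 and random.random() > p
)
if len(df) > s:
    df = df.sample(n=s)  # trim the oversampled frame down to exactly s rows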

Solution 2: Every Nth line

This isn't actually a random sample, but depending on how your input is sorted and what you're trying to achieve, this may meet your needs.

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)

Solution 3: Reservoir Sampling

(Added July 2021)

Reservoir sampling is an elegant algorithm for selecting k items randomly from a stream whose length is unknown, but that you only see once.

The big advantage is that you can use this without having the full dataset on disk, and that it gives you an exactly-sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas; I think you need to drop into Python to read the file and then construct the dataframe afterwards. So you may lose some functionality from read_csv, or need to reimplement it, since we're not using pandas to actually read the file.

Taking an implementation of the algorithm from Oscar Benjamin here:

from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO

import pandas as pd

def reservoir_sample(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items

    from https://bugs.python.org/issue41311#msg373733
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

def sample_file(filepath, k):
    with open(filepath, 'r') as f:
        header = next(f)
        result = [header] + reservoir_sample(f, k)
    df = pd.read_csv(StringIO(''.join(result)))
    return df

The reservoir_sample function returns a list of strings, each of which is a single row, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header row; I haven't thought about how to extend it to other situations.
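For reference, a usage sketch (assuming the sample_file shown above, which returns the dataframe):

df = sample_file("data.csv", k=10000)  # hypothetical file with a single header row
print(df.shape)   # (10000, n_columns), or fewer rows if the file is smaller
df.describe()     # simple statistics on the sample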

I tested this locally and it is much faster than the other two solutions. Using a 550 MB csv (January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take ~3-4 seconds.

In my test this is even slightly (~10-20%) faster than @Bar's answer using shuf, which surprises me.

Edirne answered 2/2, 2018 at 19:38 Comment(4)
This solution doesn't guarantee exactly X% of lines. What if random.random() returned a number more than 0.01 all the time?Kella
@Kella that's correct, it doesn't guarantee exactly X%.Edirne
here's another reservoir_sample(it, k) implementation in Python (perhaps, it might be less efficient but it is easier to understand)Jehad
@Edirne is sample_iter supposed to be included in this answer?Midriff

This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.

Relevant question: https://unix.stackexchange.com/q/108581

Benchmark on a 7M-line CSV available here (2008):

Top answer:

import pandas
import random

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
    s = 100000  # desired sample size
    skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

Timing for pandas:

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

While using shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12x faster and importantly does not read the whole file into memory.
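One caveat, also raised in the comments below: shuf treats the header as just another line, so it usually ends up dropped or shuffled into the sample. A hedged sketch (not part of the original answer) of reading the sample back into pandas while recovering the column names, assuming the header was stripped before shuffling (e.g. with tail -n +2):

import pandas as pd

original = "2008.csv"  # hypothetical paths; adjust to your files
sample = "temp.csv"    # produced by: tail -n +2 2008.csv | shuf -n 100000 > temp.csv

# read only the header row of the original file to recover the column names,
# then read the header-less sample with those names
columns = pd.read_csv(original, nrows=0).columns
df = pd.read_csv(sample, header=None, names=columns)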

Accad answered 10/6, 2016 at 17:50 Comment(4)
I would just add to take out the header lines (with tail, for example).Darwindarwinian
tail -n +2 <file> | shuf -n <nrows> -o <newfile> && sed -i '1i<header>' <newfile> doesn't seem very exquisite but it worked for me.Betake
For Mac users who don't find shuf right away: first install it with brew install coreutils and then use the equivalent gshuf. This solution is much faster than calling random.Johm
For Windows users you can use Git Bash (comes included in Git for Windows)Womanhater

Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i, the algorithm uses the sample to randomly replace an already selected sample.

By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.

See code below:

import random

filename = "data.csv"  # the big file to sample from
n_samples = 10
samples = []

with open(filename) as f:
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples * 1. / (i + 1):
            samples[random.randint(0, n_samples - 1)] = line
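Since the question asks for a pandas dataframe, a hedged follow-up (not part of the original answer) is to parse the sampled lines afterwards, assuming the file has no header row:

import pandas as pd
from io import StringIO

# turn the sampled lines into a dataframe; if the file has a header,
# skip it before sampling and pass the column names in via names=
df = pd.read_csv(StringIO(''.join(samples)), header=None)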
Amido answered 18/3, 2016 at 18:5 Comment(1)
But doesn't enumerate require loading the entire file into memory?Crystlecs

The following code first reads the header, and then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
Paramour answered 10/1, 2015 at 14:50 Comment(1)
This solution does not determine the actual number of rows in the CSV file; it is hard-coded here.Benedictbenedicta
from random import randint

class magic_checker:
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        # iter() compares each value read to this sentinel; report "equal"
        # once target_count lines have been read, which stops the iteration
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
with open("big.csv") as f:
    f.seek(seek_target)
    f.readline()  # discard this (possibly partial) line
    rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

Something like that should work, I think.

Figureground answered 7/3, 2014 at 19:29 Comment(0)

No pandas!

import random
from os import fstat
from sys import exit

# Open in binary mode: Python 3 text-mode files do not allow
# relative seeks (whence=1) with a non-zero offset
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we may have stopped in the middle of a line,
    # so skip the current (possibly partial) line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline().decode())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline().decode())

print(sampled_lines)

You'll end up with a sampled_lines list. What kind of statistics do you mean?
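To connect this back to the question's pandas dataframe and simple statistics, a hedged sketch (not part of the original answer; it assumes the sampled lines come from a CSV rather than the word list used above):

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(''.join(sampled_lines)), header=None)
df.describe(include='all')  # simple summary statistics of the sampled rows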

Cha answered 7/3, 2014 at 23:8 Comment(1)
great to have the code without installing a module...I added the following to get a txt file output ---- filename='random_lines.csv' target = open(filename, 'w') and then in "if not is_EOF" I added target.write(f.readline()) target.write("\n")Writ

use subsample

pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv
Lambard answered 17/4, 2018 at 21:23 Comment(3)
Created empty file, did not populate. Windows10 using anaconda4 environment.Magdalenemagdalenian
Didn't work for me. Shell froze or very long process.Caren
This won't work and will fail with a ParseError exception when loading with pd.read_csvPluckless

You can also create a sample of the 10,000 records before bringing it into the Python environment.

Using Git Bash (Windows 10), I ran the following command to produce the sample:

shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv

Note: if your CSV has a header, this is not the best solution.

Womanhater answered 3/11, 2019 at 21:32 Comment(0)

TL;DR

If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:

import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 10000
batch_size = 200

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk

Explanation

It's not always trivial to know the size of the input CSV file.

If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.

So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.

To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

First, we get the initial sample:

sample = sample_reader.get_chunk(sample_size)

Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the size of the chunk, but where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))

And use this reindexed chunk to replace (some of the) lines in the sample:

sample.loc[chunk.index] = chunk

After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.

To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).

Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size because the last chunk in the loop could be smaller.

On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be fewer lines in the file than sample_size. This is not a problem, because in that case there won't be any chunks left to loop over.
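To tie this back to the question's simple statistics: the reader above uses dtype=str, so a hedged sketch of summarising the sample would first convert the columns of interest (the column names here are placeholders):

numeric_cols = ["col_a", "col_b"]  # placeholder names; use your actual numeric columns
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(sample[numeric_cols].describe())  # count, mean, std, quantiles, ...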

Buddybuderus answered 6/5, 2020 at 9:27 Comment(0)

Read the data file:

import pandas as pd
df = pd.read_csv('data.csv')

First check the shape of df:

df.shape

Create a small sample of 1000 rows from df:

sample_data = df.sample(n=1000, replace=False)

# check the shape of sample_data
sample_data.shape
Irrefragable answered 12/7, 2020 at 23:24 Comment(0)

For example, if you have loan.csv, you can use this script to easily load a specified number of random items:

data = pd.read_csv('loan.csv').sample(10000, random_state=44)
Incorporation answered 20/3, 2020 at 1:0 Comment(0)

Let's say that you want to load a 20% sample of the dataset:

import pandas as pd
df = pd.read_csv(filepath).sample(frac=0.20)
Lanettelaney answered 9/4, 2020 at 15:23 Comment(1)
The question is about sampling from a big file, so this solution is not very useful, because pd.read_csv will load all of the file's rows into memory.Oligoclase
