Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.

I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.

I am able to load the data, and then select every 500th row.
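
For reference, the load-then-filter version I'm using looks roughly like this (the file name is a placeholder):

import pandas as pd

# read everything into memory, then keep every 500th row
df = pd.read_csv('data.csv')
sample = df.iloc[::500]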

My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the file first and then filter my data?

Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):

VendorID    tpep_pickup_datetime    tpep_dropoff_datetime   passenger_count trip_distance   RatecodeID  store_and_fwd_flag  PULocationID    DOLocationID    payment_type    fare_amount extra   mta_tax tip_amount  tolls_amount    improvement_surcharge   total_amount
0   1   2017-01-09 11:13:28 2017-01-09 11:25:45 1   3.30    1   N   263 161 1   12.5    0.0 0.5 2.00    0.00    0.3 15.30
1   1   2017-01-09 11:32:27 2017-01-09 11:36:01 1   0.90    1   N   186 234 1   5.0 0.0 0.5 1.45    0.00    0.3 7.25
2   1   2017-01-09 11:38:20 2017-01-09 11:42:05 1   1.10    1   N   164 161 1   5.5 0.0 0.5 1.00    0.00    0.3 7.30
3   1   2017-01-09 11:52:13 2017-01-09 11:57:36 1   1.10    1   N   236 75  1   6.0 0.0 0.5 1.70    0.00    0.3 8.50
4   2   2017-01-01 00:00:00 2017-01-01 00:00:00 1   0.02    2   N   249 234 2   52.0    0.0 0.5 0.00    0.00    0.3 52.80
Posology answered 17/12, 2018 at 9:17 Comment(2)
For the first question, I believe the skiprows argument of read_csv will come in handy. – Lemma
pd.read_csv(..., skiprows=lambda i: i % n) – Thibault

Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the file first and then filter my data?

Something you can do is use the skiprows parameter of read_csv, which accepts a list-like of row indices to discard (and thus, by elimination, to select). You can create an np.arange with a length equal to the number of rows to read, then remove every 500th element from it with np.delete, so that only every 500th row is actually read:

import numpy as np
import pandas as pd

n_rows = 9_500_000  # must be an int: np.delete rejects float indices (9.5e6 raises IndexError)
skip = np.arange(n_rows)                            # candidate line numbers 0 .. n_rows - 1
skip = np.delete(skip, np.arange(0, n_rows, 500))   # un-skip every 500th line
df = pd.read_csv('my_file.csv', skiprows=skip)      # line 0 (the header) is kept
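
If you don't know the row count up front, skiprows also accepts a callable, as the comments suggest; a minimal sketch of that variant (not part of the original answer):

import pandas as pd

# pandas calls the function with each physical line number and skips the
# line when the result is truthy, so line 0 (the header) and every 500th
# line are kept without materialising a 9.5-million-element index array
df = pd.read_csv('my_file.csv', skiprows=lambda i: i % 500 != 0)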
Islander answered 17/12, 2018 at 9:30 Comment(2)
Quite useful. To reduce the number of records by less than one half, the deletion step could occur more than once. – Seadon
np.delete() generated an error when I tried to execute the code: IndexError: arrays used as indices must be of integer (or boolean) type – Bubaline

Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read the file first and then filter my data?

First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:

# https://mcmap.net/q/45407/-how-to-get-the-line-count-of-a-large-file-cheaply-in-python
import numpy as np
import pandas as pd

def file_len(fname):
    # count the lines in the file, header included
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

# skip every line number except 0 (the header) and the multiples of 500
skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
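
If counting the lines is itself slow on a ~9.5-million-row file, a buffered byte count is usually faster than enumerating decoded lines; a sketch (assuming the file ends with a newline):

def file_len_fast(fname, chunk_size=1 << 20):
    # count newline bytes in 1 MiB binary chunks instead of decoding lines
    with open(fname, 'rb') as f:
        return sum(chunk.count(b'\n')
                   for chunk in iter(lambda: f.read(chunk_size), b''))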

How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.

The idea is to read only the datetime column via the usecols parameter, sort it and select every 500th index value, take the set difference, and pass that to skiprows again:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

# number of data rows; the first line of the file is the header
n_data = file_len('test.csv') - 1

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

# index labels of every 500th row in datetime order
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                 .iloc[np.arange(0, n_data, 500)].index)

# skiprows counts physical file lines, so shift the labels by 1 to account
# for the header line, and never skip line 0 (the header itself)
skipped = np.setdiff1d(np.arange(1, n_data + 1), sorted_idx + 1)
print(skipped)

df = (pd.read_csv('test.csv', skiprows=skipped,
                  parse_dates=['tpep_pickup_datetime'])
        .sort_values(by=['tpep_pickup_datetime']))
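
A quick sanity check (a sketch) that the sampled frame really came out in pickup-time order:

# holds because df was sorted by tpep_pickup_datetime above
assert df['tpep_pickup_datetime'].is_monotonic_increasing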
Unwrap answered 17/12, 2018 at 9:31 Comment(0)

Use a lambda with skiprows:

pd.read_csv(path, skiprows=lambda i: i % N)

to read only every Nth row (lines where i % N is non-zero are skipped; line 0, the header, is always kept).

source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Thibault answered 28/9, 2022 at 20:54 Comment(0)

You can use the csv module, which returns an iterator over rows, together with itertools.cycle to select every nth row.

import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# yields True on the first of every cycle_size rows, False otherwise
chooser = (x == 0 for x in cycle(range(cycle_size)))

with open(source_file) as f1:
    rdr = csv.reader(f1)
    # zip pairs each row with its pick flag; keep only the flagged rows
    data = [row for pick, row in zip(chooser, rdr) if pick]
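
To get back to a DataFrame, note that row 0 of the file is always picked, so data[0] is the header row; a sketch:

import pandas as pd

# first picked row is the header, the rest are every 500th data row
df = pd.DataFrame(data[1:], columns=data[0])

itertools.islice(rdr, 0, None, 500) would achieve the same stride with less machinery.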
Assent answered 27/3, 2022 at 15:31 Comment(0)
