Yes, this is entirely possible. S3 allows for partial object reads. Parquet files allow for partial reads based on row groups (and pyarrow exposes this capability). In addition, pyarrow allows for partial reads if the data is split across multiple files (regardless of file format). However, these approaches do put some requirements on how the input file(s) are created (see the aside at the bottom).
The easy way
The easiest thing will be to use the newer datasets API (which is worth a read on its own and supersedes some of the approaches in the question you referenced) and filter on some kind of column.
import pyarrow.dataset as ds
from datetime import datetime, timedelta

# Keep only rows from the last two days; 'sample_date' is a column in your data
two_days_ago = datetime.now() - timedelta(days=2)
dataset = ds.dataset('s3://your-bucket').to_table(filter=ds.field('sample_date') > two_days_ago)
The pyarrow datasets API supports "push down filters", which means that the filter is pushed down into the reader layer. If the reader is capable of reducing the amount of data read using the filter then it will. For simple filters like this the parquet reader can optimize reads by looking first at the row group metadata, whose "statistics" section contains the min/max for each column.
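If you want to see what the reader consults, you can inspect those statistics yourself. This is a minimal sketch; the file name and the choice of column index are placeholders for your own data:

import pyarrow.parquet as pq

# Opening a ParquetFile fetches only the footer metadata, not the data pages
pf = pq.ParquetFile('your-data.parquet')
for i in range(pf.metadata.num_row_groups):
    col = pf.metadata.row_group(i).column(0)  # first column's chunk metadata
    if col.statistics is not None:
        print(i, col.statistics.min, col.statistics.max)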
However, that isn't quite "the last N rows", as it requires you to craft some kind of filter. If you have complete control over the data then you could add a row_num column at write time and filter on that, provided you know the total number of rows (which you could store separately or read from the file metadata; see below).
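Here is a hedged sketch of that idea. It assumes you wrote a monotonically increasing row_num column (the column name, file path, and N are all placeholders), and it pulls the total row count from the file metadata:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

N = 1000  # how many trailing rows you want
total_rows = pq.ParquetFile('your-data.parquet').metadata.num_rows

# The filter is pushed down, so row groups whose row_num statistics fall
# entirely below the cutoff can be skipped by the reader
last_n = ds.dataset('your-data.parquet').to_table(
    filter=ds.field('row_num') >= total_rows - N
)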
The slightly less easy way
Alternatively, you can use ParquetFile, which has a metadata attribute. Accessing this will only trigger a read of the metadata itself (not the whole file). From this you can get information such as how many row groups are in the file and how many rows each contains. You can use this to determine which row groups you need, and then use read_row_group or read_row_groups to access just those row groups (this will not trigger a full file read).
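As a sketch of that bookkeeping (the file name and N are placeholders), walking the row groups from the back until they cover N rows:

import pyarrow.parquet as pq

N = 1000
pf = pq.ParquetFile('your-data.parquet')  # triggers a footer-only read

# Collect row group indices from the end until they cover at least N rows
needed, covered = [], 0
for i in reversed(range(pf.metadata.num_row_groups)):
    needed.append(i)
    covered += pf.metadata.row_group(i).num_rows
    if covered >= N:
        break

# Read only those row groups, then trim any excess leading rows
table = pf.read_row_groups(sorted(needed))
last_n = table.slice(max(0, table.num_rows - N))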
Neither of these solutions is ideal. The first option requires you to have more control over the data and the second option requires you to do quite a bit of work yourself. The Arrow project is working towards simplifying this sort of operation (see, for example, ARROW-3705). However, this answer is based only on features that are available today.
One last aside: all of these approaches (and even any future approaches developed by Arrow) will require the data to be stored as either multiple files or multiple row groups. If you have one giant file stored as a single row group then there is not much that can be done, because Parquet does not support partial row group reads.
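If you do control the writer, capping the row group size at write time keeps partial reads possible. row_group_size is an option on pyarrow.parquet.write_table; the sizes here are only illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

# One million rows split into ten row groups of 100,000 rows each
table = pa.table({'row_num': list(range(1_000_000))})
pq.write_table(table, 'your-data.parquet', row_group_size=100_000)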