Read last N rows of S3 parquet table

If I apply what was discussed here to read parquet files in an S3 bucket into a pandas DataFrame, particularly:

import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()

pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()

As the table grows over time and I need to make this retrieval regularly, I want to read only the last N rows into the DataFrame. Is this possible?

Donor answered 20/6, 2021 at 5:41

Comment from Diacritic: Please see this SO entry for an example using awswrangler.

Yes, this is entirely possible. S3 allows for partial object reads. Parquet files allow for partial reads based on row groups (and pyarrow exposes this capability). In addition, pyarrow allows for partial reads if you have multiple files (regardless of file format). However, these approaches place some requirements on how the input file(s) are created (see the aside at the bottom).

The easy way

The easiest approach is to use the newer datasets API (which is worth a read on its own and makes parts of the question you referenced obsolete) and filter on some kind of column.

import pyarrow.dataset as ds
from datetime import datetime, timedelta

two_days_ago = datetime.now() - timedelta(days=2)
dataset = ds.dataset('s3://your-bucket').to_table(filter=ds.field('sample_date') > two_days_ago)

The pyarrow datasets API supports "push down filters", which means the filter is pushed down into the reader layer. If the reader can reduce the amount of data read using the filter, it will. For simple filters like this, the parquet reader can optimize the read by first consulting the row group metadata, whose "statistics" section contains the min/max for each column.

However, that isn't quite "the last N rows", since it requires you to craft some kind of filter. If you have complete control over the data, you could add a row_num column and filter on that, provided you know the total number of rows (which you could store separately or read from the metadata; see below). A rough sketch of that idea follows.
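
As a rough sketch of that idea, assuming a monotonically increasing row_num column already exists in the data and that your pyarrow version provides Dataset.count_rows (both are assumptions, not something guaranteed by the question's setup):

import pyarrow.dataset as ds

N = 1000  # number of trailing rows wanted

# 's3://your-bucket/' is the same placeholder path as above
dataset = ds.dataset('s3://your-bucket/')

# count_rows should be answerable from parquet metadata alone,
# so it is much cheaper than a full scan
total_rows = dataset.count_rows()

# Push-down filter: only row groups whose row_num statistics overlap
# [total_rows - N, total_rows) need to be read
last_n = dataset.to_table(filter=ds.field('row_num') >= total_rows - N)
pandas_dataframe = last_n.to_pandas()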

The slightly less easy way

Alternatively, you can use ParquetFile, which has a metadata attribute. Accessing it only triggers a read of the metadata itself (not the whole file). From this you can learn how many row groups are in the file and how many rows each contains. You can use that to determine which row groups you need, and then use read_row_group or read_row_groups to read just those row groups (which will not trigger a full file read). A sketch of this approach is shown below.
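
For illustration, a minimal sketch of this approach, assuming the table lives in a single parquet file (the object key 'your-bucket/data.parquet' is just a placeholder):

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()
N = 1000  # number of trailing rows wanted

with s3.open('your-bucket/data.parquet', 'rb') as f:
    pf = pq.ParquetFile(f)
    metadata = pf.metadata  # reads only the footer, not the whole file

    # Walk row groups from the end until at least N rows are covered
    needed, rows_seen = [], 0
    for i in range(metadata.num_row_groups - 1, -1, -1):
        needed.insert(0, i)
        rows_seen += metadata.row_group(i).num_rows
        if rows_seen >= N:
            break

    table = pf.read_row_groups(needed)

# Trim any extra leading rows so exactly N remain
pandas_dataframe = table.to_pandas().tail(N)

The trade-off is that you may read up to one extra row group's worth of rows, which the final tail(N) then trims.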

Neither of these solutions is ideal. The first option requires you to have more control over the data and the second option requires you to do quite a bit of work yourself. The Arrow project is working towards simplifying this sort of operation (see, for example, ARROW-3705). However, this answer is based only on features that are available today.

One last aside: All of these approaches (and even any future approaches developed by Arrow) will require the data to either be stored as multiple files or multiple row groups. If you have one giant file stored as a single row group then there is not much that can be done. Parquet does not support partial row group reads.
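
If you do control how the data is written, here is a hedged sketch of producing a file with multiple row groups using pyarrow; the row_group_size value is an arbitrary example to tune for your data:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'row_num': list(range(100_000)),
                  'value': list(range(100_000))})

# Each row group becomes an independently readable chunk of the file,
# which is what enables the partial reads described above
pq.write_table(table, 'data.parquet', row_group_size=10_000)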

Essary answered 20/6, 2021 at 19:51

Comment from Ugrian: This is a good start, but doesn't quite give a complete solution to the N rows problem. Any advances since last year? :)

No, it is not possible with S3 alone. S3 is an object store, which lets you store, retrieve, update, etc. only whole objects, i.e. files.

Having said that, you should have a look at Athena, which is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. It should let you do what you want.

Best, Stefan

Ossy answered 20/6, 2021 at 13:12
