Pyarrow read/write from s3

Asked 27/3, 2018 at 12:42 Answered 10/1, 2020 at 12:41

Is it possible to use pyarrow to read parquet files from one folder in S3 and write them to another folder, without converting to pandas along the way?

Here is my code:

import pyarrow.parquet as pq
import pyarrow as pa
import s3fs

s3 = s3fs.S3FileSystem()

bucket = 'demo-s3'

# Recent pyarrow releases take use_threads in place of the older nthreads argument.
df = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(use_threads=True).to_pandas()
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression='snappy')

Mortician asked 27/3, 2018 at 12:42 Comment(1)

Is there any reason not to use s3fs to copy the files? – Intend 26/6, 2018 at 16:51
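
The direct copy suggested in this comment could look roughly like the following. It is a minimal sketch rather than a tested recipe: `copy_prefix` and `target_key` are hypothetical helper names, and it assumes `fs` is an fsspec-style filesystem such as `s3fs.S3FileSystem()`, whose `find` lists every object under a prefix and whose `copy` copies a single object to a new key. Prefixes are given without the `s3://` scheme, since that is how s3fs reports keys.

```python
def target_key(source_key, source_prefix, target_prefix):
    """Map a key under source_prefix to the same relative key under target_prefix."""
    source_prefix = source_prefix.rstrip("/")
    target_prefix = target_prefix.rstrip("/")
    if not source_key.startswith(source_prefix + "/"):
        raise ValueError("{0!r} is not under {1!r}".format(source_key, source_prefix))
    relative = source_key[len(source_prefix) + 1:]
    return target_prefix + "/" + relative


def copy_prefix(fs, source_prefix, target_prefix):
    """Copy every object under source_prefix to target_prefix, keeping sub-paths.

    `fs` only needs find(prefix) and copy(src, dst); the parquet files are
    never decoded, so neither pandas nor pyarrow is involved.
    """
    copied = []
    for key in fs.find(source_prefix):
        destination = target_key(key, source_prefix, target_prefix)
        fs.copy(key, destination)
        copied.append(destination)
    return copied


# Usage (assumes s3fs is installed and AWS credentials are configured):
#   import s3fs
#   copy_prefix(s3fs.S3FileSystem(), "demo-s3/old", "demo-s3/new")
```

Because the files are copied as-is, partition directories such as `x=1/` are preserved, but options like `compression` or `use_dictionary` cannot be changed this way.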

If you do not wish to copy the files directly, it appears you can indeed avoid pandas thus:

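# Reuses pq, s3 and bucket from the question. read() returns a pyarrow.Table,
# so no pandas conversion happens. Recent pyarrow releases take use_threads
# in place of the older nthreads argument.
table = pq.ParquetDataset('s3://{0}/old'.format(bucket),
    filesystem=s3).read(use_threads=True)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket),
    filesystem=s3, use_dictionary=True, compression='snappy')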

Intend answered 26/6, 2018 at 16:56 Comment(0)

Why not just copy directly (S3 -> S3) and save memory and I/O?

import awswrangler as wr

SOURCE_PATH = "s3://..."
TARGET_PATH = "s3://..."

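# Current awswrangler releases also expect the list of objects to copy;
# wr.s3.list_objects returns every object under the source prefix.
wr.s3.copy_objects(
    paths=wr.s3.list_objects(SOURCE_PATH),
    source_path=SOURCE_PATH,
    target_path=TARGET_PATH
)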


Hesitate answered 10/1, 2020 at 12:41 Comment(0)
