How to read parquet file with a condition using pyarrow in Python

I created a Parquet file with three columns (id, author, title) from a database table, and I want to read it back with a condition (title='Learn Python'). Below is the Python code I am using for this proof of concept.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyodbc


def write_to_parquet(df, out_path, compression='SNAPPY'):
    arrow_table = pa.Table.from_pandas(df)
    if compression == 'UNCOMPRESSED':
        compression = None
    pq.write_table(arrow_table, out_path, use_dictionary=False,
                   compression=compression)


def read_pyarrow(path, use_threads=False):
    # Older PyArrow releases took nthreads; current ones take use_threads.
    return pq.read_table(path, use_threads=use_threads).to_pandas()


path = './test.parquet'
sql = "SELECT * FROM [dbo].[Book] (NOLOCK)"

conn = pyodbc.connect(r'Driver={SQL Server};Server =.;Database = APP_BBG_RECN;Trusted_Connection = yes;')

df = pd.io.sql.read_sql(sql, conn)

write_to_parquet(df, path)

df1 = read_pyarrow(path)

How can I apply a condition (title='Learn Python') in the read_pyarrow function?

Sylvie answered 9/2, 2018 at 22:6 Comment(0)

Filters are now available in read_table:

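table = pq.read_table(
    path,  # pass the Parquet file path, not the DataFrame
    filters=[("title", "in", {"Learn Python"}),
             ("year", ">=", 1950)],  # "year" is illustrative; the question's file has no such column
)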
 
Edmund answered 27/1, 2021 at 16:37 Comment(0)

This is not yet supported. We intend to develop this functionality in the future. In the meantime, I recommend doing the filtering with pandas after converting the Arrow table to a DataFrame.

Durrett answered 15/2, 2018 at 22:48 Comment(2)
Any update? What is the current state as of 2018-07-22? - Episcopalism
arrow.apache.org/docs/python/generated/… See the filter section. Haven't tried it myself. - Cheriecherilyn
