Convert Pandas dataframe from/to ORC file

3

9

Is it possible to convert a Pandas dataframe from/to an ORC file? I can write the df to a Parquet file, but pandas doesn't seem to have ORC support. Is there an available solution in Python? If not, what would be the best strategy? One option could be converting the Parquet file to ORC using an external tool, but I have no clue where to find one.

Reign answered 6/11, 2019 at 11:2 Comment(2)
Are you using Hive or Spark (or both)? What you are trying to do is much easier, and less error-prone, if you have one of those. In particular, I strongly suggest you use Hive to manage your ORC files. You can connect to it from Python using the pyodbc or pyhive packages. - Geddes
@Reign I have just finished the ORC adapter in C++ and Python, so it is now possible to write ORC files if you use my fork: github.com/mathyingzhou/arrow. - Condescension
I
7

This answer is tested with pyarrow==4.0.1 and pandas==1.2.5.

To write, it first creates a pyarrow table using pyarrow.Table.from_pandas, then writes the ORC file using pyarrow.orc.write_table. To read, it uses pd.read_orc.

Read orc

import pandas as pd
import pyarrow.orc  # This prevents: AttributeError: module 'pyarrow' has no attribute 'orc'

df = pd.read_orc('/tmp/your_df.orc')

Write orc

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

# Here prepare your pandas df.

table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '/tmp/your_df.orc')

As of pandas==1.3.0, there isn't a DataFrame.to_orc writer yet.

Intermarry answered 16/7, 2021 at 22:20 Comment(1)
Do you have any idea whether it is possible to set the compression type while writing an ORC file using the solution you described? - Arella
H
5

To add to the answer above, Pandas v1.5.0 natively supports writing to ORC files. I'll update this with more documentation when it's released.

my_df.to_orc('myfile.orc')

Howey answered 7/6, 2022 at 16:28 Comment(0)
G
0

I have used pyarrow recently which has ORC support, although I've seen a few issues where the pyarrow.orc module is not being loaded.

pip install pyarrow

to use:

import pandas as pd
import pyarrow.orc as orc

# ORC is a binary format, so open the file in binary mode.
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()
Gaul answered 15/11, 2019 at 21:16 Comment(0)
