How to read an ORC file stored locally in Python Pandas?

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?

The filename is someFile.snappy.orc

I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark it throws an error for me.

Dreamy answered 19/10, 2018 at 9:33 Comment(0)

I haven't been able to find any great options; there are a few dead projects trying to wrap the Java reader. However, pyarrow does have an ORC reader that won't require you to use pyspark. It's a bit limited, but it works.

import pandas as pd
import pyarrow.orc as orc

# ORC is a binary format, so the file must be opened in binary mode
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()  # read() returns a pyarrow Table
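
Newer pyarrow builds that include ORC support also expose a one-call helper; a minimal sketch, assuming pyarrow 4.0+ and using the file name from the question:

import pyarrow.orc as orc

# read_table takes a path (or file object) and returns a pyarrow Table
table = orc.read_table('someFile.snappy.orc')
df = table.to_pandas()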
Clareta answered 4/12, 2018 at 16:42 Comment(5)
In my case I needed with open(filename, 'rb') as file: to avoid the decoding error pyarrow.lib.ArrowIOError: Arrow error: IOError: 'utf-8' codec can't decode byte 0xfe in position 11: invalid start byte.Yaakov
pyarrow works very well with Parquet but with ORC there seems to be some issues.Shufu
@Yaakov you should open the file with the 'rb' mode insteadOrthman
why does pyarrow not have module orc? Has that changed? @Rafal JanikLitch
Upon restarting a sagemaker instance, I also found the pyarrow._orc module to be missing. It was working before. ModuleNotFoundError Traceback (most recent call last) <ipython-input-17-07bf84f8f5db> in <module>() 1 get_ipython().system('pip install pyarrow') ----> 2 from pyarrow import orc ~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/orc.py in <module>() 23 from pyarrow import types 24 from pyarrow.lib import Schema ---> 25 import pyarrow._orc as _orc 26 27 ModuleNotFoundError: No module named 'pyarrow._orc'Tegan

In case import pyarrow.orc as orc does not work (it did not work for me on Windows 10), you can read the file into a Spark DataFrame and then convert it to a pandas DataFrame:

import findspark

findspark.init()  # must run before importing pyspark so Spark ends up on sys.path
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()  # collects all rows into driver memory
Discourse answered 9/7, 2019 at 12:12 Comment(0)

Starting from Pandas 1.0.0, there is a built-in function for this:

https://pandas.pydata.org/docs/reference/api/pandas.read_orc.html

import pandas as pd
import pyarrow.orc  # pandas delegates to pyarrow's ORC reader under the hood

df = pd.read_orc('/tmp/your_df.orc')

Be sure to read this warning about dependencies; this function might not work on Windows: https://pandas.pydata.org/docs/getting_started/install.html#install-warn-orc

If you want to use read_orc(), it is highly recommended to install pyarrow using conda.
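
If the file is large, note that read_orc can fetch just a subset of columns; a small sketch (the column names here are hypothetical):

import pandas as pd

# columns= restricts the read to the named columns
df = pd.read_orc('/tmp/your_df.orc', columns=['col_a', 'col_b'])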

Chiachiack answered 30/8, 2021 at 22:4 Comment(0)

The easiest way is to use pyorc:

import pyorc
import pandas as pd

with open(r"my_orc_file.orc", "rb") as orc_file:
    reader = pyorc.Reader(orc_file)
    orc_data = reader.read()    # all rows, as a list of tuples
    orc_schema = reader.schema  # the file's struct schema

columns = list(orc_schema.fields)  # field names, in schema order
df = pd.DataFrame(data=orc_data, columns=columns)
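
Because pyorc reads row by row, you can also iterate over the reader to stream a large file instead of materializing it all at once; a sketch under the same assumptions, where handle_row is a hypothetical callback:

import pyorc

with open(r"my_orc_file.orc", "rb") as orc_file:
    reader = pyorc.Reader(orc_file)
    for row in reader:   # each row is a tuple of column values
        handle_row(row)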
Transfuse answered 24/3, 2022 at 11:46 Comment(0)

ORC, like AVRO and PARQUET, is a format specifically designed for massive storage. You can think of them "like a csv": they are all files containing data, each with its own particular structure (different from CSV, or JSON of course!).
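
To make the "like a csv" analogy concrete, here is a minimal round-trip sketch, assuming pandas 1.5+ (for to_orc) with pyarrow installed:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df.to_orc('/tmp/example.orc')            # columnar binary file, unlike row-based CSV
print(pd.read_orc('/tmp/example.orc'))   # reads back into an equivalent DataFrame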

Reading an ORC file with pyspark should be easy, as long as your environment grants Hive support. To answer your question: I'm not sure that you will be able to read it in a local environment without Hive; I've never done it (you can do a quick test with the following code):

Loads ORC files, returning the result as a DataFrame.

Note: Currently ORC support is only available together with Hive support.

>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')

Hive is a data warehouse system that allows you to query your data on HDFS (a distributed file system) through MapReduce, like a traditional relational database (you create SQL-like queries, though it doesn't support 100% of the standard SQL features!).

Edit: Try the following to create a new Spark session. Not to be rude, but I suggest you follow one of the many PySpark tutorials in order to understand the basics of this "world". Everything will be much clearer.

import findspark
findspark.init()  # locate the local Spark installation before importing pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()
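
Once the session exists, reading the file from the question would look like this (a sketch; whether it works still depends on your local Spark build having ORC/Hive support):

df = spark.read.orc('someFile.snappy.orc')
df.printSchema()  # inspect the columns
df.show(5)        # peek at the first rows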
Biparty answered 19/10, 2018 at 9:59 Comment(2)
My example works with Spark; please note that Pandas is a different library (even if they both have their own DataFrame implementation, which causes confusion I guess). Spark is designed to work in a distributed way, Pandas for analysis on a single PC.Biparty
Spark has some overhead, as it needs to create a context (and pyspark is a large binary). I did this before, but I do not recommend it if other options are available.Goodden

I did not want to submit a Spark job to read local ORC files, nor did I have pandas. This worked for me.

import pyarrow.orc as orc

data_reader = orc.ORCFile("/path/to/orc/part_file.zstd.orc")
data = data_reader.read()   # returns a pyarrow Table
source = data.to_pydict()   # plain dict: column name -> list of values
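
The resulting dict is column-oriented; a quick, hypothetical way to peek at it (column names depend on your file):

for name, values in source.items():
    print(name, values[:5])  # first few values of each column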
Richert answered 31/1, 2022 at 16:56 Comment(0)
