No module named 'pyarrow._orc'

I have a problem using the pyarrow.orc module in Anaconda on Windows 10.

import pyarrow.orc as orc

throws an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\apps\Anaconda3\envs\ws\lib\site-packages\pyarrow\orc.py", line 23, in <module>
    import pyarrow._orc as _orc
ModuleNotFoundError: No module named 'pyarrow._orc'

On the other hand, import pyarrow works without any issues.

conda list
# packages in environment at C:\apps\Anaconda3\envs\ws:
#
# Name                    Version                   Build  Channel
arrow-cpp                 0.13.0           py37h49ee12d_0
...
numpy                     1.17.3           py37h4ceb530_0
numpy-base                1.17.3           py37hc3f5095_0
...
pip                       19.3.1                   py37_0
pyarrow                   0.13.0           py37ha925a31_0
...
python                    3.7.5                h8c8aaf0_0
...

I've tried other versions of pyarrow with the same results.

conda -V
conda 4.7.12
Solvable answered 12/11, 2019 at 15:47 Comment(2)
Hi, I'm not sure this is limited to Windows 10; I have been getting the same error in AWS Sagemaker for the last few days. This was working fine before, on a previous Sagemaker instance. The conda_python3 kernel had pyarrow 0.13.0 installed from repo.anaconda.com/pkgs/main/linux-64, build py36he6710b0_0. - Erose
I'm currently getting this when I try to load a dask dataframe, python 3.7 OS X. - Scrivens

The ORC reader is not supported at all on Windows and, to my knowledge, never has been. Apache ORC in C++ is not yet known to build with the Visual Studio C++ compiler.
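One way to find out whether a particular pyarrow build ships the ORC extension is to attempt the import and catch the failure. The following is only a minimal sketch; the helper name `orc_available` is made up for illustration and is not part of pyarrow.

```python
import importlib


def orc_available() -> bool:
    """Return True if pyarrow.orc (and its compiled _orc extension) imports."""
    try:
        # ModuleNotFoundError is a subclass of ImportError, so this also
        # covers the "No module named 'pyarrow._orc'" case from the question.
        importlib.import_module("pyarrow.orc")
    except ImportError:
        return False
    return True


print("ORC support available:", orc_available())
```

A check like this lets calling code choose another reader instead of failing at import time.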

Prepossess answered 21/11, 2019 at 4:37 Comment(2)
This problem is not Windows-only. I have had the same issue on a Mac. - Multicolored
Posting for visibility: pyarrow.orc support was disabled in pip wheels because of linking issues, and there is a ticket (ARROW-7811) seeking community help for a fix. As a workaround on macOS/Linux, install a version earlier than 0.15.0 or install via conda. - Siward

Bottom line up front: I had the same error. This was the solution for me:

!pip install pyarrow==0.13.0

I'm not sure this is limited to Windows 10; I have been getting the same error in AWS Sagemaker for the last few days. This was working fine before, on a previous Sagemaker instance.

Using the Conda Packages menu in Jupyter, the conda_python3 kernel showed it had pyarrow 0.13.0 installed from https://repo.anaconda.com/pkgs/main/linux-64, build py36he6710b0_0.

However, a subsequent call to

!conda list

did not show pyarrow as being in the Jupyter conda_python3 kernel, even after restarting the kernel.

Normally in a Sagemaker [Jupyter notebook] instance, I use !pip commands because they seem to work better and don't hit the timeout errors I sometimes get with the Conda Packages menu. (Also, I don't need to worry about passing -y flags; the installs just happen.)
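When it is unclear which environment a notebook install ends up in, one common approach is to run pip through the same interpreter the kernel is using. This is a sketch under that assumption; the helper names `pip_install_command` and `install` are hypothetical, and the example only prints the command rather than running it.

```python
import subprocess
import sys
from typing import List


def pip_install_command(requirement: str) -> List[str]:
    """Build a pip command bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", requirement]


def install(requirement: str) -> None:
    """Run the install; raises CalledProcessError if pip exits non-zero."""
    subprocess.check_call(pip_install_command(requirement))


# Show the command without executing it.
print(" ".join(pip_install_command("pyarrow==0.13.0")))
```

Calling `install("pyarrow==0.13.0")` would then target the kernel's own site-packages rather than whichever pip happens to be first on the PATH.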

Normally !pip install pyarrow was working, but I noticed it was installing pyarrow 0.15.1 from Nov 1, 2019.

Perhaps there is an error in that version with loading the _orc package, or some other conflicting library.

My intuition is that something is wrong with the conda version of pyarrow 0.13.0, and with pyarrow 0.15.1.

In a Jupyter cell I tried this:

!pip uninstall pyarrow -y
!pip install pyarrow
from pyarrow import orc

Output:

Uninstalling pyarrow-0.15.1:
  Successfully uninstalled pyarrow-0.15.1
Collecting pyarrow
  Downloading https://files.pythonhosted.org/packages/6c/32/ce1926f05679ea5448fd3b98fbd9419d8c7a65f87d1a12ee5fb9577e3a8e/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl (59.2MB)
     |████████████████████████████████| 59.2MB 381kB/s  eta 0:00:01
Requirement already satisfied: numpy>=1.14 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pyarrow) (1.14.3)
Requirement already satisfied: six>=1.0.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pyarrow) (1.11.0)
Installing collected packages: pyarrow
Successfully installed pyarrow-0.15.1
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-6-36378dee5a25> in <module>()
      1 get_ipython().system('pip uninstall pyarrow -y')
      2 get_ipython().system('pip install pyarrow')
----> 3 from pyarrow import orc

~/anaconda3/envs/python3/lib/python3.6/site-packages/pyarrow/orc.py in <module>()
     23 from pyarrow import types
     24 from pyarrow.lib import Schema
---> 25 import pyarrow._orc as _orc
     26 
     27 

ModuleNotFoundError: No module named 'pyarrow._orc'

Note that when you try to uninstall pyarrow 0.15.1 and install a specific older version, like 0.13.0, you should restart the kernel after uninstalling. There are some incompatible binaries that get left behind. I did not post that output because it was so long.

!pip uninstall pyarrow -y

Restart Kernel, then:

!pip install pyarrow==0.13.0
from pyarrow import orc

Output:

Collecting pyarrow==0.13.0
  Using cached https://files.pythonhosted.org/packages/ad/25/094b122d828d24b58202712a74e661e36cd551ca62d331e388ff68bae91d/pyarrow-0.13.0-cp36-cp36m-manylinux1_x86_64.whl
Requirement already satisfied: numpy>=1.14 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pyarrow==0.13.0) (1.14.3)
Requirement already satisfied: six>=1.0.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pyarrow==0.13.0) (1.11.0)
Installing collected packages: pyarrow
Successfully installed pyarrow-0.13.0

There is now no error from the import command, and orc files can be read again.
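After a reinstall and kernel restart, it can help to confirm which pyarrow version the environment actually has. The sketch below reads the installed version from package metadata and compares it with 0.15.0, the cutoff mentioned in the comments on this page. The helper names are hypothetical, and the version comparison is only a heuristic: whether a wheel includes the _orc extension depends on how it was built.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.15.1' into (0, 15, 1), padding short versions with zeros."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def installed_pyarrow_version() -> Optional[str]:
    """Return the installed pyarrow version string, or None if absent."""
    try:
        return metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None


def predates_0_15(version: str) -> bool:
    """True if the version is older than 0.15.0."""
    return parse_version(version) < (0, 15, 0)


version = installed_pyarrow_version()
if version is None:
    print("pyarrow is not installed in this environment")
else:
    print(f"pyarrow {version}; predates 0.15.0: {predates_0_15(version)}")
```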

Erose answered 18/11, 2019 at 17:5 Comment(0)

The code below solved my issue on Windows. You need to install pyorc. It worked well with plain Python; no conda was needed. Please refer to the video https://www.youtube.com/watch?v=qvV_Frc6zB8 by Alfred Zhong, who explains it nicely.

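import pandas as pd
import pyorc

# Open the ORC file in binary mode; the context manager closes it afterwards.
with open('file1.orc', 'rb') as data:
    reader = pyorc.Reader(data)
    # Order the top-level field names by their ORC column id.
    columns = [
        name
        for _, name in sorted(
            (reader.schema.find_column_id(name), name)
            for name in reader.schema.fields
        )
    ]
    # Build the DataFrame while the file is still open; each row is a tuple.
    df = pd.DataFrame(reader, columns=columns)

df.to_csv('file1.csv')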
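If the same script has to run both where pyarrow ships ORC support and where it does not, the two approaches on this page could be combined. This is a sketch only, assuming pandas plus at least one of pyarrow or pyorc is installed; the function name `read_orc` is hypothetical.

```python
def read_orc(path):
    """Read an ORC file into a pandas DataFrame.

    Tries pyarrow.orc first and falls back to pyorc when the compiled
    pyarrow._orc extension is missing.
    """
    try:
        from pyarrow import orc
    except ImportError:
        orc = None  # pyarrow is absent or was built without ORC support

    if orc is not None:
        return orc.ORCFile(path).read().to_pandas()

    # Fallback path: raises ImportError if pyorc or pandas is not installed.
    import pandas as pd
    import pyorc

    with open(path, "rb") as data:
        reader = pyorc.Reader(data)
        # Order the top-level field names by their ORC column id.
        columns = [
            name
            for _, name in sorted(
                (reader.schema.find_column_id(name), name)
                for name in reader.schema.fields
            )
        ]
        return pd.DataFrame(reader, columns=columns)
```

Usage would be `df = read_orc('file1.orc')`. Note that the two libraries may map some ORC types to different pandas dtypes, so the resulting frames are not guaranteed to be identical.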
Durnan answered 23/2, 2022 at 7:33 Comment(1)
This is actually the only solution I found for win10.x64, pyarrow 8.0.0, python 3.10.x - Willpower
