Use pyarrow in Glue pythonshell - ModuleNotFoundError: No module named 'pyarrow.lib'

2

3

I created an egg and a whl file of pyarrow and uploaded them to S3 so that I could use them in a Python shell job. I received the error below.

Job code:

import pyarrow
raise

Error (the whl produces the same traceback):

Traceback (most recent call last):
  File "/tmp/runscript.py", line 118, in <module>
    runpy.run_path(temp_file_path, run_name='__main__')
  File "/usr/local/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/glue-python-scripts-e67xuz2j/genos.py", line 1, in <module>
  File "/glue/lib/installation/kanna-0.1-py3.6.egg/pyarrow/__init__.py", line 49, in <module>
    from pyarrow.lib import cpu_count, set_cpu_count
ModuleNotFoundError: No module named 'pyarrow.lib'

P.S. I cannot find a lib.py file or a lib folder in my local files.

Minimus answered 3/3, 2020 at 17:47 Comment(3)
pyarrow doesn't work with egg files; use either a wheel or an sdist of it. - Flowered
I already tried with a whl and got the same error. - Minimus
I have the exact same issue. My Python 3.6 egg has pyarrow installed, but I get this ModuleNotFoundError. - Walkerwalkietalkie
2

I was having the same problem with AWS Lambda and came across this question.

For Glue, the AWS docs state that only pure Python libraries can be used.

For Lambda:

The underlying problem is that modules like pyarrow wrap C/C++ code. If you check the pyarrow codebase, you will find that two pyarrow.lib source files do exist, but they have .pyx and .pxd file extensions. These are Cython sources, not pure Python: they are compiled into a native extension module, so the built package depends on the Python version and the underlying CPU architecture.
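To check whether a given copy of pyarrow actually ships a compiled lib module that the running interpreter can load, something like the following might help. It is a minimal sketch: the function name has_loadable_extension is made up for illustration, and it only looks for a file named lib plus one of the extension suffixes the current interpreter accepts.

```python
import importlib.machinery
from pathlib import Path


def has_loadable_extension(package_dir, module_name="lib"):
    """Return True if package_dir holds a compiled module this interpreter can import.

    The import system looks for module_name followed by one of the suffixes in
    importlib.machinery.EXTENSION_SUFFIXES (on Linux CPython 3.9 that includes
    '.cpython-39-x86_64-linux-gnu.so'). A .pyx or .pxd file never matches.
    """
    candidates = [
        module_name + suffix
        for suffix in importlib.machinery.EXTENSION_SUFFIXES
    ]
    return any((Path(package_dir) / name).is_file() for name in candidates)
```

Pointing it at the pyarrow directory from the traceback (or at an unpacked wheel) and getting False would be consistent with the "No module named 'pyarrow.lib'" error: the binary was either never packaged or was built for a different Python version or platform.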

I had to manually download the .whl files for the required versions of pyarrow and its dependency NumPy. From http://pypi.org/project/pyarrow/, click on Download files and find the file matching your version. cp39 means CPython 3.9, and x86_64 is the CPU architecture. Follow the same steps for NumPy. I ended up downloading these files: pyarrow-8.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl and numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
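The tags in those filenames can also be read programmatically. The sketch below splits a wheel filename into its parts following the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag). The helper names parse_wheel_tags and matches_cpython are hypothetical, and the check only handles CPython-specific tags such as cp39.

```python
import sys


def parse_wheel_tags(filename):
    """Split a wheel filename into name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # 5 parts normally, 6 when an optional build tag follows the version.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        # A wheel may list several platform tags joined by dots.
        "platforms": platform_tag.split("."),
    }


def matches_cpython(tags, version=None):
    """True if the wheel's Python tag names the given CPython (major, minor)."""
    major, minor = version if version is not None else sys.version_info[:2]
    return tags["python"] == "cp{}{}".format(major, minor)
```

For the pyarrow file above, parse_wheel_tags reports the Python tag cp39, so matches_cpython(tags, (3, 9)) is true, while a Python 3.6 runtime would not match.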

You then have to unzip them and create an archive where both sit together in a folder named python (lowercase). This archive can be used to create a layer in Lambda. Attach this layer to your function and import pyarrow should work.
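The unzip-and-repack step can be scripted, since a wheel is itself a zip archive. This is only a sketch of the idea: build_layer and the file names in the usage note are hypothetical, and setting world-readable permissions on every entry is my assumption about what the Lambda runtime needs in order to read the files.

```python
import zipfile


def build_layer(wheel_paths, layer_zip):
    """Copy the contents of several wheels into one zip under python/."""
    seen = set()
    with zipfile.ZipFile(layer_zip, "w") as layer:
        for wheel in wheel_paths:
            with zipfile.ZipFile(wheel) as src:
                for info in src.infolist():
                    # Skip directory entries and names already copied.
                    if info.is_dir() or info.filename in seen:
                        continue
                    seen.add(info.filename)
                    target = zipfile.ZipInfo(
                        "python/" + info.filename, date_time=info.date_time
                    )
                    target.compress_type = zipfile.ZIP_DEFLATED
                    # Assumption: mark files world-readable (rw-r--r--).
                    target.external_attr = 0o644 << 16
                    layer.writestr(target, src.read(info))
    return layer_zip
```

Calling build_layer with the two downloaded wheel paths and an output name such as "pyarrow-layer.zip" would produce an archive whose top-level folder is python/, which can then be uploaded as a layer.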

The other solution is to use custom Docker images. This worked for me as well. I believe the AWS docs are exhaustive on that topic. I have written a PoC and all the steps that I followed here.

I followed this guide for creating a pyarrow layer.

Carlo answered 22/6, 2022 at 6:14 Comment(1)
This was helpful, but I was unable to add this layer along with the AWSSDKPandas layer: the size limit was exceeded. Any idea how we can take just parts of pyarrow? I only want to use read_feather and write_feather with lz4 compression. - Samba
0

pyarrow won't work with Glue as is, because it needs C support and Glue doesn't provide it. What you could try is installing the library on your local machine, packaging it manually, and then using that egg file. That worked for my colleague; I haven't tested it personally.

Highlight answered 11/8, 2020 at 15:24 Comment(0)
