Read Parquet file stored in S3 with AWS Lambda (Python 3)

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:

It seems that there are two possible approaches, which both work locally in the Docker container:

  1. fastparquet with s3fs: Unfortunately the unzipped size of the package exceeds 256 MB, so I can't update the Lambda code with it.
  2. pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and when executed with the lambda function I get either:

    • If I prefix the URI with S3 or S3N (as in the code example): in the Lambda environment I get OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848. Locally I get IndexError: list index out of range in pyarrow/parquet.py, line 714.
    • If I don't prefix the URI with S3 or S3N: it works locally (I can read the parquet data), but in the Lambda environment I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848. A rough sketch of the call I'm making follows after this list.
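
A minimal sketch of the read I'm attempting, assuming the s3fs-based approach from that pull request (the bucket and key below are placeholders):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# with the scheme prefix: OSError in Lambda, IndexError locally
dataset = pq.ParquetDataset("s3://mybucket/path/to/myfile", filesystem=fs)

# without the scheme prefix: works locally, but OSError in Lambda
dataset = pq.ParquetDataset("mybucket/path/to/myfile", filesystem=fs)
table = dataset.read()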

My questions are:

  • why do I get a different result in my docker container than I do in the Lambda environment?
  • what is the proper way to give the URI?
  • is there an accepted way to read Parquet files in S3 through AWS Lambda?

Thanks!

Sixteenth answered 26/12, 2017 at 22:22 Comment(0)

AWS has a project (AWS Data Wrangler) that handles this, with full Lambda Layers support.

In the Docs there is a step-by-step guide for doing it.

Code example:

import awswrangler as wr

# Write
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...")

Reference
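
As a rough illustration (not from the original answer), a Lambda handler using this library as a layer might look like the sketch below; the bucket paths are placeholders:

import awswrangler as wr

def lambda_handler(event, context):
    # read the input Parquet data from S3
    df = wr.s3.read_parquet(path="s3://my-bucket/input/")
    # ... process df here ...
    # write the result back to S3 as a Parquet dataset
    wr.s3.to_parquet(df=df, path="s3://my-bucket/output/", dataset=True)
    return {"rows": len(df)}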

Drive answered 10/1, 2020 at 13:47 Comment(0)

I was able to accomplish writing Parquet files to S3 using fastparquet. It's a little tricky, but my breakthrough came when I realized that to put together all the dependencies, I had to use the exact same Linux that Lambda uses.

Here's how I did it:

1. Spin up an EC2 instance using the Amazon Linux image that is used with Lambda

Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2

Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I looked for packages:

sudo yum list | grep python3

I installed:

python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64

2. Used the instructions from here to build a zip file with all of the dependencies that my script would use, by dumping them all into a folder and then zipping it with these commands:

mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . <any other dependencies>
# copy your Python handler file into this folder
# zip the folder contents and upload the archive to Lambda
zip -r9 ../parquet.zip .

Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, and the unzipped contents must stay under 250 MB. If anyone knows a better way to get dependencies into Lambda, please do share.

Source: Write parquet from AWS Kinesis firehose to AWS S3
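
As a rough sketch of the kind of write this setup enables (assuming s3fs is also bundled in the zip; the bucket, key, and dummy DataFrame below are placeholders, not from the original answer):

import pandas as pd
import s3fs
from fastparquet import write

def lambda_handler(event, context):
    # dummy data; in practice this would come from the event or another source
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    fs = s3fs.S3FileSystem()
    # hand fastparquet the s3fs opener so it writes straight to the bucket
    write("mybucket/path/to/output.parquet", df, open_with=fs.open)
    return {"rows_written": len(df)}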

Polychrome answered 8/6, 2018 at 19:57 Comment(1)
For the manylinux variant of pyarrow: if you keep only the *16 files, get rid of the other, larger copy of each lib, and then repackage them into a zip file, the Lambda package size comes in under the 250 MB that AWS Lambda accepts. Gelsemium

This was an environment issue (the Lambda was in a VPC without access to the bucket). Pyarrow is now working.
Hopefully the question itself gives a good-enough overview of how to make all of this work.
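
For reference, a minimal sketch of the read/write pattern that works for me with pyarrow + s3fs (bucket and key names are placeholders, and the actual processing is omitted):

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# read: pass the bucket/key path without an s3:// prefix when a filesystem is given
dataset = pq.ParquetDataset("mybucket/path/to/input", filesystem=fs)
df = dataset.read().to_pandas()

# ... process df ...

# write: open the target key through s3fs and hand the file object to pyarrow
table = pa.Table.from_pandas(df)
with fs.open("mybucket/path/to/output.parquet", "wb") as f:
    pq.write_table(table, f)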

Sixteenth answered 27/12, 2017 at 15:32 Comment(4)
Could you provide any more info on how you got this working? I keep running into an ImportError even though the pyarrow package is in my zip. It keeps saying pyarrow is required for parquet support. I am running on Python 2.7, so that could be the issue. Posturize
It might be, but it's difficult to diagnose without context. Your error does not look like a ModuleNotFoundError or ImportError though. I followed the links I gave to create my env. Roughly: docker run -it lambci/lambda:build-python3.6 bash; mkdir lambda; cd lambda; virtualenv ~/lambda; source ~/lambda/bin/activate; pip install pyarrow; pip install pandas; pip install s3fs; cd $VIRTUAL_ENV/lib/python3.6/site-packages; zip -r9 ~/lambda.zip .; [get the lambda.zip locally]; zip -ur ../lambda.zip lambda_function.py Sixteenth
Thanks for the reply @Ptah. I took a guess that it was an incompatibility with the Python 2.7 runtime in AWS Lambda and I was right. Once I upgraded the code to run on 3.6 and built the upgraded zip file pyarrow worked without a problem. Hopefully this helps others who run into this.Posturize
Could you give us a sample of the code? I wonder how you are writing the files and using the s3fs object... I'm trying with pyarrow.parquet.write_table() but it's not happening for me. Thanks! Polychrome

One can also achieve this through the AWS SAM CLI and Docker (the Docker requirement is explained below).

1. Create a directory and initialize sam

mkdir some_module_layer
cd some_module_layer
sam init

Typing the last command prompts a series of questions. One could choose the following answers (I'm working with Python 3.7, but other options are possible).

1 - AWS Quick Start Templates

8 - Python 3.7

Project name [sam-app]: some_module_layer

1 - Hello World Example

2. Modify requirements.txt file

cd some_module_layer
vim hello_world/requirements.txt

This opens the requirements.txt file in vim; on Windows you could instead type code hello_world/requirements.txt to edit the file in Visual Studio Code.

3. Add pyarrow to requirements.txt

Alongside pyarrow, it also works to include pandas and s3fs. Bundling pandas here avoids the problem of it not recognizing pyarrow as an engine for reading Parquet files.

pandas
pyarrow
s3fs

4. Build with a container

Docker is required to use the option --use-container when running the sam build command. If it's the first time, it will pull the lambci/lambda:build-python3.7 Docker image.

sam build --use-container
rm .aws-sam/build/HelloWorldFunction/app.py
rm .aws-sam/build/HelloWorldFunction/__init__.py
rm .aws-sam/build/HelloWorldFunction/requirements.txt

Notice that we're keeping only the Python libraries.

5. Zip files

cp -r .aws-sam/build/HelloWorldFunction/ python/
zip -r some_module_layer.zip python/

On Windows, it would work to run Compress-Archive python/ some_module_layer.zip.

6. Upload zip file to AWS

The following link is useful for this.
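
Once the layer is attached to a Lambda function, a minimal handler sketch like the one below should be able to read Parquet from S3 (the path is a placeholder, not from the original answer):

import pandas as pd

def lambda_handler(event, context):
    # s3fs lets pandas open s3:// paths directly, with pyarrow as the Parquet engine
    df = pd.read_parquet("s3://mybucket/path/to/myfile.parquet", engine="pyarrow")
    return {"rows": len(df)}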

Beforehand answered 2/6, 2020 at 1:52 Comment(2)
Can you run these commands on mac OS or does this approach need to be run on a Linux machine?Ergograph
@Powers, they work on Mac, Linux and Windows. Are you getting any errors?Beforehand
