I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3, update the Lambda, and test it (the deployment step is sketched below)
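For reference, the "send it to S3, update the Lambda" step is roughly the following (bucket, key and function names are placeholders, not my real ones):

```python
# Rough sketch of the deployment step; all names below are placeholders.
import boto3

# Upload the zipped deployment package to S3...
s3 = boto3.client("s3")
s3.upload_file("lambda_package.zip", "my-deploy-bucket", "lambda_package.zip")

# ...then point the Lambda function at the new package.
client = boto3.client("lambda")
client.update_function_code(
    FunctionName="my-parquet-test",
    S3Bucket="my-deploy-bucket",
    S3Key="lambda_package.zip",
)
```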
It seems that there are two possible approaches, both of which work locally in the docker container:
- fastparquet with s3fs: Unfortunately the unzipped size of the package is bigger than 256 MB, so I can't update the Lambda code with it (a sketch of the read attempt is included after this list).
- pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 (the call I use is also sketched after this list), and when it is executed in the Lambda function I get either:
- If I prefix the URI with S3 or S3N (as in the code example): in the Lambda environment I get `OSError: Passed non-file path: s3://mybucket/path/to/myfile` in pyarrow/parquet.py, line 848; locally I get `IndexError: list index out of range` in pyarrow/parquet.py, line 714.
- If I don't prefix the URI with S3 or S3N: it works locally (I can read the parquet data), but in the Lambda environment I get the same `OSError: Passed non-file path: s3://mybucket/path/to/myfile` in pyarrow/parquet.py, line 848.
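The pyarrow call is essentially the following, following the pattern from the PR linked above (bucket and key are placeholders); the two variants correspond to the two cases described above:

```python
# Minimal read attempt with pyarrow + s3fs; bucket/key are placeholders.
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Variant 1: URI prefixed with s3:// (as in the code example from the PR)
dataset = pq.ParquetDataset("s3://mybucket/path/to/myfile", filesystem=fs)

# Variant 2: bare bucket/key path, without the prefix
# dataset = pq.ParquetDataset("mybucket/path/to/myfile", filesystem=fs)

table = dataset.read()
print(table.num_rows)
```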
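For completeness, the fastparquet attempt looks roughly like this (again with placeholder names); it reads fine locally, the blocker is purely the package size:

```python
# Read attempt with fastparquet + s3fs; bucket/key are placeholders.
import fastparquet as fp
import s3fs

s3 = s3fs.S3FileSystem()
pf = fp.ParquetFile("mybucket/path/to/myfile", open_with=s3.open)
df = pf.to_pandas()
```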
My questions are:
- why do I get a different result in my docker container than I do in the Lambda environment?
- what is the proper way to give the URI?
- is there an accepted way to read Parquet files in S3 through AWS Lambda?
Thanks!