How to install Poppler to be used on AWS Lambda

I need to run pdf2image in a Python Lambda function on AWS, but it requires poppler and poppler-utils to be installed on the machine.

I have searched in many places for how to do that, but could not find anyone who has done it with Lambda functions.

Does anyone know how to build the poppler binaries, include them in my Lambda package, and tell Lambda to use them?

Thank you all.

Delenadeleon answered 20/11, 2018 at 23:58 Comment(0)
S
4

AWS Lambda runs your code in an execution environment that ships with a fixed set of software and libraries. If something you need is not included, you have to supply it yourself. See the link below for more information: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

For poppler, follow these steps to build your own binaries: https://github.com/skylander86/lambda-text-extractor/blob/master/BuildingBinaries.md

Sketchy answered 21/11, 2018 at 8:48 Comment(2)
Is there any implementation of this procedure? How do you tell Lambda to pick up the binaries? - Tinstone
You won't have to tell Lambda anything; just make sure the libraries are imported in your code, then create the object and use it. - Sketchy
T
5

I used the pre-built AWS Lambda layer https://github.com/jeylabs/aws-lambda-poppler-layer/releases and it worked!

You can use this solution if you just want to run the function, but if you want to pin the version and have more control, I recommend the container image solution.

Thermy answered 25/4, 2022 at 6:35 Comment(0)
F
4

My approach was to use the Amazon Linux 2 image as a base to ensure maximum compatibility with the Lambda environment, compile openjpeg and poppler in the container build, and build a zip containing the required binaries and libraries, which can then be used as a layer.

This lets you write your code in its own Lambda function, which pulls in the poppler dependencies as a layer, simplifying build and deployment.

The contents of the layer are unpacked into /opt/. They are then available automatically because, by default in the Lambda environment:

  • $PATH is /usr/local/bin:/usr/bin/:/bin:/opt/bin
  • $LD_LIBRARY_PATH is /lib64:/usr/lib64:$LAMBDA_RUNTIME_DIR:$LAMBDA_RUNTIME_DIR/lib:$LAMBDA_TASK_ROOT:$LAMBDA_TASK_ROOT/lib:/opt/lib
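To check the point above from inside a function, you can ask Python where it would find the poppler tools on a given search path. This is only a minimal sketch and not part of the original answer: the helper names `locate_tools` and `missing_tools` are made up for illustration, and the default path string is simply the value quoted above, which may differ in your runtime.

```python
import os
import shutil

# PATH value quoted in the answer above; an assumption, verify it in your runtime.
LAMBDA_DEFAULT_PATH = "/usr/local/bin:/usr/bin/:/bin:/opt/bin"


def locate_tools(names, search_path=None):
    """Map each tool name to its resolved path, or None if it is not found."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    return {name: shutil.which(name, path=search_path) for name in names}


def missing_tools(names, search_path=None):
    """Return the sorted names that cannot be found on the search path."""
    found = locate_tools(names, search_path)
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    # Inside Lambda, an empty list here suggests the layer's bin/ folder is visible.
    print(missing_tools(["pdftoppm", "pdfinfo", "pdftocairo"]))
```

If the layer was unpacked as described, the tools should resolve to paths under /opt/bin; a non-empty result from `missing_tools` points to a packaging problem rather than a pdf2image problem.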

Dockerfile :

# https://www.petewilcock.com/using-poppler-pdftotext-and-other-custom-binaries-on-aws-lambda/

ARG POPPLER_VERSION="21.10.0"
ARG POPPLER_DATA_VERSION="0.4.11"
ARG OPENJPEG_VERSION="2.4.0"


FROM amazonlinux:2

ARG POPPLER_VERSION
ARG POPPLER_DATA_VERSION
ARG OPENJPEG_VERSION

WORKDIR /root

RUN yum update -y
RUN yum install -y \
   cmake \
   cmake3 \
   fontconfig-devel \
   gcc \
   gcc-c++ \
   gzip \
   libjpeg-devel \
   libpng-devel \
   libtiff-devel \
   make \
   tar \
   xz \
   zip

RUN curl -o poppler.tar.xz https://poppler.freedesktop.org/poppler-${POPPLER_VERSION}.tar.xz
RUN tar xf poppler.tar.xz
RUN curl -o poppler-data.tar.gz https://poppler.freedesktop.org/poppler-data-${POPPLER_DATA_VERSION}.tar.gz
RUN tar xf poppler-data.tar.gz
RUN curl -o openjpeg.tar.gz https://codeload.github.com/uclouvain/openjpeg/tar.gz/refs/tags/v${OPENJPEG_VERSION}
RUN tar xf openjpeg.tar.gz

WORKDIR poppler-data-${POPPLER_DATA_VERSION}
RUN make install

WORKDIR /root
RUN mkdir openjpeg-${OPENJPEG_VERSION}/build
WORKDIR openjpeg-${OPENJPEG_VERSION}/build
RUN cmake .. -DCMAKE_BUILD_TYPE=Release
RUN make
RUN make install

WORKDIR /root
RUN mkdir poppler-${POPPLER_VERSION}/build
WORKDIR poppler-${POPPLER_VERSION}/build
RUN cmake3 .. -DCMAKE_BUILD_TYPE=release -DBUILD_GTK_TESTS=OFF -DBUILD_QT5_TESTS=OFF -DBUILD_QT6_TESTS=OFF \
    -DBUILD_CPP_TESTS=OFF -DBUILD_MANUAL_TESTS=OFF -DENABLE_BOOST=OFF -DENABLE_CPP=OFF -DENABLE_GLIB=OFF \
    -DENABLE_GOBJECT_INTROSPECTION=OFF -DENABLE_GTK_DOC=OFF -DENABLE_QT5=OFF -DENABLE_QT6=OFF \
    -DENABLE_LIBOPENJPEG=openjpeg2 -DENABLE_CMS=none  -DBUILD_SHARED_LIBS=OFF
RUN make
RUN make install


WORKDIR /root
RUN mkdir -p package/{lib,bin,share}
RUN cp -d /usr/lib64/libexpat* package/lib
RUN cp -d /usr/lib64/libfontconfig* package/lib
RUN cp -d /usr/lib64/libfreetype* package/lib
RUN cp -d /usr/lib64/libjbig* package/lib
RUN cp -d /usr/lib64/libjpeg* package/lib
RUN cp -d /usr/lib64/libpng* package/lib
RUN cp -d /usr/lib64/libtiff* package/lib
RUN cp -d /usr/lib64/libuuid* package/lib
RUN cp -d /usr/lib64/libz* package/lib
RUN cp -rd /usr/local/lib/* package/lib
RUN cp -rd /usr/local/lib64/* package/lib
RUN cp -d /usr/local/bin/* package/bin
RUN cp -rd /usr/local/share/poppler package/share

WORKDIR package
RUN zip -r9 ../package.zip *

And to run...

docker build -t poppler .
docker run --name poppler -d -t poppler cat
docker cp poppler:/root/package.zip .

Then upload package.zip as a layer using the console or the AWS CLI.
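If you would rather script the upload than use the console, something along these lines should work with boto3. Treat it as a sketch under assumptions: the layer name "poppler", the description text, and the "python3.9" runtime are placeholders I picked, and `build_publish_request` and `publish_layer` are hypothetical helper names. boto3 is imported lazily so the request-building part can be tried without AWS credentials.

```python
from pathlib import Path


def build_publish_request(zip_path, layer_name, runtimes):
    """Assemble keyword arguments for the Lambda publish_layer_version call."""
    return {
        "LayerName": layer_name,
        "Description": "Poppler binaries and shared libraries",  # placeholder text
        "Content": {"ZipFile": Path(zip_path).read_bytes()},
        "CompatibleRuntimes": list(runtimes),
    }


def publish_layer(zip_path, layer_name="poppler", runtimes=("python3.9",)):
    """Publish the zip as a new layer version and return its ARN."""
    import boto3  # lazy import: only needed when actually talking to AWS

    client = boto3.client("lambda")
    request = build_publish_request(zip_path, layer_name, runtimes)
    response = client.publish_layer_version(**request)
    return response["LayerVersionArn"]


if __name__ == "__main__":
    print(publish_layer("package.zip"))
```

For a large archive it may be better to upload the zip to S3 and pass `S3Bucket` and `S3Key` in `Content` instead of `ZipFile`.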

Faience answered 21/10, 2021 at 8:35 Comment(2)
The accepted answer unfortunately uses version 0.59 from 2017, and does not work for more recent versions. This answer works if the following line is added to the copy statements: RUN cp -d /usr/lib64/libbz* package/lib - Vigilante
For some PDFs, some text that other PDF viewers display is not rendered, probably a fonts issue. It works with github.com/jeylabs/aws-lambda-poppler-layer though. - Vigilante
S
1

Straightforward Build Instructions for Poppler on Lambda using Docker

To put Poppler on Lambda, we will build a zip archive containing poppler and add it as a layer. Follow these steps on an EC2 instance running Amazon Linux 2 (a t2.micro is plenty).

  1. Set up the machine

Install Docker on the EC2 machine. Instructions here

mkdir -p poppler_binaries

  2. Create a Dockerfile

Use this link or copy/paste from below.

FROM ubuntu:18.04

# Installing dependencies
RUN apt-get update && apt-get install -y locate \
                       libopenjp2-7 \
                       poppler-utils

RUN rm -rf /poppler_binaries;  mkdir /poppler_binaries;
RUN updatedb
RUN cp $(locate libpoppler.so) /poppler_binaries/.
RUN cp $(which pdftoppm) /poppler_binaries/.
RUN cp $(which pdfinfo) /poppler_binaries/.
RUN cp $(which pdftocairo) /poppler_binaries/.
RUN cp $(locate libjpeg.so.8 ) /poppler_binaries/.
RUN cp $(locate libopenjp2.so.7 ) /poppler_binaries/.
RUN cp $(locate libpng16.so.16 ) /poppler_binaries/.
RUN cp $(locate libz.so.1 ) /poppler_binaries/.

  3. Build the Docker image and create a zip file

Running the commands below will produce a zip file in your home directory.

docker build -t poppler-build .
# Run the container
docker run -d --name poppler-build-cont poppler-build sleep 20 
#docker exec poppler-build-cont 
sudo docker cp poppler-build-cont:/poppler_binaries .
# Cleaning up
docker kill poppler-build-cont
docker rm poppler-build-cont
docker image rm poppler-build
cd poppler_binaries
zip -r9 ../poppler.zip .
cd ..

  4. Make and add your Lambda Layer

Download your zip file or upload it to S3. Head to the Lambda Console page to create a Layer and then add it to your function. Information about layers here.
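The console works fine for this, but if you want to attach the layer from a script, here is a rough boto3 sketch. The helper names (`layer_name_of`, `merged_layers`, `attach_layer`) are hypothetical. The one thing worth knowing is that the `Layers` argument of `update_function_configuration` replaces the whole list, so the sketch reads the current layers first and swaps in the new version.

```python
def layer_name_of(arn):
    """Strip the trailing version number from a layer version ARN."""
    return arn.rsplit(":", 1)[0]


def merged_layers(existing_arns, new_arn):
    """Drop older versions of the same layer, keep the rest, append the new ARN."""
    kept = [arn for arn in existing_arns if layer_name_of(arn) != layer_name_of(new_arn)]
    return kept + [new_arn]


def attach_layer(function_name, new_arn):
    """Attach new_arn to the function without discarding its other layers."""
    import boto3  # lazy import: only needed when actually talking to AWS

    client = boto3.client("lambda")
    config = client.get_function_configuration(FunctionName=function_name)
    existing = [layer["Arn"] for layer in config.get("Layers", [])]
    client.update_function_configuration(
        FunctionName=function_name,
        Layers=merged_layers(existing, new_arn),
    )
```

Call `attach_layer` with your function name and the layer version ARN shown on the layer's page in the console.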

  5. Add Environment Variable to Lambda

To avoid adding unnecessary folder structure to the zip (as described here), we will add an environment variable that points to our dependency:

PYTHONPATH: /opt/
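Because the Dockerfile above copies the binaries and the .so files straight into poppler_binaries, the layer presumably unpacks them directly under /opt rather than under /opt/bin and /opt/lib. If the tools are not found at runtime, one thing to try is calling them with that folder prepended to PATH and LD_LIBRARY_PATH. This is a sketch under that assumption, with hypothetical helper names `build_poppler_env` and `poppler_version`; it is not something the original steps require.

```python
import os
import subprocess


def build_poppler_env(base_env, poppler_dir="/opt"):
    """Return a copy of base_env with poppler_dir first on PATH and LD_LIBRARY_PATH."""
    env = dict(base_env)
    for key in ("PATH", "LD_LIBRARY_PATH"):
        current = env.get(key, "")
        env[key] = poppler_dir + (os.pathsep + current if current else "")
    return env


def poppler_version(poppler_dir="/opt"):
    """Run `pdfinfo -v` from poppler_dir and return whatever banner it prints."""
    result = subprocess.run(
        [os.path.join(poppler_dir, "pdfinfo"), "-v"],
        env=build_poppler_env(os.environ, poppler_dir),
        capture_output=True,
        text=True,
    )
    # The poppler tools usually print their version banner on stderr.
    return result.stderr or result.stdout
```

Logging `poppler_version()` once at cold start is a cheap way to confirm that the binary and its shared libraries actually load.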

And voilà! You now have a working Lambda function with Poppler!

Note: Credit to these two articles which helped me piece this together

Warning: do not try to add pdf2image to the same layer. I am not sure why but when they are in the same layer, pdf2image cannot find poppler.

Susan answered 3/4, 2020 at 14:30 Comment(3)
Hi @Alex Albracht, thanks for the reply...I will try to run this just so I can accept this as an answer. Really detailed! Thanks for that! - Delenadeleon
@Delenadeleon that'd be awesome! Tried my best to proofread and test it, let me know if you run into any issues or errors - Susan
This is a good baseline, but I had to make a lot of changes to make it work for me. Lambda uses Yum now for example, and instead of libjpeg.so.8 I just went with libjpeg.so, things like that. And added additional binaries. But it gave me an outline, so thanks. - Sumo
F
1

Hi @Alex Albracht, thanks for compiling these easy instructions! They helped a lot. But I really struggled to get the Lambda function to find the poppler path, so I'll try to add that part here and make it clear.

The binaries should go in a zip archive with the structure poppler.zip -> bin/poppler, where the poppler folder contains the binary files. This zip archive can then be uploaded as a layer in AWS Lambda.
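Before uploading, it can save a deploy cycle to check that the archive really has that layout. Here is a small sketch using only the standard library; the function names and the list of required tools are my own choices, so adjust them to the binaries you actually packaged.

```python
import zipfile


def entries_under(zip_path, prefix="bin/poppler/"):
    """List the file names stored below prefix inside the zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name[len(prefix):]
            for name in archive.namelist()
            if name.startswith(prefix) and not name.endswith("/")
        )


def missing_from_layer(zip_path, required=("pdftoppm", "pdfinfo"), prefix="bin/poppler/"):
    """Return the required tool names that are absent below prefix."""
    present = set(entries_under(zip_path, prefix))
    return [name for name in required if name not in present]


if __name__ == "__main__":
    print(missing_from_layer("poppler.zip"))
```

An empty list means the required tools sit under bin/poppler/ in the archive, which is where they end up below /opt once the layer is attached.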

For pdf2image to work, it needs the poppler path. Pass it in the Lambda function as "/opt/bin/poppler".

For example:

poppler_path = "/opt/bin/poppler"
pages = convert_from_path(PDF_file, 500, poppler_path=poppler_path)
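Put together as a complete handler, it could look roughly like this. It is a sketch, not a tested deployment: the `POPPLER_PATH` environment variable, the `pdf_path` event field, and the helper names are all hypothetical, and it assumes the PDF is already on local disk. Output goes to /tmp because that is the writable directory in Lambda.

```python
import os

# Hypothetical override; falls back to the layer path described above.
POPPLER_PATH = os.environ.get("POPPLER_PATH", "/opt/bin/poppler")


def page_filename(pdf_path, page_number, out_dir="/tmp"):
    """Build an output file name such as /tmp/report-page002.png."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(out_dir, f"{stem}-page{page_number:03d}.png")


def convert_pdf(pdf_path, dpi=200, out_dir="/tmp"):
    """Render every page of the PDF to a PNG file and return the file paths."""
    from pdf2image import convert_from_path  # lazy import: needs pdf2image installed

    pages = convert_from_path(pdf_path, dpi=dpi, poppler_path=POPPLER_PATH)
    written = []
    for number, page in enumerate(pages, start=1):
        target = page_filename(pdf_path, number, out_dir)
        page.save(target, "PNG")
        written.append(target)
    return written


def lambda_handler(event, context):
    # "pdf_path" is a made-up event field for this illustration.
    return {"pages": convert_pdf(event["pdf_path"])}
```

The example above uses 500 dpi; the sketch defaults to 200 to keep memory use down, so pass `dpi=500` if you need the higher resolution.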

Faires answered 8/7, 2020 at 11:37 Comment(0)
