NLTK data files not found when using an AWS Lambda layer to hold data files
TLDR: My Question & Problem

  • Using SAM, I want to develop and test Lambda functions that use NLTK locally on my machine.
  • I created a Lambda layer to hold the NLTK data files required by the NLTK functions I want to run.
  • I created a Lambda function that calls word_tokenize, which requires the punkt data files to run.
  • My expectation is that the Lambda function will get the punkt data files from the Lambda layer I created.
  • However, when I run the Lambda function on my local machine, it does not pick up the NLTK data files from the layer I created and reports that the data files are not found.
  • What am I missing here? Is the folder structure of my layer wrong, or is my yml file wrong? Please help!

This is the error thrown when I run sam build and sam local invoke on my Lambda function:

**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/share/nltk_data'
    - '/var/lang/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
...more errors (see below for details)

Please help!
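
For reference, the "Searched in:" list in the error is just NLTK's data search path, exposed as nltk.data.path. A minimal sketch to inspect it; note that the layer path /opt/nltk_data is not in the default list, which turns out to be the crux of the issue (see the answer below):

import nltk

# Print every directory NLTK will search for data files, in order.
# /opt/nltk_data is absent unless it is appended explicitly.
print(nltk.data.path)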

Background

  • Using SAM to build the Lambda code
  • Python 3.9
  • Using NLTK to tokenize a string of words.
  • Windows machine

Objective:

  • Create and deploy a lambda function that can tokenize a string of words

What I did

I have already done the sam init.

Things I did:

  1. created the Lambda function "plag_check", which does the tokenizing
  2. created the Lambda layer "nltkDataFilesLayer", which holds the NLTK data files needed by the NLTK tokenizer
  3. adjusted my template.yaml
  4. built with sam build
  5. tested the code on my local machine with Docker: sam local invoke plag_check --event events/event.json

1. plag_check

For plag_check, I made a folder "plag_check"; inside it I have app.py.

plag_check/app.py:

import json
from nltk.tokenize import word_tokenize

def lambda_handler(event, context):

  text = "Hello, how are you doing?"
  tokens = word_tokenize(text)
  print(tokens)

  return {
    "statusCode": 200,
    "body": json.dumps({
        "message": "inbound",
        "text":text,
        "tokens": tokens,        
    }),
  }
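
For reference, once the punkt data resolves, word_tokenize splits the string into word and punctuation tokens, so the print above shows:

['Hello', ',', 'how', 'are', 'you', 'doing', '?']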

2. nltkDataFilesLayer

I have a folder in my SAM project "nltkDataFilesLayer". Its contents:

- nltkDataFilesLayer/
  - nltk_data/
    - corpora/
      - wordnet.zip
    - taggers/
      - averaged_perceptron_tagger/
      - averaged_perceptron_tagger.zip
    - tokenizers/
      - punkt/
      - punkt.zip

It just contains the nltk_data folder generated by nltk.download on my local machine.

DETAILS: How I made the nltkDataFilesLayer folder

  • I ran a Python script that calls nltk.download (a sketch of such a script follows this list)
  • nltk downloads the files to my AppData/Roaming/nltk_data folder on my Windows machine
  • I copied the entire nltk_data folder
  • I changed directory to the SAM project and created a folder "nltkDataFilesLayer"
  • I pasted the nltk_data folder into the newly created "nltkDataFilesLayer"
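
A minimal sketch of such a download script (download_dir is a standard nltk.download parameter; pointing it straight at the layer folder would skip the manual copy-and-paste steps):

import nltk

# Download the data sets the tokenizer (and friends) need, writing them
# directly into the layer folder instead of AppData/Roaming/nltk_data.
for pkg in ["punkt", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(pkg, download_dir="nltkDataFilesLayer/nltk_data")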

3. My template.yaml

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: >
  plag1

  Sample SAM Template for plag1

Globals:
  Function:
    Timeout: 60
    MemorySize: 512

Resources:
  nltkDataFilesLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: nltkDataFilesLayer
      Description: Data files for nltk
      ContentUri: ./nltkDataFilesLayer
      CompatibleRuntimes:
        - python3.9

  PCheck:
    Type: AWS::Serverless::Function 
    Properties:
      CodeUri: plag_check/
      Handler: app.lambda_handler
      Runtime: python3.9
      Layers:
        - !Ref nltkDataFilesLayer
      Architectures:
        - x86_64

4. I build my code with sam build

No problems here.

5. I test my code locally

I ran sam local invoke plag_check --event events/event.json and got an ERROR:

[ERROR] LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/share/nltk_data'
    - '/var/lang/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
  File "/var/task/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
END RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7
REPORT RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7  Init Duration: 0.56 ms  Duration: 6051.22 ms    Billed Duration: 6052 ms    Memory Size: 512 MB     Max Memory Used: 512 MB
{"errorMessage": "\n**********************************************************************\n  Resource \u001b[93mpunkt\u001b[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \u001b[31m>>> import nltk\n  >>> nltk.download('punkt')\n  \u001b[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \u001b[93mtokenizers/punkt/PY3/english.pickle\u001b[0m\n\n  Searched in:\n    - '/root/nltk_data'\n    - '/var/lang/nltk_data'\n    - '/var/lang/share/nltk_data'\n    - '/var/lang/lib/nltk_data'\n    - '/usr/share/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/lib/nltk_data'\n    - '/usr/local/lib/nltk_data'\n    - ''\n**********************************************************************\n", "errorType": "LookupError", "requestId": "188fc3e0-f33c-4f40-be51-a6a631ac53b7", "stackTrace": ["  File \"/var/task/app.py\", line 8, in lambda_handler\n    tokens = word_tokenize(text)\n", "  File \"/var/task/nltk/tokenize/__init__.py\", line 129, in word_tokenize\n    sentences = [text] if preserve_line else sent_tokenize(text, language)\n", "  File \"/var/task/nltk/tokenize/__init__.py\", line 106, in sent_tokenize\n    tokenizer = load(f\"tokenizers/punkt/{language}.pickle\")\n", "  File \"/var/task/nltk/data.py\", line 750, in load\n    opened_resource = _open(resource_url)\n", "  File \"/var/task/nltk/data.py\", line 876, in _open\n    return find(path_, path + [\"\"]).open()\n", "  File \"/var/task/nltk/data.py\", line 583, in find\n    raise LookupError(resource_not_found)\n"]}
Cristencristi asked 28/6, 2023 at 5:18

Finally, solution found. It was so simple!

Solution

Add nltk.data.path.append("/opt/nltk_data") to the lambda_handler of plag_check (and import nltk at the top so the call resolves).

New plag_check code:

import json
import nltk  # needed for nltk.data.path below
from nltk.tokenize import word_tokenize

def lambda_handler(event, context):

  nltk.data.path.append("/opt/nltk_data") # this one line fixed it!

  text = "Hello, how are you doing?"
  tokens = word_tokenize(text)
  print(tokens)

  return {
    "statusCode": 200,
    "body": json.dumps({
        "message": "inbound",
        "text":text,
        "tokens": tokens,        
    }),
  }

NOTE: I put "/opt/" before the nltk_data.
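
A variant that should also work (untested here, an assumption on my part): NLTK also reads the NLTK_DATA environment variable when it builds its search path, so the same fix can live in template.yaml instead of the handler code:

  PCheck:
    Type: AWS::Serverless::Function
    Properties:
      # ...existing properties as above...
      Environment:
        Variables:
          NLTK_DATA: /opt/nltk_data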

Where I found the solution:

I thought about this after reading Accessing layer content from your function (AWS article)

It says:

If your Lambda function includes layers, Lambda extracts the layer contents into the /opt directory in the function execution environment.

This is one of the important things to know when working with layers: all of the layer contents end up under the /opt folder. This is just how AWS Lambda works.
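
A quick way to verify this from inside the function (a debugging sketch, not part of the original code) is to list /opt in the handler:

import os

# Show what the layer actually mounted; with the layer above,
# /opt should contain the nltk_data folder from ContentUri.
print(os.listdir("/opt"))            # expected: ['nltk_data']
print(os.listdir("/opt/nltk_data"))  # expected: ['corpora', 'taggers', 'tokenizers']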

Some other useful facts I learned (of some relevance)

From the article:

Lambda extracts the layers in the order (low to high) listed by the function.

Lambda merges folders with the same name, so if the same file appears in multiple layers, the function uses the version in the last extracted layer.

Cristencristi answered 28/6, 2023 at 6:54
