TLDR: My Question & Problem
- Using SAM, I want to develop and test Lambda functions that use NLTK locally on my machine.
- I created a Lambda layer to hold the NLTK data files required by the NLTK functions I want to run.
- I created a Lambda function that calls `word_tokenize`, which requires the `punkt` data file to run.
- My expectation is that the Lambda function will get the `punkt` data files from the Lambda layer I created.
- However, when I run the Lambda function on my local machine, it does not pick up the NLTK data files from the layer and says the data files are not found.
- What am I missing here? Is the folder structure of my layer wrong, or is my yml file wrong? Please help!
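For context on my expectation: Lambda extracts layer contents under `/opt` inside the runtime container, so (if I understand layers correctly) an `nltk_data/` folder at the layer root should end up at the path sketched below. The exact path is my assumption based on how layers are mounted:

```python
# Lambda layers are extracted under /opt inside the runtime container,
# so nltk_data/ at the root of my layer should end up here:
expected_data_dir = "/opt/nltk_data"
punkt_dir = expected_data_dir + "/tokenizers/punkt"
print(punkt_dir)  # /opt/nltk_data/tokenizers/punkt
```

Notably, `/opt/nltk_data` does not appear anywhere in the "Searched in:" list from the error below.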
This is the error thrown when I `sam build` and `sam local invoke` my lambda function:
```
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/share/nltk_data'
    - '/var/lang/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
```
...more errors (see below for details)
Please help!
Background
- Using SAM to make lambda code
- Python 3.9
- Using NLTK to tokenize a string of words.
- Windows machine
Objective:
- Create and deploy a lambda function that can tokenize a string of words
What I did
I have already run `sam init`. Then I:
- created the Lambda function "plag_check", which does the tokenizing
- created the Lambda layer "nltkDataFilesLayer", which holds the NLTK data files needed by the NLTK tokenizer
- adjusted my template.yml
- built with `sam build`
- tested the code on my local machine with Docker via `sam local invoke plag_check --event events/event.json`
1. plag_check
For plag_check, I made a folder "plag_check" and inside it I have app.py.
plag_check/app.py:
```python
import json
from nltk.tokenize import word_tokenize


def lambda_handler(event, context):
    text = "Hello, how are you doing?"
    tokens = word_tokenize(text)
    print(tokens)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": "inbound",
            "text": text,
            "tokens": tokens,
        }),
    }
```
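For reference, this is the response shape I expect the handler to return once the data files are found. The token split shown is my hand-written guess at what `word_tokenize` would produce, not actual output:

```python
import json

# Hand-built example of the expected handler response; the token list
# is an assumption, not output I have actually obtained.
tokens = ["Hello", ",", "how", "are", "you", "doing", "?"]
response = {
    "statusCode": 200,
    "body": json.dumps({
        "message": "inbound",
        "text": "Hello, how are you doing?",
        "tokens": tokens,
    }),
}
print(response["body"])
```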
2. nltkDataFilesLayer
I have a folder in my SAM project "nltkDataFilesLayer". Its contents:
- nltkDataFilesLayer/
  - nltk_data/
    - corpora/
      - wordnet.zip
    - taggers/
      - averaged_perceptron_tagger/
        - averaged_perceptron_tagger.zip
    - tokenizers/
      - punkt/
        - punkt.zip
It just contains the nltk_data folder generated by `nltk.download` on my local machine.
DETAILS: How I made the nltkDataFilesLayer folder
- I ran a Python script that calls `nltk.download`.
- NLTK downloads the files to the AppData/Roaming/nltk_data folder on my Windows machine.
- I copied the entire nltk_data folder.
- I changed directory to my SAM project and created a folder "nltkDataFilesLayer".
- I pasted the nltk_data folder into the newly created "nltkDataFilesLayer".
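The download step was along these lines. This sketch passes `download_dir` (an option `nltk.download` supports) to fetch the packages straight into the layer folder, which would skip the copy/paste from AppData; the helper name and target path are just illustrative:

```python
import nltk

# Fetch the needed NLTK data packages directly into the layer folder,
# instead of the default AppData/Roaming/nltk_data location.
def fetch_layer_data(target="nltkDataFilesLayer/nltk_data"):
    for pkg in ("punkt", "wordnet", "averaged_perceptron_tagger"):
        nltk.download(pkg, download_dir=target)

if __name__ == "__main__":
    fetch_layer_data()
```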
3. My template.yaml
```yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: >
  plag1
  Sample SAM Template for plag1

Globals:
  Function:
    Timeout: 60
    MemorySize: 512

Resources:
  nltkDataFilesLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: nltkDataFilesLayer
      Description: Data files for nltk
      ContentUri: ./nltkDataFilesLayer
      CompatibleRuntimes:
        - python3.9

  PCheck:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: plag_check/
      Handler: app.lambda_handler
      Runtime: python3.9
      Layers:
        - !Ref nltkDataFilesLayer
      Architectures:
        - x86_64
```
4. I build my code with `sam build`
No problems here.
5. I test my code locally
I ran `sam local invoke plag_check --event events/event.json` and got an error:
```
[ERROR] LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/share/nltk_data'
    - '/var/lang/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
  File "/var/task/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
END RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7
REPORT RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7 Init Duration: 0.56 ms Duration: 6051.22 ms Billed Duration: 6052 ms Memory Size: 512 MB Max Memory Used: 512 MB
{"errorMessage": "\n**********************************************************************\n Resource \u001b[93mpunkt\u001b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \u001b[31m>>> import nltk\n >>> nltk.download('punkt')\n \u001b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \u001b[93mtokenizers/punkt/PY3/english.pickle\u001b[0m\n\n Searched in:\n - '/root/nltk_data'\n - '/var/lang/nltk_data'\n - '/var/lang/share/nltk_data'\n - '/var/lang/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n - ''\n**********************************************************************\n", "errorType": "LookupError", "requestId": "188fc3e0-f33c-4f40-be51-a6a631ac53b7", "stackTrace": [" File \"/var/task/app.py\", line 8, in lambda_handler\n tokens = word_tokenize(text)\n", " File \"/var/task/nltk/tokenize/__init__.py\", line 129, in word_tokenize\n sentences = [text] if preserve_line else sent_tokenize(text, language)\n", " File \"/var/task/nltk/tokenize/__init__.py\", line 106, in sent_tokenize\n tokenizer = load(f\"tokenizers/punkt/{language}.pickle\")\n", " File \"/var/task/nltk/data.py\", line 750, in load\n opened_resource = _open(resource_url)\n", " File \"/var/task/nltk/data.py\", line 876, in _open\n return find(path_, path + [\"\"]).open()\n", " File \"/var/task/nltk/data.py\", line 583, in find\n raise LookupError(resource_not_found)\n"]}
```