NLTK data files not found when using an AWS Lambda layer to hold data files
TLDR: My Question & Problem

  • Using SAM, I want to develop and test Lambda functions that use NLTK locally on my machine.
  • I created a Lambda layer to hold the NLTK data files required by the NLTK functions I want to run.
  • I created a Lambda function that calls word_tokenize, which requires the punkt data files to run.
  • My expectation is that the Lambda function will get the punkt data files from the Lambda layer I created.
  • However, when I run the Lambda function on my local machine, it does not pick up the NLTK data files from the layer I created and reports that the data files are not found.
  • What am I missing here? Is the folder structure of my layer wrong, or is my yml file wrong? Please help!

This is the error thrown when I run sam build and sam local invoke on my Lambda function:

**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/share/nltk_data'
    - '/var/lang/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
...more errors (see below for details)

Please help!
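
For reference, the "Searched in:" list in the error is just NLTK's data search path, exposed as nltk.data.path. A minimal sketch to inspect it; note that the layer path /opt/nltk_data is not in the default list, which turns out to be the crux of the issue (see the answer below):

import nltk

# Print every directory NLTK will search for data files, in order.
# /opt/nltk_data is absent unless it is appended explicitly.
print(nltk.data.path)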

Background

  • Using SAM to build the Lambda code
  • Python 3.9
  • Using NLTK to tokenize a string of words.
  • Windows machine

Objective:

  • Create and deploy a lambda function that can tokenize a string of words

What I did

I have already done the sam init.

Things I did:

  1. created the Lambda function "plag_check", which does the tokenizing
  2. created the Lambda layer "nltkDataFilesLayer", which holds the NLTK data files needed by the NLTK tokenizer
  3. adjusted my template.yaml
  4. built with sam build
  5. tested the code on my local machine with Docker: sam local invoke plag_check --event events/event.json

1. plag_check

For plag_check, I made a folder "plag_check"; inside it I have app.py.

plag_check/app.py:

import json
from nltk.tokenize import word_tokenize

def lambda_handler(event, context):

  text = "Hello, how are you doing?"
  tokens = word_tokenize(text)
  print(tokens)

  return {
    "statusCode": 200,
    "body": json.dumps({
        "message": "inbound",
        "text":text,
        "tokens": tokens,        
    }),
  }
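
For reference, once the punkt data resolves, word_tokenize splits the string into word and punctuation tokens, so the print above shows:

['Hello', ',', 'how', 'are', 'you', 'doing', '?']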

2. nltkDataFilesLayer

I have a folder in my SAM project "nltkDataFilesLayer". Its contents:

- nltkDataFilesLayer/
  - nltk_data/
    - corpora/
      - wordnet.zip
    - taggers/
      - averaged_perceptron_tagger/
      - averaged_perceptron_tagger.zip
    - tokenizers/
      - punkt/
      - punkt.zip

It just contains the nltk_data folder generated by nltk.download on my local machine.

DETAILS: How I made the nltkDataFilesLayer folder

  • I ran a Python script that calls nltk.download (a sketch of such a script follows this list)
  • nltk downloads the files to my AppData/Roaming/nltk_data folder on my Windows machine
  • I copied the entire nltk_data folder
  • I changed directory to the SAM project and created a folder "nltkDataFilesLayer"
  • I pasted the nltk_data folder into the newly created "nltkDataFilesLayer"
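
A minimal sketch of such a download script (download_dir is a standard nltk.download parameter; pointing it straight at the layer folder would skip the manual copy-and-paste steps):

import nltk

# Download the data sets the tokenizer (and friends) need, writing them
# directly into the layer folder instead of AppData/Roaming/nltk_data.
for pkg in ["punkt", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(pkg, download_dir="nltkDataFilesLayer/nltk_data")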

3. My template.yaml

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: >
  plag1

  Sample SAM Template for plag1

Globals:
  Function:
    Timeout: 60
    MemorySize: 512

Resources:
  nltkDataFilesLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: nltkDataFilesLayer
      Description: Data files for nltk
      ContentUri: ./nltkDataFilesLayer
      CompatibleRuntimes:
        - python3.9

  PCheck:
    Type: AWS::Serverless::Function 
    Properties:
      CodeUri: plag_check/
      Handler: app.lambda_handler
      Runtime: python3.9
      Layers:
        - !Ref nltkDataFilesLayer
      Architectures:
        - x86_64

4. I build my code with sam build

No problems here.

5. I test my code locally

I ran sam local invoke plag_check --event events/event.json and got an ERROR:

[ERROR] LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/var/lang/nltk_data'
    - '/var/lang/share/nltk_data'
    - '/var/lang/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
  File "/var/task/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
END RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7
REPORT RequestId: 188fc3e0-f33c-4f40-be51-a6a631ac53b7  Init Duration: 0.56 ms  Duration: 6051.22 ms    Billed Duration: 6052 ms    Memory Size: 512 MB     Max Memory Used: 512 MB
{"errorMessage": "\n**********************************************************************\n  Resource \u001b[93mpunkt\u001b[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \u001b[31m>>> import nltk\n  >>> nltk.download('punkt')\n  \u001b[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \u001b[93mtokenizers/punkt/PY3/english.pickle\u001b[0m\n\n  Searched in:\n    - '/root/nltk_data'\n    - '/var/lang/nltk_data'\n    - '/var/lang/share/nltk_data'\n    - '/var/lang/lib/nltk_data'\n    - '/usr/share/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/lib/nltk_data'\n    - '/usr/local/lib/nltk_data'\n    - ''\n**********************************************************************\n", "errorType": "LookupError", "requestId": "188fc3e0-f33c-4f40-be51-a6a631ac53b7", "stackTrace": ["  File \"/var/task/app.py\", line 8, in lambda_handler\n    tokens = word_tokenize(text)\n", "  File \"/var/task/nltk/tokenize/__init__.py\", line 129, in word_tokenize\n    sentences = [text] if preserve_line else sent_tokenize(text, language)\n", "  File \"/var/task/nltk/tokenize/__init__.py\", line 106, in sent_tokenize\n    tokenizer = load(f\"tokenizers/punkt/{language}.pickle\")\n", "  File \"/var/task/nltk/data.py\", line 750, in load\n    opened_resource = _open(resource_url)\n", "  File \"/var/task/nltk/data.py\", line 876, in _open\n    return find(path_, path + [\"\"]).open()\n", "  File \"/var/task/nltk/data.py\", line 583, in find\n    raise LookupError(resource_not_found)\n"]}
Cristencristi asked 28/6, 2023 at 5:18

Finally, solution found. It was so simple!

Solution

Add nltk.data.path.append("/opt/nltk_data") to the lambda_handler of plag_check (and import nltk at the top so the call resolves).

New plag_check code:

import json
import nltk  # needed for nltk.data.path below
from nltk.tokenize import word_tokenize

def lambda_handler(event, context):

  nltk.data.path.append("/opt/nltk_data") # this one line fixed it!

  text = "Hello, how are you doing?"
  tokens = word_tokenize(text)
  print(tokens)

  return {
    "statusCode": 200,
    "body": json.dumps({
        "message": "inbound",
        "text":text,
        "tokens": tokens,        
    }),
  }

NOTE: I put "/opt/" before the nltk_data.
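
A variant that should also work (untested here, an assumption on my part): NLTK also reads the NLTK_DATA environment variable when it builds its search path, so the same fix can live in template.yaml instead of the handler code:

  PCheck:
    Type: AWS::Serverless::Function
    Properties:
      # ...existing properties as above...
      Environment:
        Variables:
          NLTK_DATA: /opt/nltk_data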

Where I found the solution:

I thought about this after reading Accessing layer content from your function (AWS article)

It says:

If your Lambda function includes layers, Lambda extracts the layer contents into the /opt directory in the function execution environment.

This is one of the important things to know when working with layers: all of the layer contents end up under the /opt folder. This is just how AWS Lambda works.
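
A quick way to verify this from inside the function (a debugging sketch, not part of the original code) is to list /opt in the handler:

import os

# Show what the layer actually mounted; with the layer above,
# /opt should contain the nltk_data folder from ContentUri.
print(os.listdir("/opt"))            # expected: ['nltk_data']
print(os.listdir("/opt/nltk_data"))  # expected: ['corpora', 'taggers', 'tokenizers']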

Some other useful facts I learned (of some relevance)

From the article:

Lambda extracts the layers in the order (low to high) listed by the function.

Lambda merges folders with the same name, so if the same file appears in multiple layers, the function uses the version in the last extracted layer.

Cristencristi answered 28/6, 2023 at 6:54
