How do I test whether an nltk resource is already installed on the machine running my code?

Asked 16/5, 2014 at 21:3 Answered 21/2, 2024 at 11:32

I just started my first NLTK project and am confused about the proper setup. I need several resources, such as the Punkt tokenizer and the maxent POS tagger. I downloaded them myself using the GUI that nltk.download() opens. For my collaborators, of course, I want these resources to be downloaded automatically. I haven't found any idiomatic code for that in the docs.

Am I supposed to just put nltk.data.load('tokenizers/punkt/english.pickle') and the like into the code? Will this download the resources every time the script is run? Should I give the user (i.e. my co-developers) feedback about what is being downloaded and why it is taking so long? There MUST be something out there that does the job, right? :)

Edit: To clarify my question:
How do I test whether an nltk resource (like the Punkt Tokenizer) is already installed on the machine running my code, and install it if it is not?

Asmara answered 16/5, 2014 at 21:3 Comment(3)

I'm having trouble determining what you're asking. A concise, testable code example demonstrating your current approach would be very helpful. – Creath 16/5, 2014 at 21:43

Let me reframe the question: How do I test whether an nltk resource (like the Punkt Tokenizer) is already installed on the machine running my code, and install it if it is not? – Asmara 16/5, 2014 at 22:54

Edit your question to match your comment. A short question left only in the comments may get overlooked. – Cowans 17/5, 2014 at 12:30

You can use the nltk.data.find() function, see https://github.com/nltk/nltk/blob/develop/nltk/data.py:

>>> import nltk
>>> nltk.data.find('tokenizers/punkt.zip')
ZipFilePathPointer(u'/home/alvas/nltk_data/tokenizers/punkt.zip', u'')

When the resource is not available, a LookupError is raised:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-3.0a3-py2.7.egg/nltk/data.py", line 615, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'punkt.zip' not found.  Please use the NLTK Downloader
  to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

Most probably, you would like to do something like this to ensure that your collaborators have the package:

>>> try:
...     nltk.data.find('tokenizers/punkt')
... except LookupError:
...     nltk.download('punkt')
... 
[nltk_data] Downloading package punkt to /home/alvas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
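The same check can be wrapped in a small helper that each script calls once at startup. This is only a sketch: ensure_resource is a hypothetical name, not part of NLTK, and the find/download parameters are there purely so the logic can be tried without NLTK or a network connection. When they are omitted, it falls back to nltk.data.find and nltk.download.

```python
from typing import Callable, Optional


def ensure_resource(
    resource_path: str,
    package_id: str,
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], object]] = None,
) -> bool:
    """Download package_id only if resource_path cannot be found.

    Returns True if a download was triggered, False otherwise.
    ensure_resource is a hypothetical helper, not an NLTK function.
    """
    if find is None or download is None:
        import nltk  # imported lazily so stand-in callables work without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)
    except LookupError:
        # The resource is missing from every searched directory: fetch it.
        download(package_id)
        return True
    return False
```

Called as ensure_resource('tokenizers/punkt', 'punkt'), it should behave like the try/except above.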

Nemesis answered 17/5, 2014 at 19:50 Comment(7)

There is a trap to this approach, which is that you can't reliably use it to install the data in a non-interactive application. Python will import nltk without the downloaded resource. If you discover this fact with a LookupError and then try to run nltk.download and then re-import the relevant nltk module, Python will believe nltk was already imported and not re-import anything. So even though you'll have downloaded the new data artifact, the imported version of NLTK will still be the one that was booted up without access to it. – Padriac 25/7, 2019 at 15:11

For example, you often need from nltk import wordnet, but this submodule of nltk only exists if wordnet was downloaded before nltk was imported. If you wrap this import in try/except, check for LookupError, and then dynamically run nltk.download('wordnet'), it will indeed install the data for wordnet, but re-running from nltk import wordnet will still fail (the nltk module being referenced will still be the one that booted up with no wordnet submodule in it). – Padriac 25/7, 2019 at 15:13

@Padriac what's the remedy then? – Atworth 14/12, 2021 at 19:6

PEP 8 recommends putting all imports at the top of the file. In this case, though, it seems we can avoid the trap only by running "import nltk" first, then the try-except clause, and finally the specific import like "from nltk import ...". This still seems like a bit of a workaround. – Giza 17/2, 2022 at 9:46

Or maybe "from nltk import data, download" first, and after the try-except surround "import nltk"? – Giza 17/2, 2022 at 9:53

@Giza Can you share some code showing the workaround? I don't quite understand it. – Schaefer 31/5, 2022 at 0:35

@SomnathRakshit, I'm posting an example below – Giza 29/7, 2022 at 12:13

Following Somnath's comment, I am posting an example of the try-except workaround. Here we look up the comtrans corpus, which is not included in the NLTK data by default.

from nltk.corpus import comtrans
from nltk import download

try:
    words = comtrans.words('alignment-en-fr.txt')
except LookupError:
    print('resource not found. Downloading now...')
    download('comtrans')
    words = comtrans.words('alignment-en-fr.txt')
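If several corpora need the same treatment, the retry can be factored out. A rough sketch, assuming the resource is only touched inside the callable that is passed in; call_with_download is a made-up name, and download defaults to nltk.download when not supplied.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_download(
    action: Callable[[], T],
    package_id: str,
    download: Optional[Callable[[str], object]] = None,
) -> T:
    """Run action(); on LookupError download package_id and retry once.

    call_with_download is a hypothetical helper, not part of NLTK.
    """
    try:
        return action()
    except LookupError:
        if download is None:
            import nltk  # real downloader, used unless a stand-in is passed

            download = nltk.download
        print(f"resource {package_id!r} not found. Downloading now...")
        download(package_id)
        # Retry once; a second LookupError propagates to the caller.
        return action()
```

With the example above, this would be used as words = call_with_download(lambda: comtrans.words('alignment-en-fr.txt'), 'comtrans').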

Giza answered 29/7, 2022 at 12:15 Comment(0)

Thought I'd give my 2 cents on this even if a bit late to the party.

nltk has two relevant tools: the download() function and the Downloader class.

download() already contains logic that checks if the package is downloaded & up to date:

from pathlib import Path
from nltk import download as nltk_download
from typing import List, Any
from nltk.downloader import Downloader
import logging

def download_nltk_data(
        list_of_resources: List[str],
        download_dir: Path,
) -> None:
    for resource in list_of_resources:
        nltk_download(
            info_or_id=resource,
            download_dir=download_dir,
            quiet=False,  # Set to True to suppress the messages
        )

download_nltk_data(
    list_of_resources=[
        'stopwords',
        'punkt',
    ],
    download_dir=Path('./data/nltk/'),
)

Output:

[nltk_data] Downloading package stopwords to data\nltk...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to data\nltk...
[nltk_data]   Package punkt is already up-to-date!

Set quiet=True if you want to suppress these messages.
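One thing to keep in mind with a custom download_dir: NLTK only searches the directories listed in nltk.data.path, so a project-local folder such as ./data/nltk/ has to be added there before the downloaded resources can be loaded. A small sketch of that step; register_data_dir is a hypothetical helper, and the search_path argument exists only so it can be tried on a plain list instead of the real nltk.data.path.

```python
from pathlib import Path
from typing import List, Optional


def register_data_dir(
    download_dir: Path,
    search_path: Optional[List[str]] = None,
) -> List[str]:
    """Put download_dir at the front of NLTK's search path, once.

    register_data_dir is a hypothetical helper, not part of NLTK.
    """
    if search_path is None:
        import nltk  # imported lazily; nltk.data.path is the real search list

        search_path = nltk.data.path
    resolved = str(download_dir.resolve())
    if resolved not in search_path:
        # Insert at the front so the project-local copy is found first.
        search_path.insert(0, resolved)
    return search_path
```

Calling register_data_dir(Path('./data/nltk/')) right after the download should be enough for later lookups to see the folder.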

If for some reason you want finer control over the packages, you can use the Downloader class and extend its functionality:

def check_package_exists(
    package_id: Any,
    download_dir: Path,
) -> bool:
    downloader = Downloader(download_dir=str(download_dir))
    return downloader.is_installed(package_id)

def download_nltk_data(
    list_of_resources: List[str],
    download_dir: Path,
) -> None:
    download_dir.mkdir(parents=True, exist_ok=True)
    downloader = Downloader(download_dir=str(download_dir))
    for resource in list_of_resources:
        if not check_package_exists(resource, download_dir):
            logging.debug(f'Downloading {resource} to {download_dir}')
            downloader.download(info_or_id=resource, quiet=True)
        else:
            logging.debug(f'{resource} already exists in {download_dir}')


download_nltk_data(
    list_of_resources=[
        'stopwords',
        'punkt',
    ],
    download_dir=Path('./data/nltk/'),
)

Output (visible only if logging is configured to show DEBUG messages):

stopwords already exists in data\nltk
punkt already exists in data\nltk

Or something like that
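A note on the logging calls: Python's root logger only shows WARNING and above by default, so the logging.debug messages stay hidden unless logging is configured first. A minimal sketch, assuming Python 3.8+ (for the force argument) and a bare message format chosen here only to resemble the output above:

```python
import logging

# Show DEBUG messages and print only the message text.
# force=True (Python 3.8+) replaces any handlers that were set up earlier.
logging.basicConfig(level=logging.DEBUG, format="%(message)s", force=True)

logging.debug("logging is now configured to show debug messages")
```

Run this once before calling download_nltk_data.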

Sculptress answered 21/2, 2024 at 11:32 Comment(0)
