Installing nltk data dependencies in setup.py script

I use NLTK with WordNet in my project. I installed it manually on my PC with pip (pip3 install nltk --user in a terminal), then ran nltk.download() in a Python shell to download WordNet.

I want to automate these steps with a setup.py file, but I don't know a good way to install WordNet.

For now, I have this code after the call to setup() ("nltk" is in the install_requires list passed to setup()):

import sys
if 'install' in sys.argv:
    import nltk
    nltk.download("wordnet")

Is there a better way to do this?

Jerol answered 7/11, 2014 at 11:7 Comment(3)
@martin-thoma from a quick glance, looks like the nltk data dependencies could be packaged as Python projects and distributed on PyPI without too much work. The whole thing could be relatively easily scripted and delegated to a CI/CD system. You should weigh in on these tickets: github.com/nltk/nltk_data/issues/12 github.com/nltk/nltk/issues/2228 - Spagyric
@martin-thoma also, here is a rather similar post I wrote about the same problem with spacy: https://mcmap.net/q/263588/-package-spacy-model/… does that apply to your situation as well? - Winery
For my use case, the best option seemed to be to list all dependencies in a requirements.txt file and use pip install -r requirements.txt first. Then in my setup.py I have the manual download command nltk.download("punkt") which is used when I run pip install -e . I believe this works because I'm building a Docker image/container, not trying to distribute a package. - Chianti

I managed to install the NLTK data in setup.py by overriding cmdclass with my own Install class:

from setuptools import setup, find_packages
from setuptools.command.install import install as _install


class Install(_install):
    def run(self):
        _install.do_egg_install(self)
        import nltk
        nltk.download("popular")

setup(...
    cmdclass={'install': Install},
    ...
    install_requires=[
      'nltk',
      ],
    setup_requires=['nltk']
    ...
   )

It is important to call do_egg_install() in your run() method so that nltk is installed before import nltk runs (see also "python setuptools install_requires is ignored when overriding cmdclass"). Also, don't forget to add nltk to setup_requires.

Huppert answered 14/4, 2015 at 13:23 Comment(3)
Doesn't work for me! - Grasp
Also did not work for me - Chianti
Also did not work for me - Champignon

You can also automate the installation with a shell script, for example by running the following after installing nltk with pip:

python -m nltk.downloader -d /usr/share/nltk_data wordnet
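
If the rest of your tooling is in Python, the same command can be driven from a small script instead of a shell script. This is only a sketch under a few assumptions: the helper names build_download_command and download_nltk_data are made up for illustration, nltk is already installed for the interpreter running the script, and the target directory is writable.

```python
import subprocess
import sys


def build_download_command(packages, target_dir):
    """Build the `python -m nltk.downloader` command line as a list.

    `packages` is a list of NLTK package IDs such as ["wordnet"];
    `target_dir` is passed to the downloader's -d option.
    """
    return [sys.executable, "-m", "nltk.downloader", "-d", target_dir, *packages]


def download_nltk_data(packages, target_dir):
    """Run the downloader in a subprocess; raises CalledProcessError on failure."""
    subprocess.run(build_download_command(packages, target_dir), check=True)


if __name__ == "__main__":
    # Same effect as the shell command above (needs network access).
    download_nltk_data(["wordnet"], "/usr/share/nltk_data")
```

Using sys.executable rather than a bare "python" keeps the download tied to the interpreter (and virtual environment) that actually has nltk installed.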
Jaipur answered 30/11, 2014 at 18:51 Comment(0)

As stated in this thread, external data should not be handled by setuptools in setup.py. As an alternative, I suggest adding the following lines to your package's __init__.py file (assuming you want to download punkt and stopwords):

__version__ = "x.x.x"
__organization__ = "your_organization"  
import nltk 
nltk.download("stopwords") 
nltk.download("punkt")  

This way, the files are downloaded not when the package is installed but when it is imported (i.e. import my_package).
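
A possible refinement of the import-time approach is to look for the data first and call the downloader only when it is missing. The sketch below assumes a hypothetical helper named ensure_nltk_resource; the find and download parameters exist only so the logic can be exercised without NLTK or network access. The resource paths shown ("corpora/stopwords", "tokenizers/punkt") are assumptions about where NLTK stores these packages; depending on the NLTK version and whether the archive was unpacked, a path may need a .zip suffix.

```python
def ensure_nltk_resource(resource_path, package_id, find=None, download=None):
    """Download `package_id` only if `resource_path` cannot be found.

    Returns True if a download was triggered, False if the data was present.
    """
    if find is None or download is None:
        import nltk  # imported lazily so this module loads without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the data is missing
        return False
    except LookupError:
        download(package_id, quiet=True)
        return True


# Hypothetical usage in a package's __init__.py:
# ensure_nltk_resource("corpora/stopwords", "stopwords")
# ensure_nltk_resource("tokenizers/punkt", "punkt")
```

With this check in place, importing the package triggers a download only the first time, when the data is not yet on disk.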


As an example, I share a link to a Python library that does just this.

First, install the library:

pip install -U pyleetspeak

Importing the library then downloads the NLTK files:

import pyleetspeak
pyleetspeak.__version__

Phraseology answered 19/12, 2022 at 15:20 Comment(0)

This setup worked for me:

from setuptools import setup, find_packages
from setuptools.command.install import install

class InstallCommand(install):
    def run(self):
        install.run(self)
        import nltk
        nltk.download('wordnet')

setup(
    # other options...

    install_requires=['nltk'],
    setup_requires=['nltk'],
    cmdclass={
        'install': InstallCommand,
    }
)
Ruralize answered 20/2, 2024 at 19:29 Comment(0)