NLTK French tokenizer in Python not working

Why is the French tokenizer that comes with NLTK not working for me? Am I doing something wrong?

I'm doing:

import nltk
content_french = ["Les astronomes amateurs jouent également un rôle important en recherche; les plus sérieux participant couramment au suivi d'étoiles variables, à la découverte de nouveaux astéroïdes et de nouvelles comètes, etc.", 'Séquence vidéo.', "John Richard Bond explique le rôle de l'astronomie."]
tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
for i in content_french:
    print(i)
    print(tokenizer.tokenize(i))

But I get non-tokenized output like

John Richard Bond explique le rôle de l'astronomie.
["John Richard Bond explique le rôle de l'astronomie."]
Eonian answered 23/2, 2017 at 23:54
Off-topic: NLTK is a very outdated package that shouldn't be used for any work these days. If you want a modern solution with better models, try spaCy. – Protero
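
For reference, a minimal spaCy sketch (assuming spaCy and a French model are installed; fr_core_news_sm is just one choice of model, e.g. via python -m spacy download fr_core_news_sm):

import spacy

# Load a French pipeline; its tokenizer splits elisions such as "l'".
nlp = spacy.load("fr_core_news_sm")
doc = nlp("John Richard Bond explique le rôle de l'astronomie.")
print([token.text for token in doc])
# Expected (roughly): ['John', 'Richard', 'Bond', 'explique', 'le', 'rôle', 'de', "l'", 'astronomie', '.']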

tokenizer.tokenize() is a sentence tokenizer (splitter). If you want to tokenize words, use word_tokenize():

import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on the Punkt models; run nltk.download('punkt') once if they are missing.
content_french = ["Les astronomes amateurs jouent également un rôle important en recherche; les plus sérieux participant couramment au suivi d'étoiles variables, à la découverte de nouveaux astéroïdes et de nouvelles comètes, etc.", 'Séquence vidéo.', "John Richard Bond explique le rôle de l'astronomie."]
for i in content_french:
    print(i)
    print(word_tokenize(i, language='french'))
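
Note that the call in the question was not actually broken: french.pickle is a Punkt sentence tokenizer, so it returns the input split into sentences, one string each. A minimal sketch of that behaviour (assuming the NLTK punkt data is installed):

import nltk

# Punkt splits running text into sentences; it never splits words.
sent_tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle')
text = "Séquence vidéo. John Richard Bond explique le rôle de l'astronomie."
print(sent_tokenizer.tokenize(text))
# Two sentences out, e.g.:
# ['Séquence vidéo.', "John Richard Bond explique le rôle de l'astronomie."]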

Fin answered 24/2, 2017 at 0:17

The issue with this tokenizer is that it is not effective for French sentences:

from nltk.tokenize import word_tokenize
content_french = "John Richard Bond explique le rôle de l'astronomie."
word_tokenize(content_french, language='french')
>> ['John', 'Richard', 'Bond', 'explique', 'le', 'rôle', 'de', "l'astronomie", '.']

"l'astronomie" should be tokenized as ["l'", 'astronomie'].

You can build a better tokenizer using the RegexpTokenizer, as follows:

from nltk import RegexpTokenizer
tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
tokenizer.tokenize(content_french)
>> ['John', 'Richard', 'Bond', ..., "l'", 'astronomie', '.']
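
For clarity, the pattern is three alternatives tried left to right; a small sketch of what each one matches:

from nltk import RegexpTokenizer

# \w'      -> one word character followed by an apostrophe, e.g. "l'", "d'"
# \w+      -> a run of word characters, i.e. ordinary words
# [^\w\s]  -> any single character that is neither word nor whitespace (punctuation)
tokenizer = RegexpTokenizer(r'''\w'|\w+|[^\w\s]''')
print(tokenizer.tokenize("l'astronomie, d'étoiles."))
# ["l'", 'astronomie', ',', "d'", 'étoiles', '.']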
Chaworth answered 14/5, 2018 at 14:58
Dear @J.Doe, thanks a lot for your answer! Could you please elaborate a bit more on the content of the regexp? E.g. what does ''' stand for at the beginning and at the end? Thanks! – Bruns
@TommasoDiNoto Nothing special; it could also be "..." (or any other delimiter that doesn't break on the apostrophe in the pattern). – Chaworth
