ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000 (spaCy)
I'm trying to create a corpus of words from a text. I'm using spaCy. Here is my code:

import spacy
nlp = spacy.load('fr_core_news_md')
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But it raises this exception:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

I tried something like this:

import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1027203
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But I got the same error:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

How can I fix it?

Shrub answered 27/7, 2019 at 11:21 Comment(2)
What happened when you tried to run the second version? – Disordered
What line exactly did the error message refer to? Please include the complete traceback of the error message. – Disordered

I differ from the other answer: I think nlp.max_length did take effect, but the value it was set to is too low. It looks like you set it to exactly the length reported in the error message. Increase nlp.max_length to a little over that number:

nlp.max_length = 1030000 # or even higher

It should ideally work after this.

So your code could be changed to this:

import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1030000 # or higher
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()
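As a side note, the error message itself hints at a cheaper option: if you only need lemmas, as in the question, the memory-hungry parser and NER components can be disabled. This is not part of the original answer, just a minimal sketch of that variant, assuming the remaining pipeline components are enough for French lemmatization:

import spacy

# Load without the parser and NER so long texts need far less memory
nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])
nlp.max_length = 1030000  # or higher

with open("text.txt") as f:
    text = ''.join(ch for ch in f.read() if ch.isalnum() or ch == " ")

words = sorted({token.lemma_ for token in nlp(text)})  # unique lemmas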
Hoffert answered 26/12, 2019 at 20:47 Comment(0)

I faced the same issue. I had to loop over a directory of text files and run NER on each one to extract the entities present in them.

for file in folder_text_files:
    with open(file, 'r', errors="ignore") as f:
        text = f.read()  # the with block closes the file; no f.close() needed
    nlp.max_length = len(text) + 100  # set the limit before calling nlp()
    doc = nlp(text)

Deriving nlp.max_length from the actual text length means you never have to worry about how big each file is.
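For illustration, here is a sketch of the full loop; the glob-based file discovery and the entity collection are my additions, not from the original post:

import glob
import spacy

nlp = spacy.load('fr_core_news_md')
entities = []
for file in glob.glob("texts/*.txt"):
    with open(file, 'r', errors="ignore") as f:
        text = f.read()
    nlp.max_length = len(text) + 100  # always at least the text length
    doc = nlp(text)
    # collect (surface form, label) pairs from the NER component
    entities.extend((ent.text, ent.label_) for ent in doc.ents)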

Reticulation answered 9/3, 2021 at 7:41 Comment(2)
Any idea if a generator would solve this type of issue? – Elianore
@Elianore Can you please elaborate on this? – Reticulation

It looks like the nlp.max_length = 1027203 line in your second example did not take effect.

Alternatively, if your text file has multiple lines, you can create a doc for each line in the file. Something like the following:

with open("text.txt") as f:
    for line in f.read().split('\n'):
        doc = nlp(''.join(ch for ch in line if ch.isalnum() or ch == " "))
        ...
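Putting this together with the original goal of collecting unique lemmas, one possible sketch; the set-based deduplication is my choice, not part of the answer:

import spacy

nlp = spacy.load('fr_core_news_md')
lemmas = set()
with open("text.txt") as f:
    for line in f:  # iterate line by line instead of reading everything at once
        doc = nlp(''.join(ch for ch in line if ch.isalnum() or ch == " "))
        lemmas.update(token.lemma_ for token in doc)

with open("corpus.txt", 'w') as out:
    out.write("Number of words:" + str(len(lemmas)) + "\n")
    out.write(''.join(w + "\n" for w in sorted(lemmas)))

Each per-line doc stays far below the default limit, so nlp.max_length never needs to be raised.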
Muscarine answered 26/12, 2019 at 20:26 Comment(0)