Batch running spaCy nlp() pipeline for large documents
I am trying to run the nlp() pipeline on a series of transcripts amounting to 20,211,676 characters. I am running on a machine with 8 GB of RAM. I'm very new to both Python and spaCy, but the corpus comparison tools and sentence chunking features are perfect for the paper I'm working on now.

What I've tried

I've begun by loading the English pipeline and disabling 'ner' for faster processing:

import spacy

nlp = spacy.load('en_core_web_lg', disable=['ner'])

Then I break the corpus into pieces of 800,000 characters, since spaCy warns that the parser needs roughly 1 GB of temporary memory per 100,000 characters of input:

split_text = [text[i:i+800000] for i in range(0, len(text), 800000)]

Then I loop the pieces through the pipeline and collect the resulting Doc objects in a list:

nlp_text = []
for piece in split_text:
    doc = nlp(piece)  # each 800,000-character piece becomes its own Doc
    nlp_text.append(doc)

This works after a long wait. Note: I have tried raising the threshold via nlp.max_length, but anything above 1,200,000 characters crashes my Python session.
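
For what it's worth, the same loop can apparently also be written with nlp.pipe, which streams an iterable of texts through the pipeline in batches. I have not measured whether it is actually lighter on memory here, and the batch_size value is just a guess:

# Stream the pieces through the pipeline in batches instead of one call per piece
nlp_text = list(nlp.pipe(split_text, batch_size=2))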

Now that everything has been piped through, I need to concatenate it back together, since I will eventually need to compare the whole document to another one (of roughly equal size). I would also like to find the most frequent noun phrases in the document as a whole, not just in artificial 800,000-character pieces.

nlp_text = ''.join(nlp_text)

However, I get the following error message:

TypeError: sequence item 0: expected str instance, spacy.tokens.doc.Doc found

I realize that I could convert everything to strings and concatenate those, but that would defeat the purpose of having Token objects to work with.

What I need

Is there anything I can do (apart from paying for expensive AWS compute time) to split my documents, run the nlp() pipeline, and then join the tokens to reconstruct my complete document as an object of study? Am I running the pipeline wrong for a big document? Am I doomed to finding 64 GB of RAM somewhere?

Edit 1: Response to Ongenz

(1) Here is the error message I receive

ValueError: [E088] Text of length 1071747 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

I could not find a part of the documentation that refers to this directly.

(2) My goal is to compute a series of measures, including (but not limited to, if the need arises): word frequency, tf-idf counts, sentence counts, the most frequent noun chunks, and comparing two corpora using w2v or d2v strategies. My understanding is that I need every part of the spaCy pipeline apart from NER for this.

(3) You are completely right about how I am cutting the document; in a perfect world I would cut on a line break instead (see the sketch below). But as you mentioned, I cannot use join to regroup my broken-apart corpus, so it might not be relevant anyway.
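
For reference, here is a rough sketch of the kind of line-break splitting I have in mind, reusing the same (arbitrary) 800,000-character budget as above; I have not verified that this is the right approach:

# Pack whole lines into pieces that stay under a character budget,
# instead of cutting mid-word at fixed offsets
split_text = []
current, current_len = [], 0
for line in text.splitlines(keepends=True):
    if current and current_len + len(line) > 800000:
        split_text.append(''.join(current))
        current, current_len = [], 0
    current.append(line)
    current_len += len(line)
if current:
    split_text.append(''.join(current))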

Decorative answered 20/9, 2018 at 16:54 Comment(3)
Hi, what processing do you want to do? Just sentence tokenisation? Word tokenisation? Or do you want part-of-speech tagging and dependency parsing as well? Could you post the relevant section in the spaCy docs that says "spaCy recommends 100,000 characters per gb" please? Also, the join function (doc) takes an iterable (e.g. a string or list of string objects) as argument and you are passing it a list of spaCy Doc objects. What you do here will depend on the processing you want to do (I suspect you will not want to use join).Theodore
And splitting your document on set-length character sequences will risk splitting words in two - does that matter to you? If your text is indeed too big you may be able to preprocess it into smaller files first using a less "brutal" method (e.g. cutting on line breaks instead of characters).Theodore
Thank you for your answer ongenz: I edited the post to reflect your questions.Decorative

You need to join the resulting Docs using the Doc.from_docs method (available since spaCy v3.0):

from spacy.tokens import Doc

docs = []
for piece in split_text:
    doc = nlp(piece)
    docs.append(doc)
merged = Doc.from_docs(docs)

See the documentation here for more details.
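
Once the pieces are merged you can work with the combined Doc directly. A minimal sketch of counting the most frequent noun chunks (assuming the tagger and parser were left enabled so that noun_chunks is available):

from collections import Counter

# Count noun chunks across the whole merged document
chunk_counts = Counter(chunk.text.lower() for chunk in merged.noun_chunks)
print(chunk_counts.most_common(10))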

Rostock answered 8/6, 2022 at 15:7 Comment(0)
