I am trying to run the nlp() pipeline on a series of transcripts amounting to 20,211,676 characters. I am running on a machine with 8 GB of RAM. I'm very new to both Python and spaCy, but the corpus comparison tools and sentence chunking features are perfect for the paper I'm working on now.
What I've tried
I begin by loading the large English pipeline and disabling 'ner' for speed:
import spacy

nlp = spacy.load('en_core_web_lg', disable=['ner'])
Then I break the corpus up into pieces of 800,000 characters, since spaCy's error message indicates roughly 1 GB of temporary memory per 100,000 characters:
split_text = [text[i:i+800000] for i in range(0, len(text), 800000)]
Then I loop the pieces through the pipeline and build a list of Doc objects:
nlp_text = []
for piece in split_text:
    piece = nlp(piece)
    nlp_text.append(piece)
This works, after a long wait. Note: I have tried raising the limit via nlp.max_length, but anything above 1,200,000 characters crashes my Python session.
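From what I can tell in the documentation, nlp.pipe streams texts through the pipeline in batches instead of calling nlp() once per piece, so perhaps something like this rough sketch would be a better fit (the batch_size value is just a guess on my part):

# Stream the 800,000-character pieces through the pipeline in batches;
# batch_size=2 is only a guess, not a tuned value
nlp_text = list(nlp.pipe(split_text, batch_size=2))

I am not sure whether this actually lowers peak memory use or only the per-call overhead.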
Now that I have everything piped through, I need to concatenate everything back together, since I will eventually need to compare the whole document to another (of roughly equal size). I would also be interested in finding the most frequent noun phrases in the document as a whole, not just within the artificial 800,000-character pieces.
nlp_text = ''.join(nlp_text)
However I get the error message:
TypeError: sequence item 0: expected str instance, spacy.tokens.doc.Doc found
I realize that I could convert everything back to strings and concatenate those, but that would defeat the purpose of having Token objects to work with.
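For the noun-phrase counts specifically, I wonder whether I can skip re-joining entirely and just aggregate counts across the separate pieces, something like this rough sketch (I assume any chunk that straddles an 800,000-character cut point would be lost):

from collections import Counter

# Tally noun chunks piece by piece instead of in one giant Doc
chunk_freq = Counter()
for doc in nlp_text:
    chunk_freq.update(chunk.text.lower() for chunk in doc.noun_chunks)

print(chunk_freq.most_common(20))

Would counting this way over the pieces be equivalent to counting over the whole document, apart from the chunks lost at the cut points?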
What I need
Is there anything I can do (apart from paying for expensive AWS compute time) to split my documents, run the nlp() pipeline, and then join the tokens to reconstruct my complete document as an object of study? Am I running the pipeline wrong for a big document? Am I doomed to getting 64 GB of RAM somewhere?
Edit 1: Response to Ongenz
(1) Here is the error message I receive
ValueError: [E088] Text of length 1071747 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
I could not find a part of the documentation that refers to this directly.
(2) My goal is to compute a series of measures including (but not limited to, if the need arises): word frequency, tf-idf counts, sentence counts, the most frequent noun chunks, and a comparison of the two corpora using w2v or d2v strategies (a rough sketch of the frequency and sentence counting is below). My understanding is that I need every part of the spaCy pipeline apart from NER for this.
(3) You are completely right about how I cut the document; in a perfect world I would cut on a line break instead. But, as you mentioned, I cannot use join to regroup my broken-apart corpus, so it might not be relevant anyway.
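For reference, this is the kind of per-piece aggregation I had in mind for the word frequencies and sentence counts in (2), again only a rough sketch:

from collections import Counter

# Aggregate token frequencies and sentence counts over the list of Doc pieces
word_freq = Counter()
sentence_count = 0
for doc in nlp_text:
    word_freq.update(token.text.lower() for token in doc if token.is_alpha)
    sentence_count += sum(1 for _ in doc.sents)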
The join function takes an iterable (e.g. a string or a list of string objects) as its argument, and you are passing it a list of spaCy Doc objects. What you do here will depend on the processing you want to do (I suspect you will not want to use join). – Theodore