I'm using spaCy in Python to lemmatize text documents. There are 500,000 documents, each up to 20 MB of clean text.
The problem is the following: spaCy's memory consumption grows over time until all available memory is used.
My hardware configuration:
CPU: Intel i7-8700K 3.7 GHz (12 logical cores)
Memory: 16 GB
SSD: 1 TB
GPU: onboard, but not used for this task
I'm using "multiprocessing" to split the task among several processes (workers). Each worker receives a list of documents to process. The main process monitors the child processes. I initialize spaCy once in each child process and use that single spaCy instance to handle the whole list of documents assigned to the worker.
The memory trace shows the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
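A single snapshot like the one above shows which lines hold the most memory, not which ones keep growing. Comparing two snapshots taken at different times may help separate the two. The following is only a minimal sketch, not the code that produced the trace above: `top_growth` and the stand-in workload are hypothetical names introduced for illustration, while `Snapshot.compare_to` is part of the standard `tracemalloc` module.

```python
import tracemalloc


def top_growth(before, after, limit=10):
    """Return up to `limit` source lines whose traced allocations grew."""
    # compare_to() returns StatisticDiff entries sorted by the absolute
    # size difference, largest first.
    stats = after.compare_to(before, "lineno")
    # Keep only the lines that hold more memory than before.
    return [stat for stat in stats if stat.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()

    # Stand-in workload (an assumption): in the real script this would be
    # the lemmatization of a batch of documents.
    retained = [str(i) * 10 for i in range(100_000)]

    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
```

Taking the second snapshot after every N documents inside a worker, rather than once at the end in the main process, would show growth in the process that actually runs spaCy.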
I have seen a good recommendation to build a separate server-client solution here: "Is possible to keep spacy in memory to reduce the load time?"
Is it possible to keep memory consumption under control using the "multiprocessing" approach?
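One option that stays within "multiprocessing" is to let a pool recycle its workers, so that whatever memory a worker has accumulated is released when that process exits. The following is a minimal sketch under stated assumptions, and whether it removes the growth in this particular case is untested. `chunked`, `lemmatize_batch`, `run` and all parameter values are hypothetical; `maxtasksperchild` is a standard argument of `multiprocessing.Pool`.

```python
from multiprocessing import Pool


def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def lemmatize_batch(filenames):
    """Hypothetical worker task.

    Placeholder body: the real task would read each file, run spaCy on it
    and write the lemmas out. Here it only reports the batch size.
    """
    return len(filenames)


def run(file_total_list, workers=4, batch_size=100, batches_per_worker=5):
    # maxtasksperchild makes the pool replace a worker process after it has
    # completed that many tasks (here, batches).
    with Pool(processes=workers, maxtasksperchild=batches_per_worker) as pool:
        done = 0
        batches = chunked(file_total_list, batch_size)
        for count in pool.imap_unordered(lemmatize_batch, batches):
            done += count
    return done


if __name__ == "__main__":
    files = ["{}.txt".format(i) for i in range(1000)]
    print(run(files))
```

The trade-off is that every replacement worker has to obtain a loaded spaCy model again, so a smaller `batches_per_worker` bounds memory more tightly at the cost of more start-up work.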
Here is a simplified version of my code:
import os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep

# START: memory trace
tracemalloc.start()

# Load spacy
spacyMorph = spacy.load("en_core_web_sm")

# Get word's lemma
def getLemma(word):
    global spacyMorph
    # Runs the whole pipeline on a single word and returns the resulting Doc
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput

# Worker's logic
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send to the main process the worker's current progress
        if lock is not None:
            statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
            with lock:
                conn.send(statusMessage)
            documentCount += 1

        # ----------------
        # Some code is excluded for the sake of clarity
        # I've got a "wordList" from file "filenameRaw"
        # ----------------
        wordCount = 1
        wordTotalCount = len(wordList)

        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save them to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word

if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for the sake of clarity
    # Here I'm getting the full list of files "fileTotalList" which I need to lemmatize
    # ----------------
    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create workers and populate them with lists of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)  # worker total progress
        processData['count'] = 0  # worker documents done count
        processData['currentDocID'] = 0  # current document ID the worker is working on
        processData['comment'] = ''  # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName, args=(lock, processData['con_child'], [processName, fileList]))

        # Register the worker and start it
        processList.append(processData)
        processData['handler'].start()

        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        # Worker communication format:
        # - WORKING - worker is working
        # - CLOSED - worker has finished its job and closed the pipe connection
        # - for WORKING status:
        #   DOCID - current document ID the worker is working on
        #   COUNT - count of done documents
        #   COMMENTS - additional comments (optional)

        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # ... and check if there is something in the pipe
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]

                        # ----------------
                        # Some code is excluded for the sake of clarity
                        # Update worker's information and progress in "processList"
                        # ----------------
                    except EOFError:
                        pass

                # ----------------
                # Some code is excluded for the sake of clarity
                # Here I draw some progress lines per worker
                # ----------------
            else:
                # The worker has finished its job. Close the connection.
                process['con_parent'].close()

        # Stop monitoring once no worker is alive
        if runningCount == 0:
            break

        # Wait for some time and monitor again (the interval is a placeholder)
        sleep(1)

    print("**** DONE ! ****")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)

could be. In general, I would expect that you would find some useful info in my repo spacy-extreme, which deals with memory issues when using spaCy. – Eatmon