I am using gunicorn with multiple workers for my machine learning project. The problem is that when I send a training request, only the worker that handles that request gets updated with the latest model once training is done. It is worth mentioning that, to keep inference fast, I load the model into memory once after each training. That is why only the worker used for the current training operation loads the latest model, while the other workers keep the previously loaded one. Right now the model file (binary format) is loaded once after each training into a global dictionary variable, where the key is the model name and the value is the model object. Obviously this problem would not occur if I loaded the model from disk for every prediction, but I cannot do that, as it would make prediction slower.
I read up on global variables, and further investigation shows that in a multi-processing environment every worker (process) gets its own copy of each global variable. Apart from the binary model file, I also have some other global variables (dictionaries) that need to be kept in sync across all processes. So how do I handle this situation?
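To make the problem concrete, here is a small self-contained illustration (my own sketch, not code from my project) of why a module-level dictionary updated in one process is invisible to the others; gunicorn workers behave the same way because they are separate processes:

import multiprocessing

models = {"fasttext": "old-model"}   # module-level "global" state

def train_and_update():
    # This mutation only affects this child process's own copy of the dict.
    models["fasttext"] = "new-model"
    print("child sees:", models)          # {'fasttext': 'new-model'}

if __name__ == "__main__":
    p = multiprocessing.Process(target=train_and_update)
    p.start()
    p.join()
    print("parent still sees:", models)   # {'fasttext': 'old-model'}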
TL;DR: I need some approach that lets me store a variable which is shared across all the processes (workers). Is there any way to do this, e.g. with multiprocessing.Manager, dill, etc.?
Update 1: My project uses multiple machine learning algorithms, each with its own model file; the models are loaded into memory in a dictionary where the key is the model name and the value is the corresponding model object. I need to share all of them (in other words, I need to share the dictionary). But some of the models are not pickle-serializable, like FastText. So when I try to use a proxy variable (in my case a dictionary to hold the models) with multiprocessing.Manager, I get an error for those non-pickle-serializable objects while assigning the loaded model to this dictionary: can't pickle fasttext_pybind.fasttext objects. More information on multiprocessing.Manager proxies can be found here: Proxy Objects
Following is a summary of what I have done:
import multiprocessing
import fasttext
mgr = multiprocessing.Manager()
model_dict = mgr.dict()
model_file = fasttext.load_model("path/to/model/file/which/is/in/.bin/format")
model_dict["fasttext"] = model_file # This line throws this error
Error:
can't pickle fasttext_pybind.fasttext objects
I printed the model_file object that I am trying to assign, and it is:
<fasttext.FastText._FastText object at 0x7f86e2b682e8>
Update 2: According to this answer I modified my code a little bit:
import fasttext
from multiprocessing.managers import SyncManager
def Manager():
    m = SyncManager()
    m.start()
    return m

# The loaded model has the type "<fasttext.FastText._FastText object at 0x7f86e2b682e8>",
# so register fasttext.FastText._FastText as the class.
SyncManager.register("fast", fasttext.FastText._FastText)

# Now this is the Manager, as a replacement of the old one.
mgr = Manager()
ft = mgr.fast()  # This line gives the error.
This gives me an EOFError.
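A possible way around this, sketched here under my own assumptions rather than taken from a working setup, is to keep the FastText object inside the manager's server process and expose only a predict() call through a proxy, so the model object itself is never pickled; the FastTextServer wrapper and the model path below are illustrative names:

from multiprocessing.managers import BaseManager
import fasttext

class FastTextServer:
    """Holds the model in the manager's server process; workers call predict() via a proxy."""
    def __init__(self, model_path):
        self._model = fasttext.load_model(model_path)

    def predict(self, text, k=1):
        # Only the picklable arguments and return values cross the process boundary.
        return self._model.predict(text, k=k)

class ModelManager(BaseManager):
    pass

# Register a callable; the object it builds lives in the manager process.
ModelManager.register("FastTextServer", FastTextServer)

if __name__ == "__main__":
    mgr = ModelManager()
    mgr.start()
    ft = mgr.FastTextServer("path/to/model/file/which/is/in/.bin/format")
    print(ft.predict("some text to classify"))

With gunicorn this would presumably require starting the manager before the workers are forked (or having each worker connect to it by address and authkey), which I have not verified.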
Update 3: I tried using dill with both multiprocessing and multiprocess. The changes are summarized below:
import multiprocessing
import multiprocess
import dill
# Any one of the following two lines
mgr = multiprocessing.Manager() # Or,
mgr = multiprocess.Manager()
model_dict = mgr.dict()
... ... ...
... ... ...
model_file = dill.dumps(model_file) # This line throws the error
model_dict["fasttext"] = model_file
... ... ...
... ... ...
# During loading
model_file = dill.loads(model_dict["fasttext"])
But I still get the error: can't pickle fasttext_pybind.fasttext objects.
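One idea that might get past this, sketched under my own assumptions (the temp-file round trip and helper names are illustrative, and I have not verified it against my versions), is to register a custom reducer with copyreg so that the standard pickler, which dill and multiprocessing build on, serializes the model by round-tripping it through its own .bin format:

import copyreg
import os
import pickle
import tempfile

import fasttext

def _rebuild_fasttext(raw_bytes):
    # Write the .bin payload to a temporary file and let fastText load it back.
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
        tmp.write(raw_bytes)
        path = tmp.name
    try:
        return fasttext.load_model(path)
    finally:
        os.remove(path)

def _reduce_fasttext(model):
    # Serialize by saving the model to disk and capturing the file's bytes.
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
        path = tmp.name
    try:
        model.save_model(path)
        with open(path, "rb") as f:
            raw_bytes = f.read()
    finally:
        os.remove(path)
    return _rebuild_fasttext, (raw_bytes,)

# Teach the pickle machinery how to handle the concrete model class.
copyreg.pickle(fasttext.FastText._FastText, _reduce_fasttext)

model_file = fasttext.load_model("path/to/model/file/which/is/in/.bin/format")
payload = pickle.dumps(model_file)   # goes through _reduce_fasttext now
restored = pickle.loads(payload)     # rebuilt via _rebuild_fasttext

If this works, dill.dumps and the Manager dictionary assignment should take the same path, since both build on the standard pickler and consult copyreg (again an assumption); the trade-off is that every process that unpickles the payload still ends up with its own ~250 MB copy of the model in memory.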
Update 4: This time I am using another library, jsonpickle. Serialization and de-serialization seem to work properly (no issue is reported while running). But surprisingly, whenever I make a prediction after de-serialization, the process hits a segmentation fault. More details and the steps to reproduce can be found here: Segmentation fault (core dumped)
Update 5: I also tried cloudpickle and srsly, but couldn't get the program working.
Comments:
multiprocessing.Manager is probably your best bet, but there are plenty of other options, as described in the docs. – Jacob
I am using a Fasttext supervised model (Fasttext from Facebook), which is not pickle-serializable. But to be shared across processes the object needs to be serializable (docs.python.org/3/library/multiprocessing.html#proxy-objects), so I ended up getting the error: can't pickle fasttext_pybind.fasttext objects. – Leger
dill, multiprocess, klepto, and ppft: one of these may help. If you can serialize the object with dill, then you can pass it between processes. If it's a big object, then you can try using a multiprocess.Manager to share one object across multiple cores, or if the computation is lighter you might try multiprocess.dummy for threading. ppft is like multiprocess, but uses dill.source to extract source code instead of serialization. klepto can share objects between processes through database-like objects. First try dill.dumps to see if it pickles. – Diglot
model_file = dill.dumps(model_file) also gives the same error: can't pickle fasttext_pybind.fasttext objects. The model size is typically around 250 MB. – Leger
I also tried multiprocess instead of multiprocessing along with it, but I get the same error. I am totally clueless. – Leger
multiprocess won't work if dill can't pickle the object, so klepto is probably also not going to work. I think the answer is that the object is not serializable, and you can't do what you want to do. – Diglot
Unless there is a version of Fasttext which provides a __reduce__ method so the state can be stored. Or, with that knowledge, you can register a new method in the pickle registry in dill, thus teaching dill how to serialize a Fasttext object. – Diglot
In the end I used redis pub-sub. But I didn't share the model file via pub-sub; I only shared a message telling the other workers to load the model from disk. I had to accept that each worker loads a separate copy of the model, and this strategy consumes more memory. – Leger
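For reference, a minimal sketch of the notify-and-reload strategy described in the last comment, assuming Redis and redis-py are available; the channel name, paths, and threading setup are illustrative, not the original implementation:

import json
import threading

import fasttext
import redis

MODEL_PATHS = {"fasttext": "path/to/model/file/which/is/in/.bin/format"}
models = {}                      # this worker's own copy of the loaded models
CHANNEL = "model-updates"        # illustrative channel name

def load_model(name):
    models[name] = fasttext.load_model(MODEL_PATHS[name])

def publish_update(name):
    # Called by the worker that finished training: tell every worker to reload.
    redis.Redis().publish(CHANNEL, json.dumps({"model": name}))

def listen_for_updates():
    # Each gunicorn worker runs this in a background thread.
    pubsub = redis.Redis().pubsub()
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        if message["type"] == "message":
            name = json.loads(message["data"])["model"]
            load_model(name)     # reload this worker's own copy from disk

threading.Thread(target=listen_for_updates, daemon=True).start()

Predictions keep using the in-memory models dictionary, so inference stays fast; the cost, as noted above, is one copy of each model per worker.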