Using python lime as a udf on spark

I'm looking to use lime's explainer within a udf on pyspark. I've previously trained the tabular explainer and stored it as a dill file, as suggested in this link.

import dill
import numpy as np

# load the previously trained lime tabular explainer from disk
with open('location_to_explainer', 'rb') as f:
    loaded_explainer = dill.load(f)

def lime_explainer(*cols):
    # collect the incoming column values into a single feature vector
    selected_cols = np.array([value for value in cols])
    # loaded_model is the trained classifier, loaded elsewhere on the driver
    exp = loaded_explainer.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    mapping = exp.as_map()[1]

    return str(mapping)
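
For reference, a function like this would typically be registered and applied as a udf roughly as follows; the dataframe and feature column names below are just placeholders:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# wrap the python function as a udf returning the stringified mapping
lime_udf = F.udf(lime_explainer, StringType())

df = df.withColumn('lime_explanation', lime_udf(*[F.col(c) for c in feature_cols]))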

This, however, takes a lot of time, as it appears much of the computation happens on the driver. I've since been trying to use a spark broadcast to ship the explainer to the executors.

broadcasted_explainer = sc.broadcast(loaded_explainer)

def lime_explainer(*cols):
    selected_cols = np.array([value for value in cols])
    # pull the explainer out of the broadcast variable on the executor
    exp = broadcasted_explainer.value.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    mapping = exp.as_map()[1]

    return str(mapping)

However, I run into a pickling error on broadcast:

PicklingError: Can't pickle <function <lambda> at 0x7f69fd5680d0>: attribute lookup <lambda> on lime.discretize failed

Can anybody help with this? Is there something like dill that we can use instead of the cloudpickle serializer used in spark?

Metaphosphate answered 26/3, 2019 at 8:55 Comment(0)

Looking at this source, it seems like you have no choice but to use the pickler provided. As such, I can only suggest that you nest dill inside of the default pickler. Not ideal, but it could work. Try something like:

broadcasted_explainer = dill.loads(sc.broadcast(dill.dumps(loaded_explainer)).value)
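
If cloudpickle still chokes when the udf closes over the re-loaded object, a variation on the same idea is to broadcast only the dill bytes and undill inside the udf on each executor. A rough, untested sketch reusing the names from the question:

broadcasted_bytes = sc.broadcast(dill.dumps(loaded_explainer))

def lime_explainer(*cols):
    # rebuild the explainer on the executor from the broadcast dill bytes
    # (caching this per worker would avoid re-loading on every row)
    explainer = dill.loads(broadcasted_bytes.value)
    selected_cols = np.array([value for value in cols])
    exp = explainer.explain_instance(selected_cols, loaded_model.predict_proba, num_features=10)
    return str(exp.as_map()[1])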

Or you might try calling dill's extend() method, which is supposed to add dill's datatypes to the default pickle package dispatch. No idea if that will work, but you can try it!
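
In code, that call is just the following (whether it actually helps here depends on whether pyspark routes through the pure-python pickler):

import dill

dill.extend(use_dill=True)   # register dill's types with the stock pickle module
# ... perform the broadcast here ...
dill.extend(use_dill=False)  # restore the stock pickler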

Threesome answered 3/4, 2019 at 22:44 Comment(0)

I'm the dill author. I agree with @Majaha, and will extend @Majaha's answer slightly. In the first link in @Majaha's answer, it's clearly pointed out that a Broadcast instance is hardwired to use pickle... so the suggestion to dill to a string, then undill afterward is a good one.

Unfortunately, the extend method probably won't work for you. In the Broadcast class, the source uses cPickle, which dill cannot extend. If you look at the source, it uses import cPickle as pickle; ... pickle.dumps for python 2, and import pickle; ... pickle.dumps for python 3. Had it used import pickle; ... pickle.dumps for python 2, and import pickle; ... pickle._dumps for python 3, then dill could extend the pickler by just doing an import dill. For example:

Python 3.6.6 (default, Jun 28 2018, 05:53:46) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pickle import _dumps
>>> import dill
>>> _dumps(lambda x:x)
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x01KCC\x04|\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x07\x00\x00\x00<stdin>q\tX\x08\x00\x00\x00<lambda>q\nK\x01C\x00q\x0b))tq\x0cRq\rc__main__\n__dict__\nh\nNN}q\x0etq\x0fRq\x10.'

You could, thus, either do what @Majaha suggests (and bookend the call to broadcast) or you could patch the code to make the replacement that I outline above (where needed, but eh...), or you may be able to make your own derived class that does the job using dill:

>>> from pyspark.broadcast import Broadcast as _Broadcast
>>>
>>> class Broadcast(_Broadcast):
...   def dump(self, value, f):
...     try:
...       import dill
...       dill.dump(value, f, pickle_protocol)
...     ...[INSERT THE REST OF THE DUMP METHOD HERE]...

If the above fails... you could still get it to work by pinpointing where the serialization failure occurs (there's dill.detect.trace to help you with that).
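
For example, something along these lines (the trace API has moved around between dill versions, so treat this as a sketch of the older on/off switch):

import dill

dill.detect.trace(True)       # print each object visited while pickling
dill.dumps(loaded_explainer)  # the last entries before a failure point at the culprit
dill.detect.trace(False)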

If you are going to suggest that pyspark use dill... a potentially better suggestion is to allow users to dynamically replace the serializer. This is what mpi4py and a few other packages do.
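
For illustration, the mpi4py hook looks roughly like this (a sketch of mpi4py's serializer replacement, not anything pyspark currently exposes):

import dill
from mpi4py import MPI

# swap dill in for the default pickle used to serialize MPI messages
MPI.pickle.__init__(dill.dumps, dill.loads)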

Lamelliform answered 7/4, 2019 at 17:11 Comment(0)

What's your location_to_explainer data schema? Maybe it would be better to transform it into a spark dataframe.

According to the dill description:

dill can be used to store python objects to a file, but the primary usage is to send python objects across the network as a byte stream. dill is quite flexible, and allows arbitrary user defined classes and functions to be serialized. Thus dill is not intended to be secure against erroneously or maliciously constructed data. It is left to the user to decide whether the data they unpickle is from a trustworthy source.

And from When Not To Use pickle:

If you want to use data across different programming languages, pickle is not recommended. Its protocol is specific to Python, thus, cross-language compatibility is not guaranteed. The same holds for different versions of Python itself. Unpickling a file that was pickled in a different version of Python may not always work properly, so you have to make sure that you're using the same version and perform an update if necessary. You should also try not to unpickle data from an untrusted source. Malicious code inside the file might be executed upon unpickling.

According to this discussion, you can try pysparkling:

I don't think this is a dill issue, as I don't think your code is using dill. So, as far as I know, pyspark uses pickle or cloudpickle and not dill. However, if you do want to use dill with pyspark, there is pysparkling (https://pypi.python.org/pypi/pysparkling)... and using it may clear up your serialization issue. What I suggest is that you open a ticket with pyspark or try pysparkling and if it fails, open a ticket there -- and CC me or refer to this issue so I can follow the thread. I'm going to close this... so if I'm incorrect and you are using dill, please feel free to reopen this issue.
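
A minimal sketch of the pysparkling route (it runs the RDD API in-process in plain python, so there is no cloudpickle step to fight; parallelize/map/collect below are pysparkling's mirror of the spark API):

from pysparkling import Context

sc = Context()  # pure-python, in-process stand-in for SparkContext

# feature_rows would be an iterable of feature tuples from your data
explanations = (sc.parallelize(feature_rows)
                  .map(lambda row: lime_explainer(*row))
                  .collect())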

Read more: Reading pyspark pickles locally

Impartial answered 3/4, 2019 at 23:40 Comment(0)
