BeautifulSoup Object Will Not Pickle, Causes Interpreter to Silently Crash
I have a soup from BeautifulSoup that I cannot pickle. When I try to pickle the object, the Python interpreter silently crashes (in a way that cannot be handled as an exception). I need to be able to pickle the object in order to return it using the multiprocessing package (which pickles objects to pass them between processes). How can I troubleshoot or work around the problem?

Unfortunately, I cannot post the HTML for the page (it is not publicly available), and I have been unable to find a reproducible example of the problem. I have tried to isolate the problem by looping over the soup and pickling individual components; the smallest thing that produces the error is <class 'BeautifulSoup.NavigableString'>. When I print the object it prints out u'\n'.
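The isolation loop described above can be sketched as follows (a minimal stand-alone version; since the original HTML is not available, a plain list with a lambda stands in for the soup contents). Printing before each attempt means that, even on a hard interpreter crash, the last line printed identifies the offending node:

```python
import pickle

def find_unpicklable(nodes):
    """Attempt to pickle each node, printing progress as we go.

    Catchable failures are recorded; on a hard interpreter crash,
    the last printed index identifies the culprit.
    """
    failures = []
    for i, node in enumerate(nodes):
        print("trying node", i, type(node).__name__)
        try:
            pickle.dumps(node)
        except Exception as exc:
            failures.append((i, exc))
    return failures

# Stand-in data: the lambda is unpicklable, as NavigableString is here
bad = find_unpicklable(["text", 42, lambda: None])
print("failures at indices:", [i for i, _ in bad])
```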

Debus answered 3/7, 2014 at 20:53 Comment(4)
Unfortunately, aside from casting the NavigableString to a unicode or str, there's nothing you can do here (short of patching BeautifulSoup itself)Saunder
@Saunder Is this a known issue with BeautifulSoup?Debus
Yup. NavigableString is not pickle-able. It should implement unicode but it fails somehow.Saunder
How do I require all objects created by BeautifulSoup to be converted to unicode prior to pickling and returned to their original type after pickling, keeping in mind I am doing this within the multiprocessing package?Debus
The class NavigableString is not serializable with pickle or cPickle, which multiprocessing uses. You should, however, be able to serialize it with dill. dill provides a superset of the pickle interface and can serialize most of Python. multiprocessing will still fail unless you use a fork of multiprocessing built on dill, called pathos.multiprocessing.
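A stand-alone illustration of the failure mode, using a locally defined class as a stand-in for NavigableString (the crash itself depends on the original HTML): pickle serializes classes by reference (import path), so instances of classes it cannot look up fail, while dill can serialize such classes by value.

```python
import pickle

def make_node():
    # A class defined inside a function has no importable path, so
    # pickle (which stores classes by reference) cannot serialize it.
    class Node(str):
        pass
    return Node("\n")

node = make_node()
try:
    pickle.dumps(node)
    print("pickle succeeded")
except Exception as exc:
    print("pickle failed:", type(exc).__name__)

# With dill installed (pip install dill), the same object serializes,
# because dill can store the class itself by value:
#   import dill
#   data = dill.dumps(node)
#   assert dill.loads(data) == "\n"
```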

Get the code here: https://github.com/uqfoundation.


For more information see: What can multiprocessing and dill do together?

http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/

http://nbviewer.ipython.org/gist/minrk/5241793

Bohn answered 3/7, 2014 at 21:37 Comment(5)
Thanks for the info. It would be great if this were installable via pip. When I try to install it I get the error message: "Could not find a version that satisfies the requirement pathos (from versions: 0.1a1)"Debus
I'm aware that's an issue with the old released versions. However, you can install the older versions with pip if you use the prerelease flag. Then you can grab the code from github and it installs pretty easily. The next version will be pip-installable.Bohn
I installed pathos and tried to use its pool function in place of the base multiprocessing package. However, I am encountering the same issue as I describe at the link below. The object that causes it to hang on return can be pickled using dill, but it may be too big for multiprocessing queues. Any suggestions for making large objects work with pathos? #24537879Debus
Detailed description in new question: #24620142Debus
You may be able to get away with compression or with shared memory. Shared memory with ctypes through multiprocessing might work, if you have access to how the map is called. Otherwise, dill has some compression options that are currently "turned off". If your large data could go into a numpy array (…?), then there might be a route that way too. Hard to tell without seeing what your data looks like. Also, use the latest pathos (from github), and ProcessingPool as opposed to Pool.Bohn
If you do not need the BeautifulSoup object itself, but some product of the soup (e.g. a text string), you can remove BeautifulSoup attributes from your larger object before pickling by adding the following code to your class definition:

class MyObject(MyObject):

    def __getstate__(self):
        for item in dir(self):
            item_type = str(type(getattr(self, item)))
            if 'BeautifulSoup' in item_type:
                delattr(self, item)

        return self.__dict__
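As a stand-alone sketch of the same pattern (a generator stands in for the unpicklable soup attribute, and the attribute is dropped from a copy of __dict__ rather than deleted from the live object):

```python
import pickle

class PageResult:
    def __init__(self):
        self.url = "http://example.com"
        self.soup = (line for line in [])  # generators don't pickle, like a soup

    def __getstate__(self):
        # Return a copy of the instance state without the offending
        # attribute, leaving the live object untouched.
        state = dict(self.__dict__)
        state.pop("soup", None)
        return state

restored = pickle.loads(pickle.dumps(PageResult()))
print(restored.url)               # the plain attributes survive
print(hasattr(restored, "soup"))  # False: the soup was dropped
```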
Debus answered 9/7, 2014 at 22:2 Comment(1)
sure… that makes sense. Essentially, use __getstate__ to pick out what state you want to save.Bohn
In fact, as suggested by dekomote, you only have to take advantage of the fact that you can always convert a soup to a unicode string, and then convert that unicode string back into a soup.

So IMHO you should not try to pass soup objects through the multiprocessing package, but simply the strings representing the soups.
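A sketch of that pattern (the BeautifulSoup calls are shown only as comments, since the library is not assumed to be installed here; a plain string transform stands in for the parse-and-extract step):

```python
import pickle

def worker(html):
    # In the real code this would parse and re-serialize the soup:
    #   soup = BeautifulSoup(html)
    #   return unicode(soup)   # BS3 / Python 2 spelling; str(soup) in bs4
    return html.strip()        # stand-in: return a plain string

# multiprocessing pickles worker results to pass them between processes;
# plain strings always round-trip cleanly.
result = worker("  <p>hello</p>\n")
assert pickle.loads(pickle.dumps(result)) == result
print(result)
```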

Rowlock answered 3/7, 2014 at 21:13 Comment(2)
"you should not try to pass soup object through the multiprocessing package, but simply the strings representing the soups." This should occur automatically though because multiprocessing passes objects by pickling them and the soup should pass its string representation to pickle.Debus
I just did some tests and pickle.dump is about 20% faster than prettify, while pickle.load is about 35% faster than BeautifulSoup(html) (n=1 with a 300kb file, so just to get an idea). Meanwhile, pickle doesn't work for large soups due to a (hard) recursion limit, so I think this small performance hit is worth it in most cases.Erickson