I have a soup from BeautifulSoup that I cannot pickle. When I try to pickle the object, the Python interpreter silently crashes (so the failure cannot be handled as an exception). I need to pickle the object in order to return it using the multiprocessing package (which pickles objects to pass them between processes). How can I troubleshoot or work around the problem? Unfortunately, I cannot post the HTML for the page (it is not publicly available), and I have been unable to find a reproducible example of the problem. I have tried to isolate the problem by looping over the soup and pickling individual components; the smallest thing that produces the error is <class 'BeautifulSoup.NavigableString'>. When I print the object, it prints u'\n'.
The class NavigableString is not serializable with pickle or cPickle, which multiprocessing uses. You should be able to serialize this class with dill, however. dill has a superset of the pickle interface, and can serialize most of Python. multiprocessing will still fail, unless you use a fork of multiprocessing which uses dill, called pathos.multiprocessing.
Get the code here: https://github.com/uqfoundation.
For more information see: What can multiprocessing and dill do together?
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
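As a rough illustration of the difference between the two serializers, the sketch below uses a lambda as a stand-in for an object that the standard pickle module rejects. It does not reproduce the NavigableString crash itself. It assumes Python 3 with dill installed, and relies only on dill.dumps and dill.loads, which mirror the pickle interface.

```python
import pickle

import dill  # third-party; assumed to be installed

# A lambda stands in for "something pickle cannot handle".
square = lambda x: x * x

try:
    pickle.dumps(square)
    pickle_ok = True
except (pickle.PicklingError, AttributeError):
    # pickle stores functions by name, and a lambda has no importable name.
    pickle_ok = False

# dill exposes the same dumps/loads calls as pickle.
restored = dill.loads(dill.dumps(square))

print("pickle succeeded:", pickle_ok)
print("dill round trip:", restored(4))
```

Whether dill handles a particular soup still has to be checked against the real data; the stand-in only shows the drop-in nature of the interface.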
pip if you use the prerelease flag. Then you can grab the code from GitHub and it installs pretty easily. The next version will be pip-installable. – Bohn

ctypes through multiprocessing might work, if you have access to how the map is called. Otherwise, dill has some compression options that are currently "turned off". If your large data could go into a numpy array (…?), then there might be a route that way too. Hard to tell without seeing what your data looks like. Also, use the latest pathos (from GitHub), and ProcessingPool as opposed to Pool. – Bohn

If you do not need the BeautifulSoup object itself, but only some product of the soup, e.g. a text string, you can remove BeautifulSoup attributes from your larger object before pickling by adding the following code to your class definition:
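    class MyObject(MyObject):
        def __getstate__(self):
            # Work on a copy so the live object keeps its attributes.
            state = self.__dict__.copy()
            for name, value in list(state.items()):
                # Drop any attribute whose type comes from BeautifulSoup.
                if 'BeautifulSoup' in str(type(value)):
                    del state[name]
            return state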
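The same idea can be tried out with the standard library alone. In this sketch, the Page class and its attribute names are hypothetical, and a threading lock stands in for the unpicklable soup attribute; the filter matches on the lock type rather than on the string 'BeautifulSoup'. It assumes Python 3.

```python
import pickle
import threading

# Type of the stand-in attribute that pickle refuses to serialize.
LockType = type(threading.Lock())


class Page:
    """Hypothetical container; `lock` stands in for an unpicklable soup."""

    def __init__(self, url, text):
        self.url = url
        self.text = text
        self.lock = threading.Lock()

    def __getstate__(self):
        # Copy the state so the live object keeps every attribute.
        state = self.__dict__.copy()
        for name, value in list(state.items()):
            if isinstance(value, LockType):
                del state[name]
        return state


page = Page("http://example.com", "hello")
clone = pickle.loads(pickle.dumps(page))
print(clone.url, clone.text, hasattr(clone, "lock"))
```

The restored copy carries only the picklable attributes, so anything that needs the dropped attribute has to rebuild it after unpickling.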
__reduce__ to pick out what state you want to save. – Bohn

In fact, as suggested by dekomote, you only have to take advantage of the fact that you can always convert a soup to a unicode string, and then convert the unicode string back to a soup.
So IMHO you should not try to pass soup objects through the multiprocessing package, but simply the strings representing the soups.
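A minimal sketch of that approach follows. It assumes Python 3 with the bs4 package (the question uses the older BeautifulSoup 3 module, where unicode(soup) plays the role of str(soup)); the first_paragraph function and the sample markup are made up for illustration.

```python
from multiprocessing import Pool

from bs4 import BeautifulSoup  # third-party; assumed to be installed


def first_paragraph(html):
    """Rebuild the soup inside the worker and return plain text only."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("p")
    return tag.get_text() if tag is not None else ""


if __name__ == "__main__":
    soup = BeautifulSoup("<html><body><p>hello</p></body></html>", "html.parser")
    pages = [str(soup)]  # send markup between processes, not soup objects
    with Pool(2) as pool:
        print(pool.map(first_paragraph, pages))
```

Only strings cross the process boundary here, so pickle never sees a NavigableString; the cost is that each worker parses the markup again.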
BeautifulSoup? – Debus