python-boilerpipe hangs with multiprocessing

I am trying to run boilerpipe with Python's multiprocessing module in order to parse RSS feeds from multiple sources. The problem is that it hangs in one of the worker processes after processing some links. The whole flow works if I remove the pool and run it in a plain loop.

Here is my multiprocessing code:

from multiprocessing import Pool

proc_pool = Pool(processes=4)
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link, ), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()

This is my boilerpipe code, which is called inside process_link_for_feeds():

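from boilerpipe.extract import Extractor

def parse_using_bp(in_url):
    extracted_html = ""
    # Skip URLs that match the project's own skip pattern.
    if ContentParser.url_skip_p.match(in_url):
        return extracted_html
    try:
        extractor = Extractor(extractor='ArticleExtractor', url=in_url)
        extracted_html = extractor.getHTML()
        del extractor
    except BaseException as e:
        # The %-formatted call prints the same text on Python 2 and Python 3.
        print("Something's wrong at Boilerpipe --> %s --> %s" % (in_url, e))
        extracted_html = ""
    finally:
        # Returning from finally means no exception ever leaves this function.
        return extracted_html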

I have no idea why it hangs. Is there something wrong with the proc_pool code?

Erewhile answered 6/12, 2013 at 9:22 Comment(4)
What does store_results_to_db do?Entail
@Entail It adds the results to mongodb.Erewhile
Can you provide more details? What is ContentParser, and what is Extractor? I tried to mimic your problem, but in my case it works.Theurich
Looking into it, I found the source here. All the code is just thrown into __init__.py (edit: actually, ContentParser doesn't seem to be in there)Katlin

Can you try threading instead? Multiprocessing is mainly useful when you are CPU bound. Also, boilerpipe already includes protection for use with threading, which suggests that it may need protection under multiprocessing as well.

If you really need mp, I will try to figure out how to patch boilerpipe.

Here is what I guess will be a drop-in replacement using threading. It uses multiprocessing.pool.ThreadPool (which is a "fake" multiprocessing pool backed by threads). The only change is from Pool(...) to multiprocessing.pool.ThreadPool(...). The problem is that I'm not sure boilerpipe's multithreading test will detect the thread pool as having activeCount() > 1.

import multiprocessing
from multiprocessing.pool import ThreadPool  # hidden ThreadPool class

# ...
proc_pool = ThreadPool(processes=4)  # this is the only difference
for each_link in data:
    proc_pool.apply_async(process_link_for_feeds, args=(each_link, ), callback=store_results_to_db)
proc_pool.close()
proc_pool.join()
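
One way to check the open question about activeCount() is to ask the pool's workers directly. The following is only a minimal sketch: report_thread_state and collect_thread_states are hypothetical names, boilerpipe is not involved at all, and threading.active_count() is used as the current spelling of activeCount(). It shows what a threading-based guard would see from inside a ThreadPool worker, not what boilerpipe itself does with that value.

```python
import threading
from multiprocessing.pool import ThreadPool


def report_thread_state(task_id):
    # Hypothetical stand-in for process_link_for_feeds: it only records
    # what a threading-based check would observe inside a pool worker.
    return (task_id,
            threading.current_thread().name,
            threading.active_count())


def collect_thread_states(task_count=8, workers=4):
    # Same submit/close/join pattern as above, but the AsyncResult
    # objects are kept so the return values can be read afterwards.
    pool = ThreadPool(processes=workers)
    pending = [pool.apply_async(report_thread_state, args=(i,))
               for i in range(task_count)]
    pool.close()
    pool.join()
    return [result.get() for result in pending]


if __name__ == "__main__":
    for task_id, thread_name, active in collect_thread_states():
        print("task %d ran in %s with %d active threads"
              % (task_id, thread_name, active))
```

If every task reports more than one active thread and a thread name other than MainThread, a check of the form activeCount() > 1 should fire inside the pool. Whether that is enough for boilerpipe still has to be confirmed by running the real extraction.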
Katlin answered 15/12, 2013 at 14:34 Comment(8)
@dpatro, even if you actually need multiprocessing, could you try this? If this works, then it might be straightforward to patch boilerpipe for multiprocessing protection.Katlin
I also thought of the same thing (from the GitHub code). I moved to Java a week ago to get boilerpipe running. I will check this out today and update the thread.Erewhile
I checked, and it has been working continuously for over a day now. The processing has slowed down a bit, though, maybe due to the library. Thanks @kobejohn for the insight that led to the solution.Erewhile
@Erewhile that's great news. Thanks for letting me/everyone know it works. If multiprocessing worked faster, then it may be boilerpipe or the other parts of your code that are faster. If you can remove the other parts of the code and demonstrate that boilerpipe is significantly faster with multiprocessing, that would be a great way to submit a feature request on github.Katlin
But the problem is with multiprocessing, boilerpipe hangs. My problem in first place!Erewhile
@Erewhile Indeed. Complete failure of logic on my part there! Let me restate that. If the speed is important to you AND you are convinced it would work faster with multiprocessing, you could post another question on SO (or to the python-boilerpipe maintainer) specifically asking how to patch that piece of boilerpipe to work with multiprocessing. Sorry for the silly comment.Katlin
Sure. I will do that once I get some time after my work hours :)Erewhile
I got a chance to check things properly. It turns out that ThreadPool is faster than ProcessPool. My previous remark on speed was a mistake as I forgot to check the internet speed that day.Erewhile
