Does beautiful soup work with iron python? If so with which version of iron python? How easy is it to distribute a windows desktop app on .net 2.0 using iron python (mostly c# calling some python code for parsing html)?
I was asking myself this same question and after struggling to follow advice here and elsewhere to get IronPython and BeautifulSoup to play nicely with my existing code I decided to go looking for an alternative native .NET solution. BeautifulSoup is a wonderful bit of code and at first it didn't look like there was anything comparable available for .NET, but then I found the HTML Agility Pack and if anything I think I've actually gained some maintainability over BeautifulSoup. It takes clean or crufty HTML and produces a elegant XML DOM from it that can be queried via XPath. With a couple lines of code you can even get back a raw XDocument and then craft your queries in LINQ to XML. Honestly, if web scraping is your goal, this is about the cleanest solution you are likely to find.
Edit
Here is a simple (read: not robust at all) example that parses out the US House of Representatives holiday schedule:
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
namespace GovParsingTest
{
class Program
{
static void Main(string[] args)
{
HtmlWeb hw = new HtmlWeb();
string url = @"http://www.house.gov/house/House_Calendar.shtml";
HtmlDocument doc = hw.Load(url);
HtmlNode docNode = doc.DocumentNode;
HtmlNode div = docNode.SelectSingleNode("//div[@id='primary']");
HtmlNodeCollection tableRows = div.SelectNodes(".//tr");
foreach (HtmlNode row in tableRows)
{
HtmlNodeCollection cells = row.SelectNodes(".//td");
HtmlNode dateNode = cells[0];
HtmlNode eventNode = cells[1];
while (eventNode.HasChildNodes)
{
eventNode = eventNode.FirstChild;
}
Console.WriteLine(dateNode.InnerText);
Console.WriteLine(eventNode.InnerText);
Console.WriteLine();
}
//Console.WriteLine(div.InnerHtml);
Console.ReadKey();
}
}
}
I've tested and used BeautifulSoup with both IPy 1.1 and 2.0 (forget which beta, but this was a few months back). Leave a comment if you are still having trouble and I'll dig out my test code and post it.
If BeautifulSoup doesn't work on IronPython, it's because IronPython doesn't implement the whole Python language (the same way CPython does). BeautifulSoup is pure-python, no C-extensions, so the only problem is the compatibility of IronPython with CPython in terms of Python source code.There shouldn't be one, but if there is, the error will be obvious ("no module named ...", "no method named ...", etc.). Google says that only one of BS's tests fails with IronPython. it probably works, and that test may be fixed by now. I wouldn't know.
Try it out and see, would be my advice, unless anybody has anything more concrete.
Also, regarding one of the previous comments about compiling with -X:SaveAssemblies - that is wrong. -X:SaveAssemblies is meant as a debugging feature. There is a API meant for compiling python code into binaries. This post explains the API and the difference between the two modes.
We are distributing a 40k line IronPython application. We have not been able to compile the whole thing into a single binary distributable. Instead we have been distributing it as a zillion tiny dlls, one for each IronPython module. This works fine though.
However, on the newer release, IronPython 2.0, we have a recent spike which seems to be able to compile everything into a single binary file. This also results in faster application start-up too (module importing is faster.) Hopefully this spike will migrate into our main tree in the next few days.
To do the distribution we are using WiX, which is a Microsoft internal tool for creating msi installs, that has been open-sourced (or made freely available, at least.) It has given us no problems, even though our install has some quite fiddly requirements. I will definitely look at using WiX to distribute other IronPython projects in the future.
Seems to work just fine with IronPython 2.7. Just need to point it at the right folder and away you go:
D:\Code>ipy
IronPython 2.7 (2.7.0.40) on .NET 4.0.30319.235
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append("D:\Code\IronPython\BeautifulSoup-3.2.0")
>>> import urllib2
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib2.urlopen("http://www.example.com")
>>> soup = BeautifulSoup(page)
<string>:1: DeprecationWarning: object.__new__() takes no parameters
>>> i = soup('img')[0]
>>> i['src']
'http://example.com/blah.png'
I haven't tested it, but I'd say it'll most likely work with the latest IPy2.
As for distribution, it's very simple. Use the -X:SaveAssemblies option to compile your Python code down to a binary and then ship it with your other DLLs and the IPy dependencies.
Yes, it's possible. I'm using ironpython v3.4.0 with the latest versions of bs4 (v4.12.2) and soupsieve (v2.4.1).
Copy bs4
and soupsieve
folders from your cpython env to your {IPYTHON_DIR}/lib/site-packages
folder. Alternately, you can put them elsewhere and call sys.path.append()
to add the directory.
Edit bs4\builder\_lxml.py
and comment out the following lines:
# if len(markup) > 0 and markup[0] == u'\N{BYTE ORDER MARK}':
# markup = markup[1:]
If anyone knows how to make the above snippet compatible with ipython 3.4, please suggest edits.
Now, fire up your ipy console...
import bs4, soupsieve as sv
text = """<div><!-- These are animals --><p class="a">Cat</p><p class="b">Dog</p><p class="c">Mouse</p></div>"""
bs = bs4.BeautifulSoup(text)
bs.select('p:is(.a, .b, .c)')
bs.select_one('p:is(.a, .b, .c)')
sv.select('p:is(.a, .b, .c)', bs)
sv.select_one('p:is(.a, .b, .c)', bs)
If you have the complete standard library and the real re
module (google for IronPython community edition) it might work. But IronPython is an incredible bad python implementation, I wouldn't count on that.
Besides, give html5lib
a try. That parser parses with the same rules firefox parses documents.
© 2022 - 2024 — McMap. All rights reserved.