IronPython, Beautiful Soup, Win32 app

10

21

Does Beautiful Soup work with IronPython? If so, with which version of IronPython? How easy is it to distribute a Windows desktop app on .NET 2.0 using IronPython (mostly C# calling some Python code for parsing HTML)?

Benedict answered 23/9, 2008 at 1:37 Comment(0)
34

I was asking myself this same question, and after struggling to follow advice here and elsewhere to get IronPython and BeautifulSoup to play nicely with my existing code, I decided to look for a native .NET alternative. BeautifulSoup is a wonderful bit of code, and at first it didn't look like there was anything comparable available for .NET, but then I found the HTML Agility Pack. If anything, I think I've actually gained some maintainability over BeautifulSoup. It takes clean or crufty HTML and produces an elegant XML DOM that can be queried via XPath. With a couple of lines of code you can even get back a raw XDocument and then craft your queries in LINQ to XML. Honestly, if web scraping is your goal, this is about the cleanest solution you are likely to find.

Edit

Here is a simple (read: not robust at all) example that parses out the US House of Representatives holiday schedule:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

namespace GovParsingTest
{
    class Program
    {
        static void Main(string[] args)
        {
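            // Download and parse the page.
            HtmlWeb hw = new HtmlWeb();
            string url = @"http://www.house.gov/house/House_Calendar.shtml";
            HtmlDocument doc = hw.Load(url);

            HtmlNode docNode = doc.DocumentNode;
            HtmlNode div = docNode.SelectSingleNode("//div[@id='primary']");
            if (div == null)
            {
                Console.WriteLine("No div with id 'primary' found.");
                return;
            }

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection tableRows = div.SelectNodes(".//tr");
            if (tableRows == null)
            {
                Console.WriteLine("No table rows found.");
                return;
            }

            foreach (HtmlNode row in tableRows)
            {
                HtmlNodeCollection cells = row.SelectNodes(".//td");
                if (cells == null || cells.Count < 2)
                {
                    continue; // skip rows without a date cell and an event cell
                }

                HtmlNode dateNode = cells[0];
                HtmlNode eventNode = cells[1];

                // Descend to the innermost first child to get at the event text.
                while (eventNode.HasChildNodes)
                {
                    eventNode = eventNode.FirstChild;
                }

                Console.WriteLine(dateNode.InnerText);
                Console.WriteLine(eventNode.InnerText);
                Console.WriteLine();
            }

            //Console.WriteLine(div.InnerHtml);
            Console.ReadKey(); // keep the console window open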
        }
    }
}
Harelda answered 4/10, 2008 at 19:4 Comment(1)
HAP is a good solution; I've got a bunch of apps in production using it. I've used Mozilla Html Parser and there's not much difference. Supramolecular
8

I've tested and used BeautifulSoup with both IPy 1.1 and 2.0 (forget which beta, but this was a few months back). Leave a comment if you are still having trouble and I'll dig out my test code and post it.

Sabinasabine answered 23/9, 2008 at 1:53 Comment(0)
5

If BeautifulSoup doesn't work on IronPython, it's because IronPython doesn't implement the whole Python language the way CPython does. BeautifulSoup is pure Python with no C extensions, so the only possible problem is IronPython's compatibility with CPython at the source-code level. There shouldn't be one, but if there is, the error will be obvious ("no module named ...", "no method named ...", etc.). Google says that only one of BS's tests fails with IronPython. It probably works, and that test may be fixed by now. I wouldn't know.

Try it out and see, would be my advice, unless anybody has anything more concrete.
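
One way to try it out is a small import probe. This is only a sketch: `try_import` is a hypothetical helper name, not part of IronPython or BeautifulSoup, and the module names passed to it are whatever you want to check. It relies only on the built-in `__import__`, so it should behave the same way under CPython and a reasonably recent IronPython.

```python
def try_import(module_name):
    """Return (True, None) if module_name imports, else (False, error text).

    try_import is a hypothetical helper for illustration only.
    """
    try:
        __import__(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # "BeautifulSoup" is the BeautifulSoup 3 module name; use "bs4" for version 4.
    for name in ("re", "BeautifulSoup"):
        ok, error = try_import(name)
        print("%s: %s" % (name, "ok" if ok else error))
```

A failed import only tells you that a module is missing; a missing method would show up later, when the code that needs it actually runs.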

Lalalalage answered 23/9, 2008 at 1:43 Comment(0)
2

Also, regarding one of the previous comments about compiling with -X:SaveAssemblies: that is wrong. -X:SaveAssemblies is meant as a debugging feature. There is an API meant for compiling Python code into binaries. This post explains the API and the difference between the two modes.

Orthopedist answered 23/9, 2008 at 20:16 Comment(0)
1

Regarding the second part of your question, you can use the DLR Hosting APIs to run IronPython code from within a C# application. The DLR hosting spec is here. This blog also contains some sample hosting applications.
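
On the Python side of such a setup, the hosted code can be an ordinary module that exposes one function for the C# host to call. The sketch below is an assumption about how such a module might look, not code from the DLR spec: `LinkCollector` and `extract_links` are hypothetical names, and it uses only the standard-library HTML parser so that it needs nothing beyond the standard library. How the host loads the file and looks up the function is what the hosting spec covers.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Entry point a host application could call; returns a list of hrefs."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Returning plain lists and strings should keep the boundary between the two languages simple, since the host does not need to know about any Python classes.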

Orthopedist answered 23/9, 2008 at 20:10 Comment(0)
1

We are distributing a 40k-line IronPython application. We have not been able to compile the whole thing into a single binary distributable. Instead we have been distributing it as a zillion tiny DLLs, one for each IronPython module. This works fine though.

However, on the newer release, IronPython 2.0, we have a recent spike which seems to be able to compile everything into a single binary file. This also results in faster application start-up (module importing is faster). Hopefully this spike will migrate into our main tree in the next few days.

To do the distribution we are using WiX, a Microsoft-internal tool for creating MSI installers that has since been open-sourced (or made freely available, at least). It has given us no problems, even though our install has some quite fiddly requirements. I will definitely look at using WiX to distribute other IronPython projects in the future.

World answered 13/11, 2008 at 15:31 Comment(0)
1

Seems to work just fine with IronPython 2.7. Just need to point it at the right folder and away you go:

D:\Code>ipy
IronPython 2.7 (2.7.0.40) on .NET 4.0.30319.235
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.append("D:\Code\IronPython\BeautifulSoup-3.2.0")
>>> import urllib2
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib2.urlopen("http://www.example.com")
>>> soup = BeautifulSoup(page)
<string>:1: DeprecationWarning: object.__new__() takes no parameters
>>> i = soup('img')[0]
>>> i['src']
'http://example.com/blah.png'
Berretta answered 1/7, 2011 at 14:24 Comment(0)
0

I haven't tested it, but I'd say it'll most likely work with the latest IPy2.

As for distribution, it's very simple. Use the -X:SaveAssemblies option to compile your Python code down to a binary and then ship it with your other DLLs and the IPy dependencies.

Protozoan answered 23/9, 2008 at 1:42 Comment(0)
0

Yes, it's possible. I'm using IronPython v3.4.0 with the latest versions of bs4 (v4.12.2) and soupsieve (v2.4.1).

Copy the bs4 and soupsieve folders from your CPython environment to your {IPYTHON_DIR}/lib/site-packages folder. Alternatively, you can put them elsewhere and call sys.path.append() to add the directory.
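
The `sys.path.append()` route can be wrapped so that repeated calls do not add the same directory twice. This is a minimal sketch: `add_package_dir` is a hypothetical helper, and the directory in the usage comment is a placeholder, not a path from this answer.

```python
import os
import sys


def add_package_dir(path):
    """Append path to sys.path unless already listed; return the absolute path."""
    full = os.path.abspath(path)
    if full not in sys.path:
        sys.path.append(full)
    return full


# Placeholder usage: the folder that contains the copied bs4 and soupsieve packages.
# add_package_dir(r"C:\path\to\extra-packages")
```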

Edit bs4\builder\_lxml.py and comment out the following lines:

        # if len(markup) > 0 and markup[0] == u'\N{BYTE ORDER MARK}':
        #   markup = markup[1:]

If anyone knows how to make the above snippet compatible with IronPython 3.4, please suggest edits.
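
One possible rewrite, offered as an untested guess: if the problem is the `\N{BYTE ORDER MARK}` named escape, the same character can be written by its code point, U+FEFF. The sketch below shows the check as a standalone function. `strip_bom` is a hypothetical name, and whether this resolves the IronPython 3.4 error has not been verified.

```python
BOM = u"\ufeff"  # U+FEFF, the character that \N{BYTE ORDER MARK} names


def strip_bom(markup):
    """Drop a leading byte order mark from a text string, if present."""
    if len(markup) > 0 and markup[0] == BOM:
        return markup[1:]
    return markup
```

In bs4\builder\_lxml.py the equivalent change would be to keep the two lines and replace only the named escape in the comparison with `u'\ufeff'`.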

Now, fire up your ipy console...

import bs4, soupsieve as sv

text = """<div><!-- These are animals --><p class="a">Cat</p><p class="b">Dog</p><p class="c">Mouse</p></div>"""
bs = bs4.BeautifulSoup(text, "html.parser")  # name the parser explicitly so bs4 does not have to guess

bs.select('p:is(.a, .b, .c)')
bs.select_one('p:is(.a, .b, .c)')

sv.select('p:is(.a, .b, .c)', bs)
sv.select_one('p:is(.a, .b, .c)', bs)
Whey answered 2/6, 2023 at 7:29 Comment(0)
-2

If you have the complete standard library and the real re module (google for IronPython community edition) it might work. But IronPython is an incredibly bad Python implementation, so I wouldn't count on that.

Besides, give html5lib a try. That parser follows the same rules Firefox uses to parse documents.

Perquisite answered 23/9, 2008 at 7:58 Comment(2)
I don't use IronPython, but what I've read so far about it does not certify the "incredibly bad python implementation" [typo fixed]. Lyre
I certainly don't consider IronPython to be incredibly bad. It does just fine on loads of stuff. Just don't expect it to be a drop-in replacement for CPython. Off
