mechanize submit form character encoding problem

I am trying to scrape http://www.nscb.gov.ph/ggi/database.asp, specifically all the tables you get from selecting the municipalities/provinces. I am using Python with lxml.html and mechanize. My scraper works fine so far, but I get HTTP Error 500: Internal Server Error when submitting municipality[19], "Peñarrubia, Abra". I suspect this is due to the character encoding: my guess is that the eñe character (n with a tilde above) causes the problem. How can I fix this?

A working example of this part of my script is shown below. As I am just starting out in Python (and often use snippets I find on SO), any further comments are greatly appreciated.

from BeautifulSoup import BeautifulSoup
import mechanize
import lxml.html
import csv



class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

site = "http://www.nscb.gov.ph/ggi/database.asp"

output_mun = csv.writer(open(r'output-municipalities.csv','wb'))
output_prov = csv.writer(open(r'output-provinces.csv','wb'))

br = mechanize.Browser()
br.add_handler(PrettifyHandler())


# gets municipality stats
response = br.open(site)
br.select_form(name="form2")
muns = br.find_control("strMunicipality2", type="select").items
# municipality #19 is not working, those before do
for pos, item in enumerate(muns[19:]): 
    br.select_form(name="form2")
    br["strMunicipality2"] = [item.name]
    print pos, item.name 
    response = br.submit(id="button2", type="submit")
    html = response.read()
    root = lxml.html.fromstring(html)
    table = root.xpath('//table')[1]
    data = [
               [td.text_content().strip() for td in row.findall("td")] 
               for row in table.findall("tr")
           ]
    print data, "\n"
    for row in data[2:]:
        if row: 
            row.append(item.name)
            output_mun.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
    response = br.open(site) #go back button not working

# provinces follow here

Thank you very much!

Edit: to be specific, the error occurs on this line:

response = br.submit(id="button2", type="submit")
Minda answered 7/7, 2011 at 11:57 Comment(6)
Interesting question. I had a crack at solving it but came up with nothing. It seems to me that the problem is not in your own code: if you change the encoding of item.name, mechanize throws insufficient items with name 'whatever_here'. So form selection with item.name happens correctly, but on "send" the wrong data is passed to the server. I noticed that the page you are scraping is in iso-8859-1, not utf-8, but changing the encoding to latin-1 did not work either. Curious to see if somebody will solve it! - Brouhaha
I also tried to change the encoding of mechanize as suggested here by setting br._factory.encoding = "iso-8859-1", br._factory._forms_factory.encoding = "iso-8859-1", br._factory._links_factory._encoding = "iso-8859-1", but it didn't work. - Minda
Figure out how to set the "Content-Type" response header to "text/html; charset=iso-8859-1". - Ryon
I have tried the solution from the mechanize documentation, but to no avail. Oddly, the error occurs when submitting the form. - Minda
Sniff a browser request with these values and a request made by your code (you can use Wireshark). Perhaps you should send the form data in the encoding the server declares in the Content-Type header of the page that contains the form. The server sends Content-Type: text/html without a charset, so the browser normally picks the encoding from the page content (latin1). Does a browser request work? - Import
item.name is utf8, but the server is supposed to have its data in latin-1. br["strMunicipality2"] = [ item.name.decode('utf-8').encode('latin-1') ] raises the error insufficient items with name 'Pe\xf1arrubia+Abra'. It looks like mechanize uses utf8 internally but sends the form data as is. I'm looking for where to hook in to convert the text to latin-1 before sending the form. - Import

OK, found it. It's BeautifulSoup that converts the page to unicode, and prettify returns utf-8 by default. You should use:

response.set_data(soup.prettify(encoding='latin-1'))
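
To see why the encoding matters at submit time, here is a minimal sketch, independent of mechanize and BeautifulSoup, of how the same name is percent-encoded under each encoding. The name is the one quoted in the question and is used only for illustration; the actual option value on the page may differ. The import fallback is there so the snippet runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

# Illustrative value taken from the question ("Peñarrubia, Abra").
name = u"Pe\u00f1arrubia, Abra"

# The same text, percent-encoded from two different byte encodings.
as_utf8 = quote_plus(name.encode("utf-8"))
as_latin1 = quote_plus(name.encode("latin-1"))

print(as_utf8)    # Pe%C3%B1arrubia%2C+Abra
print(as_latin1)  # Pe%F1arrubia%2C+Abra
```

If the server only recognizes the Latin-1 form, a form value that was re-encoded to UTF-8 along the way arrives as a different byte sequence, which would be consistent with the error described in the question.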
Import answered 18/7, 2011 at 12:48 Comment(1)
I had totally forgotten about BeautifulSoup. Thanks very much! - Minda

Quick and dirty hack:

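from mechanize import HTMLForm

def _pairs(self):
    # Re-encode every form value from UTF-8 to Latin-1 just before submission.
    return [(k, v.decode('utf-8').encode('latin-1')) for (i, k, v, c_i) in self._pairs_and_controls()]

# Monkey-patch: relies on mechanize internals, which may change between versions.
HTMLForm._pairs = _pairs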

Or something less invasive (I think there are no other solutions, because the Item class protects the 'name' field):

item.__dict__['name'] = item.name.decode('utf-8').encode('latin-1')

placed before:

br["strMunicipality2"] = [item.name]
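
Both hacks rest on the same byte-level conversion. A minimal sketch of that step in isolation, assuming the value is a UTF-8 byte string as the comments suggest; `utf8_to_latin1` is a hypothetical helper name and is not part of mechanize:

```python
def utf8_to_latin1(value):
    """Re-encode a UTF-8 byte string as Latin-1 (hypothetical helper)."""
    return value.decode("utf-8").encode("latin-1")

utf8_bytes = b"Pe\xc3\xb1arrubia"          # n-tilde takes two bytes in UTF-8
latin1_bytes = utf8_to_latin1(utf8_bytes)  # and one byte (0xF1) in Latin-1

# Characters that do not exist in Latin-1 cannot be converted this way.
try:
    utf8_to_latin1(b"\xc5\x82")  # U+0142, not representable in Latin-1
    failed = False
except UnicodeEncodeError:
    failed = True
```

The conversion only succeeds for text that Latin-1 can represent, so this approach fits a page known to be served in iso-8859-1 but is not a general fix.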
Import answered 15/7, 2011 at 0:9 Comment(5)
Both solutions work (the only difference is that with the second one the eñe character is displayed as a grey blank in the terminal). But could you explain what these solutions do exactly, and why they work? I am new to all of this, and I had tried br["strMunicipality2"] = [ item.name.decode('utf-8').encode('latin-1') ], which didn't work. How exactly is item.__dict__['name'] = item.name.decode('utf-8').encode('latin-1') different? - Minda
Actually, the second solution does not work. It only works when I start the loop with the specific municipality, for pos, item in enumerate(muns[19:]):. If I remove the [19:] and other municipalities are submitted before the one in question, I still get insufficient items with name 'Pe\xf1arrubia+Abra'. The first solution does seem to work. - Minda
mechanize picks the select options from the page and, I'm not sure why, converts the item names to utf8 somewhere. When you select an item with br['xxx'] = [ 'abc' ], it does a lookup in the form values, so you should pass the names exactly as they appear in its item list. When it creates the POST, all key/value pairs are urlencoded; mechanize urlquotes the values as is (but it has already converted them to utf8). The page you are querying expects values in latin1. - Import
To see the correct character in the terminal you should use a locale (check the LANG environment variable) that uses latin1 (iso-8859-1), and the terminal should also display as latin1 (most probably your terminal uses utf8). - Import
The second solution may not work, or even more likely the first one, because it exploits internals of mechanize, and these can differ between versions. I need to look deeper to see if it's possible to fix this without hijacking the mechanize code. - Import
