Python Beautiful Soup parsing a UTF-8 coded table (using mechanize)

I'm trying to parse the following table, coded in UTF-8 (this is part of it):

<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
                            <tr class="gridHeader" valign="top">
                                <td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td><td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td><td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td><td class="titleGridReg" align="center" valign="top">שער בסיס</td><td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span>    
</td><td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
                            </tr><tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">

My code is:

html = br.response().read().decode('utf-8')
soup = BeautifulSoup(html)

table_id = "ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1"
table = soup.findall("table", id=table_id)

And I'm getting the following error:

TypeError: 'NoneType' object is not callable
Jodijodie answered 25/10, 2013 at 15:2 Comment(11)
Hey, can you show the full traceback? - Exarch
Current code is at: dpaste.de/w6cV - Jodijodie
@alKid I'm sorry, but what do you mean by full traceback? - Jodijodie
The whole traceback. Or is that the only thing that was printed when you executed the script? - Exarch
@alKid Yep. The problematic line is table = soup.findall("table", id=table_id). I changed it to table = soup.find(id=table_id) according to @GamesBrainiac and now it returns None. - Jodijodie
@Jodijodie We can't get access to that URL for some reason. At least I can't. - Wop
@GamesBrainiac dpaste.de/Nb4b - I created a new one, does that help? - Jodijodie
I'm getting access denied. I get: You don't have permission to access "http://www.tase.co.il/Heb/General/Company/Pages/companyHistoryData.aspx?" on this server. - Wop
@GamesBrainiac Oh, you meant the TASE URL. Will it help if I paste the HTML source in dpaste.de? - Jodijodie
@Jodijodie Well no, because that's what you just did, right? - Wop
@GamesBrainiac Here it is: dpaste.de/EWCK - and in case I forgot, thank you so much for trying to help! - Jodijodie

Since you are only searching by id, you can pass the id and nothing else, because ids are unique within a page: table = soup.find(id=table_id)
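
A note on the TypeError from the question: as far as I know, BeautifulSoup has no findall method (in bs4 it is spelled find_all), and an attribute it does not recognise is treated as a search for a child tag with that name. So soup.findall evaluates to None, and calling it gives "'NoneType' object is not callable". Below is a minimal sketch of that behaviour, assuming bs4 is installed. The short id history_grid and the data row are made up for illustration; only the header text comes from the table in the question.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page; the id and the data row are hypothetical.
html = """
<table id="history_grid">
  <tr class="gridHeader"><td>תאריך</td><td>שער פתיחה</td></tr>
  <tr><td>25/10/2013</td><td>100</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Unknown attributes are looked up as tag names, so this is
# soup.find("findall"), which finds nothing and returns None.
missing = soup.findall

# Calling that None reproduces the error from the question.
error = None
try:
    soup.findall("table", id="history_grid")
except TypeError as exc:
    error = str(exc)

# The working spellings: find() for one element, find_all() for a list.
table = soup.find(id="history_grid")
tables = soup.find_all("table", id="history_grid")

# Read the header cells of the first row.
header_cells = [td.get_text(strip=True) for td in table.find("tr").find_all("td")]
print(header_cells)
```

If find(id=...) still returns None on the real page, the table is most likely absent from the HTML that was actually downloaded, so it is worth printing or saving that HTML and searching it for the id.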

UPDATE

Using your paste:

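# encoding=utf-8
from bs4 import BeautifulSoup
import requests

# Fetch the raw paste containing the page's HTML source.
data = requests.get('https://dpaste.de/EWCK/raw/')
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data.text, 'html.parser')
# print(...) with a single argument behaves the same on Python 2 and 3.
print(soup.find("table",
                id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1"))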

I'm using the Python requests library to fetch the page, which is equivalent to what you are doing with mechanize. The code above works and finds the table with the given ID. For a change, try dropping .decode('utf-8') and passing the result of br.response().read() to BeautifulSoup directly.
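
To illustrate the suggestion about skipping .decode('utf-8'), here is a small sketch, assuming bs4 is installed. BeautifulSoup accepts raw bytes as well as text, and its from_encoding argument lets you state the encoding when you already know it. The name raw_bytes is a stand-in for whatever br.response().read() returns, and the one-cell table with id "t" is made up for the example.

```python
from bs4 import BeautifulSoup

# Stand-in for br.response().read(): the undecoded bytes of a UTF-8 page.
raw_bytes = (
    '<html><head><meta charset="utf-8"></head><body>'
    '<table id="t"><tr><td>שער בסיס</td></tr></table>'
    '</body></html>'
).encode("utf-8")

# Hand the bytes straight to BeautifulSoup and tell it the encoding
# instead of decoding by hand first.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")

# The Hebrew cell text comes back as a normal unicode string.
cell = soup.find(id="t").find("td").get_text()
print(cell)
```

Either route (decoding yourself or passing bytes) should give the same tree for a correctly encoded page, so if the lookup still returns None the encoding is probably not the cause.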

Wop answered 25/10, 2013 at 15:6 Comment(8)
Thanks, but now it returns None. - Jodijodie
@Jodijodie That's strange, it works fine on my computer. Is it because you're using Canopy? Try running this with plain Python. - Wop
@Exarch How does it not? - Wop
@Jodijodie I think I know where the problem is: when you encode it into ASCII, you get lots of slashes. This messes up BeautifulSoup's parsing. - Wop
@GamesBrainiac I don't understand. I removed the .decode and it still returns None. If you're telling me to use requests, I have a problem with that, which is why I'm using mechanize. - Jodijodie
Use mechanize, keep everything as is, just remove the decode part. - Wop
@GamesBrainiac I'm sorry, but it returns None... code is at: dpaste.de/DrdD - Jodijodie
@Jodijodie Then I'm really out of ideas. I don't have access to your side, so I can't see anything for myself. - Wop
