A (presumably basic) web-scraping question: http://www.ssa.gov/cgi-bin/popularnames.cgi with urllib
I am very new to Python (and web scraping). Let me ask you a question.

Many websites do not actually expose specific URLs in Firefox or other browsers. For example, the Social Security Administration shows popular baby names with ranks (since 1880), but the URL does not change when I change the year from 1880 to 1881. It stays constant:

http://www.ssa.gov/cgi-bin/popularnames.cgi

Because I don't know a year-specific URL, I can't download each year's page using urllib.

The page source includes:

<input type="text" name="year" id="yob" size="4" value="1880">

So presumably, if I can control this "year" value (like "1881" or "1991"), I can solve this problem. Am I right? I still don't know how to do it.

Can anybody tell me the solution for this please?

If you know some websites that may help my study, please let me know.

THANKS!

Felony answered 20/6, 2013 at 18:23 Comment(0)
P
7

You can still use urllib. The button performs a POST to the same URL. Using Firefox's Firebug, I took a look at the network traffic and found they're sending 3 parameters: member, top, and year. You can send the same arguments:

import urllib
url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'

post_params = { # member was blank, so I'm excluding it.
    'top'  : '25',
    'year' : year
    }
post_args = urllib.urlencode(post_params)

Now, just send the url-encoded arguments:

urllib.urlopen(url, post_args)

If you need to send headers as well, note that urllib.urlopen can't set custom headers; build a urllib2.Request instead:

headers = {
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-US,en;q=0.5',
    'Connection' : 'keep-alive',
    'Host' : 'www.ssa.gov',
    'Referer' : 'http://www.ssa.gov/cgi-bin/popularnames.cgi',
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0'
    }

# With POST data and headers:
import urllib2
response = urllib2.urlopen(urllib2.Request(url, post_args, headers))

Execute the code in a loop:

for year in xrange(1880, 2014):
    # The above code...
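For readers on Python 3 (where urllib was split into urllib.request and urllib.parse), here is a sketch of the equivalent POST. It assumes the form still accepts the same top and year parameters, which may have changed since this answer was written; the request is built but not sent, to keep the sketch offline.

```python
# Python 3 sketch of the same POST. Assumes the form still takes
# "top" and "year"; verify with your browser's network tools first.
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.ssa.gov/cgi-bin/popularnames.cgi'

def build_request(year, top=25):
    # POST bodies must be bytes in Python 3, hence .encode()
    post_args = urlencode({'top': str(top), 'year': str(year)}).encode('ascii')
    headers = {'User-Agent': 'Mozilla/5.0'}  # some servers reject the default UA
    return Request(url, data=post_args, headers=headers)

req = build_request(1880)
# urllib.request.urlopen(req) would perform the actual POST.
```

Passing data= makes urllib.request treat the request as a POST automatically.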
Pseudohermaphroditism answered 20/6, 2013 at 18:59 Comment(11)
This is amazing! I really appreciate you!Felony
Any time! Don't forget to mark this question as correct to help anybody else that might visit this page in the future! =)Pseudohermaphroditism
In my eyes, the most important points in your answer are post_arg (is it post argument?), headers, and parameters that can be traced with Firebug! I did not know them at all and I just learned them. Thank you again.Felony
Yes, post_args = post arguments, as in the arguments sent in a POST. I'm glad I could teach you something! And again, if this answer helped you, you should accept it as correct (click the little checkmark under the answer score) to help anybody that might read your question in the future =)Pseudohermaphroditism
Aha, I just checked it now. I am also new to stackoverflow.com :)Felony
So, wishing to solve the exact same problem as @Hyun, I came across this, but found that this method doesn't seem to be working (I only ever get the default of the top 10 names for 2014). I've dug around in Chrome's devtools for a while, but can't figure out why this no longer works. Ideas?Transmute
@Transmute Interesting. I get the same results when running the posted code. Did you check to see if member was being posted now? When I answered this question it was being omitted.Pseudohermaphroditism
@Pseudohermaphroditism Sorry can you clarify "the same results"? Do you mean the same problem I'm having, the same correct results as when you initially answered the question? And I was trying to figure out if "member" was being passed, but couldn't figure out how to in either Chrome devtools or firebug.Transmute
@Transmute Sorry, I meant I was getting the same results as you. To check what is in the HTTP POST request in Firebug, click the Net tab (you might have to click to enable it if it's your first time). Click Persist if the POST results in a redirect, to retain the request information between pages. You'll see the request type (usually GET or POST) alongside the URL. Click on the POST to expand it. You'll have even more options at this point, including Headers and Post. Click Post to see which parameters were sent with what values.Pseudohermaphroditism
Well, I looked and it looks just to be year and top, with an optional n parameter (which determines if it gives percentage of births or raw counts). But whether I include that or not (and whether or not I include header info) I reliably just get the top 10 names for 2014....not sure what the problem is.Transmute
Check to see if any additional requests are made after the page loads. It could be that the page is now making AJAX requests to get the remaining names.Pseudohermaphroditism

I recommend using Scrapy. It's a very powerful and easy-to-use tool for web-scraping. Why it is worth trying:

  1. Speed/performance/efficiency

Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it's implemented with non-blocking (aka asynchronous) code for concurrency.

  2. Database pipelining

    Scrapy has Item Pipelines feature:

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

    So, each page can be written to the database immediately after it has been downloaded.

  3. Code organization

Scrapy offers you a nice and clear project structure, where settings, spiders, items, pipelines, etc. are separated logically. That alone makes your code clearer and easier to support and understand.

  4. Time to code

Scrapy does a lot of work for you behind the scenes. This lets you focus on the actual code and logic, and not think about the "metal" part: creating processes, threads, etc.

Yeah, you got it - I love it.
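The pipeline idea in point 2 can be sketched with a plain class, since Scrapy only requires a process_item(item, spider) method; the SQLite table and the item fields below are made up for illustration, not taken from any real spider:

```python
import sqlite3

# Minimal Item Pipeline sketch: Scrapy calls process_item() once per
# scraped item. The "names" table and item fields are illustrative only.
class SQLitePipeline(object):
    def __init__(self, db_path=':memory:'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS names (year INTEGER, rank INTEGER, name TEXT)')

    def process_item(self, item, spider):
        # Each page's items hit the database right after being scraped.
        self.conn.execute('INSERT INTO names VALUES (?, ?, ?)',
                          (item['year'], item['rank'], item['name']))
        self.conn.commit()
        return item  # pipelines must return the item for later stages

# Simulate what Scrapy would do for one scraped item:
pipe = SQLitePipeline()
pipe.process_item({'year': 1880, 'rank': 1, 'name': 'John'}, spider=None)
```

In a real project you would register the class under ITEM_PIPELINES in settings.py and Scrapy would drive it for you.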

To get started, follow the tutorial in the official Scrapy documentation.

Hope that helps.

Blunge answered 20/6, 2013 at 19:24 Comment(0)

I recommend using a tool such as mechanize. This will allow you to programmatically navigate web pages using Python. There are many tutorials on how to use it. Basically, what you'll want to do in mechanize is the same thing you do in the browser: fill in the textbox, hit the "Go" button, and parse the webpage you get in the response.

Heerlen answered 20/6, 2013 at 18:33 Comment(0)

I've used the mechanize/BeautifulSoup libraries for stuff like this previously. If I had a project like this now, I'd also look at https://github.com/scrapy/scrapy

Sacken answered 20/6, 2013 at 19:00 Comment(0)
