web scraping to fill out (and retrieve) search forms?
Asked Answered
O

4

9

I was wondering if it is possible to "automate" the task of typing in entries to search forms and extracting matches from the results. For instance, I have a list of journal articles for which I would like to get DOI's (digital object identifier); manually for this I would go to the journal articles search page (e.g., http://pubs.acs.org/search/advanced), type in the authors/title/volume (etc.) and then find the article out of its list of returned results, and pick out the DOI and paste that into my reference list. I use R and Python for data analysis regularly (I was inspired by a post on RCurl) but don't know much about web protocols... is such a thing possible (for instance using something like Python's BeautifulSoup?). Are there any good references for doing anything remotely similar to this task? I'm just as much interested in learning about web scraping and tools for web scraping in general as much as getting this particular task done... Thanks for your time!

Otti answered 23/7, 2009 at 7:11 Comment(2)
have you come up with a good solution to this problem? I found this after asking a similar (duplicate?) question here #9712039Ouachita
@David - nope, sorry. Have not gotten far enough with any option to comment...Otti
G
10

Beautiful Soup is great for parsing webpages- that's half of what you want to do. Python, Perl, and Ruby all have a version of Mechanize, and that's the other half:

http://wwwsearch.sourceforge.net/mechanize/

Mechanize let's you control a browser:

# Follow a link
browser.follow_link(link_node)

# Submit a form
browser.select_form(name="search")
browser["authors"] = ["author #1", "author #2"]
browser["volume"] = "any"
search_response = br.submit()

With Mechanize and Beautiful Soup you have a great start. One extra tool I'd consider is Firebug, as used in this quick ruby scraping guide:

http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/

Firebug can speed your construction of xpaths for parsing documents, saving you some serious time.

Good luck!

Grundy answered 23/7, 2009 at 12:26 Comment(5)
Stephen! Mark me an answer! I'm racing a co-worker to 100 points :-)Grundy
I'm trying! I just got an OpenID but it tells me I have to have 15 reputation to vote up?? Sorry, first time on stackoverflow... is it this complicated?Otti
Heh, Thanks Stephen. You can always pick an answer, but you need 10 points to vote things up.Grundy
Ah... sorry I could not do more, but your answer was super helpful!Otti
What about Scrapy? in 2018 I think this is the most popular solutionDeina
B
7

Python Code: for search forms.

# import 
from selenium import webdriver

from selenium.common.exceptions import TimeoutException

from selenium.webdriver.support.ui import WebDriverWait # available since 2.4.0

from selenium.webdriver.support import expected_conditions as EC # available since 2.26.0

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# go to the google home page
driver.get("http://www.google.com")

# the page is ajaxy so the title is originally this:
print driver.title

# find the element that's name attribute is q (the google search box)
inputElement = driver.find_element_by_name("q")

# type in the search
inputElement.send_keys("cheese!")

# submit the form (although google automatically searches now without submitting)
inputElement.submit()

try:
    # we have to wait for the page to refresh, the last thing that seems to be updated is the title
    WebDriverWait(driver, 10).until(EC.title_contains("cheese!"))

    # You should see "cheese! - Google Search"
    print driver.title

finally:
    driver.quit()

Source: https://www.seleniumhq.org/docs/03_webdriver.jsp

Baumgardner answered 26/3, 2019 at 4:37 Comment(1)
Can you also do this with a currently running webbrowser ?Dawndawna
S
1
WebRequest req = WebRequest.Create("http://www.URLacceptingPOSTparams.com");

req.Proxy = null;
req.Method = "POST";
req.ContentType = "application/x-www-form-urlencoded";

//
// add POST data
string reqString = "searchtextbox=webclient&searchmode=simple&OtherParam=???";
byte[] reqData = Encoding.UTF8.GetBytes (reqString);
req.ContentLength = reqData.Length;
//
// send request
using (Stream reqStream = req.GetRequestStream())
  reqStream.Write (reqData, 0, reqData.Length);

string response;
//
// retrieve response
using (WebResponse res = req.GetResponse())
using (Stream resSteam = res.GetResponseStream())
using (StreamReader sr = new StreamReader (resSteam))
  response = sr.ReadToEnd();

// use a regular expression to break apart response
// OR you could load the HTML response page as a DOM 

(Adapted from Joe Albahri's "C# in a nutshell")

Substructure answered 23/7, 2009 at 7:24 Comment(1)
Thank you - good to know it is possible! ...I am guessing. (not too familiar with .NET, though I hear it is all the rage...)Otti
H
0

There are many tools for web scraping. There is a good firefox plugin called iMacros. It works great and needs no programming knowledge at all. The free version can be downloaded from here: https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/ The best thing about iMacros, is that it can get you started in minutes, and it can also be launched from the bash command line, and can also be called from within bash scripts.

A more advanced step would be selenium webdrive. The reason I chose selenium is that it is documented in a great way suiting beginners. reading just the following page:

would get you upand running in no time. Selenium supports java, python, php , c so if you are familiar with any of these languages, you would be familiar with all the commands needed. I prefer webdrive variation of selenium, as it opens a browser, so that you can check the fields and outputs. After setting up the script using webdrive, you can easily migrate the script to IDE, thus running headless.

To install selenium you can do by typing the command

sudo easy_install selenium

This will take care of the dependencies and everything needed for you.

In order to run your script interactively, just open a terminal, and type

python

you will see the python prompt, >>> and you can type in the commands.

Here is a sample code which you can paste in the terminal, it will search google for the word cheeses

package org.openqa.selenium.example;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

public class Selenium2Example  {
    public static void main(String[] args) {
        // Create a new instance of the Firefox driver
        // Notice that the remainder of the code relies on the interface, 
        // not the implementation.
        WebDriver driver = new FirefoxDriver();

        // And now use this to visit Google
        driver.get("http://www.google.com");
        // Alternatively the same thing can be done like this
        // driver.navigate().to("http://www.google.com");

        // Find the text input element by its name
        WebElement element = driver.findElement(By.name("q"));

        // Enter something to search for
        element.sendKeys("Cheese!");

        // Now submit the form. WebDriver will find the form for us from the element
        element.submit();

        // Check the title of the page
        System.out.println("Page title is: " + driver.getTitle());

        // Google's search is rendered dynamically with JavaScript.
        // Wait for the page to load, timeout after 10 seconds
        (new WebDriverWait(driver, 10)).until(new ExpectedCondition<Boolean>() {
            public Boolean apply(WebDriver d) {
                return d.getTitle().toLowerCase().startsWith("cheese!");
            }
        });

        // Should see: "cheese! - Google Search"
        System.out.println("Page title is: " + driver.getTitle());

        //Close the browser
        driver.quit();
    }}

I hope that this can give you a head start.

Cheers :)

Hannan answered 27/8, 2014 at 18:4 Comment(1)
First you're telling the user to install the Selenium Client for Python; but your code-example is Java code. This is confusing.Prier

© 2022 - 2024 — McMap. All rights reserved.