How to access Wikipedia from R?

Is there an R package that can query Wikipedia (most likely through the MediaWiki API) to get a list of articles relevant to a query, and to import selected articles for text mining?

Flexure answered 23/5, 2011 at 10:28 Comment(1)

You might find the following useful: ragtag.info/2011/feb/10/processing-every-wikipedia-article – Sealer 23/5, 2011 at 10:57

Use the RCurl package for retrieving info, and the XML or RJSONIO packages for parsing the response.

If you are behind a proxy, set your options.

# curl options passed to RCurl; replace the placeholder values with your own proxy details
opts <- list(
  proxy = "136.233.91.120", 
  proxyusername = "mydomain\\myusername", 
  proxypassword = 'whatever', 
  proxyport = 8080
)

Use the getForm function to access the API.

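library(RCurl)

search_example <- getForm(
  "https://en.wikipedia.org/w/api.php", 
  action  = "opensearch",  # suggest article titles matching the search string
  search  = "Te", 
  format  = "json",
  .opts   = opts           # omit this argument if you are not behind a proxy
)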
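
If a request fails for some search terms, it can help to look at the exact URL being requested and try it in a browser. The following is a minimal sketch in base R, assuming the same opensearch parameters as the call above. `build_opensearch_url` is a hypothetical helper (not part of RCurl), and the `limit` parameter is an addition that is not in the original call.

```r
# Hypothetical helper: assemble an opensearch query URL by hand (base R only).
build_opensearch_url <- function(term, limit = 10,
                                 base = "https://en.wikipedia.org/w/api.php") {
  params <- c(
    action = "opensearch",
    search = term,
    limit  = as.character(limit),  # assumption: cap the number of suggestions
    format = "json"
  )
  # Percent-encode each value, including reserved characters such as spaces and "&"
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste(names(encoded), encoded, sep = "=", collapse = "&"))
}

url <- build_opensearch_url("R (programming language)")
print(url)
```

Pasting the printed URL into a browser shows whether a problem comes from the search term itself or from the network and proxy settings.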

Parse the results.

library(RJSONIO)
fromJSON(rawToChar(search_example))
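
An opensearch response is a four-element JSON array: the search string, the matching titles, their descriptions, and their URLs. As a rough sketch of what to do with the parsed result, the snippet below turns such a list into a data frame. The `parsed` object is a hand-written stand-in for the output of `fromJSON`, with illustrative values, so the code runs without a network connection.

```r
# Stand-in for the parsed response; the values are illustrative, not real API output
parsed <- list(
  "Te",                                           # the search string
  c("Tea", "Tennis"),                             # matching article titles
  c("", ""),                                      # descriptions (may be empty)
  c("https://en.wikipedia.org/wiki/Tea",
    "https://en.wikipedia.org/wiki/Tennis")       # article URLs
)

# Pair each title with its URL
results <- data.frame(
  title = parsed[[2]],
  url   = parsed[[4]],
  stringsAsFactors = FALSE
)
print(results)
```

The titles in `results$title` can then be used to decide which articles to download for text mining.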

Glassine answered 23/5, 2011 at 13:39 Comment(1)

I'm having problems with using this for some search terms, but I suspect it is an issue with the network I'm on. I need volunteers to check the sample code with different strings in the search parameter. – Glassine 23/5, 2011 at 13:43

There is WikipediR, 'A MediaWiki API wrapper in R'

library(devtools)
install_github("Ironholds/WikipediR")
library(WikipediR)

It includes these functions:

ls("package:WikipediR")
 [1] "wiki_catpages"      "wiki_con"           "wiki_diff"          "wiki_page"         
 [5] "wiki_pagecats"      "wiki_recentchanges" "wiki_revision"      "wiki_timestamp"    
 [9] "wiki_usercontribs"  "wiki_userinfo"

Here it is in use, getting the contribution details and user details for a bunch of users:

library(RCurl)
library(XML)

# scrape page to get usernames of users with highest numbers of edits
top_editors_page <- "http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits"
top_editors_table <- readHTMLTable(top_editors_page)
very_top_editors <- as.character(top_editors_table[[3]][1:5,]$User)

# setup connection to wikimedia project 
con <- wiki_con("en", project = c("wikipedia"))

# connect to API and get last 50 edits per user
user_data <- lapply(very_top_editors,  function(i) wiki_usercontribs(con, i) )
# and get information about the users (registration date, gender, editcount, etc)
user_info <- lapply(very_top_editors,  function(i) wiki_userinfo(con, i) )

Vishnu answered 4/6, 2014 at 2:4 Comment(0)

A great newer option is the wikifacts package (on CRAN):

library(wikifacts)
wiki_define('R (programming language)')
## R (programming language) 
## "R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in popularity; as of April 2021, R ranks 16th in the TIOBE index, a measure of popularity of programming languages.The official R software environment is a GNU package.\nIt is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems."

Walli answered 30/5, 2021 at 18:36 Comment(0)
