Extract the first paragraph from a Wikipedia article (Python)
How can I extract the first paragraph from a Wikipedia article, using Python?

For example, for Albert Einstein, that would be:

Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a theoretical physicist, philosopher and author who is widely regarded as one of the most influential and iconic scientists and intellectuals of all time. A German-Swiss Nobel laureate, Einstein is often regarded as the father of modern physics.[2] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".[3]

Unbuild answered 16/12, 2010 at 12:49 Comment(2)
urllib for getting the page and BeautifulSoup for parsing HTML. Though there are other ways of doing it, search for them on StackOverflow itself. This has been discussed lots of times.Kind
what markup do you want it in? mediawiki, html?Aecium
Some time ago I made two classes to get Wikipedia articles as plain text. I know they aren't the best solution, but you can adapt them to your needs:

    wikipedia.py
    wiki2plain.py

You can use it like this:

from wikipedia import Wikipedia
from wiki2plain import Wiki2Plain

lang = 'simple'
wiki = Wikipedia(lang)

try:
    raw = wiki.article('Uruguay')
except Exception:  # article missing or fetch failed
    raw = None

if raw:
    wiki2plain = Wiki2Plain(raw)
    content = wiki2plain.text
Mouthwatering answered 16/12, 2010 at 14:12 Comment(2)
In pastebin.com/FVDxLWNG, #REDIRECT does not work for it.wikipedia.org; it must be translated to Italian, like #RINVIA. I suspect #REDIRECT works only for English.Southerland
@joksnet, I believe using your user-defined class can be misleading, as the name conflicts with [wikipedia's Python API](pypi.org/project/wikipedia)Wordbook
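The comment above points out that the #REDIRECT magic word is localized per wiki. A minimal, hypothetical helper for detecting redirects in raw wikitext across languages might look like this (the keyword table is illustrative, not exhaustive; MediaWiki defines the real magic words in its per-language localisation files):

```python
# Hypothetical helper: detect a redirect in raw wikitext across languages.
# The keyword lists below are illustrative, not exhaustive.
import re

REDIRECT_KEYWORDS = {
    'en': ['#REDIRECT'],
    'it': ['#RINVIA', '#REDIRECT'],
    'de': ['#WEITERLEITUNG', '#REDIRECT'],
}

def redirect_target(wikitext, lang='en'):
    """Return the redirect target title, or None if the text is not a redirect."""
    for keyword in REDIRECT_KEYWORDS.get(lang, ['#REDIRECT']):
        # Match e.g. "#REDIRECT [[Target title]]" at the start of the page
        match = re.match(re.escape(keyword) + r'\s*\[\[([^\]|#]+)', wikitext,
                         re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None
```

With this, a caller can re-fetch the target article whenever `redirect_target` returns a title instead of None.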
I wrote a Python library that aims to make this very easy. Check it out at Github.

To install it, run

$ pip install wikipedia

Then to get the first paragraph of an article, just use the wikipedia.summary function.

>>> import wikipedia
>>> print(wikipedia.summary("Albert Einstein", sentences=2))

prints

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn] ( listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".

As far as how it works, wikipedia makes a request to the Mobile Frontend Extension of the MediaWiki API, which returns mobile-friendly versions of Wikipedia articles. Specifically, by passing the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers parse the wikitext and return a plain-text summary of the article you request, anywhere up to the entire page text. The API also accepts the parameters exchars and exsentences, which, unsurprisingly, limit the number of characters and sentences returned.

Cesaria answered 21/10, 2013 at 23:50 Comment(3)
The library is very well designed, and pretty easy to use! Good job. :)Darkish
prop=extracts was split out of MobileFrontend into a separate TextExtracts extension in 2014, but the API call is unchanged.Luzon
+1 for this nice library. I am working on a big project in which ~6k pages should be invoked. Any recommendation on how to use Wikipedia in this case? I mean rather than manually writing a list of page titles to feed into wikipedia.page()Ephemerid
Wikipedia runs a MediaWiki extension that provides exactly this functionality as an API module. TextExtracts implements action=query&prop=extracts with options to return the first N sentences and/or just the introduction, as HTML or plain text.

Here's the API call you want to make, try it: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&exintro=&exsentences=2&explaintext=&redirects=&formatversion=2

  • action=query&prop=extracts to request this info
  • exsentences=2, exintro=, and explaintext= are parameters to the module (see the first link for its API docs) asking for two sentences from the intro as plain text; leave off explaintext for HTML.
  • redirects=(true) so if you ask for "titles=Einstein" you'll get the Albert Einstein page info
  • formatversion=2 for a cleaner format in UTF-8.

There are various libraries that wrap invoking the MediaWiki action API, such as the one in DGund's answer, but it's not too hard to make the API calls yourself.
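For example, a sketch of making the call yourself with only the standard library (the function name `extract_url` is mine, and I've added `format=json`, which the browser link above omits but a script needs to get JSON back):

```python
# Build the TextExtracts query from the API call above using only the
# standard library; parameter names come straight from that call.
from urllib.parse import urlencode

API = 'https://en.wikipedia.org/w/api.php'

def extract_url(title, sentences=2):
    params = {
        'action': 'query',
        'prop': 'extracts',
        'titles': title,
        'exintro': '',            # only the lead section
        'exsentences': sentences, # limit to N sentences
        'explaintext': '',        # plain text instead of HTML
        'redirects': '',          # follow redirects such as "Einstein"
        'formatversion': 2,
        'format': 'json',
    }
    return API + '?' + urlencode(params)

# Fetching and decoding would then be, e.g.:
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(extract_url('Albert Einstein')))
#   print(data['query']['pages'][0]['extract'])
```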

Page info in search results discusses getting this text extract, along with getting a description and lead image for articles.

Luzon answered 11/11, 2015 at 5:20 Comment(0)
What I did is this:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

article= "Albert Einstein"
article = urllib.quote(article)

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this

resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()
resource.close()
soup = BeautifulSoup(data)
print soup.find('div',id="bodyContent").p
Phase answered 21/5, 2011 at 14:57 Comment(1)
Note: Python 3.x users will find that urllib2 is deprecated; its functionality moved into urllib.request, which is the module to use for fetching URLs there.Wordbook
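Following up on that comment, the same first-`<p>` idea can be sketched in Python 3 with the standard library alone; the `html.parser`-based class below is my rough stand-in for the BeautifulSoup call above, grabbing the text of the first paragraph element it sees:

```python
# Rough Python 3 stand-in for the BeautifulSoup approach above:
# capture the text content of the first <p> element in an HTML document.
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # >0 while inside the first <p>
        self.done = False   # True once the first <p> has closed
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)

def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    return ''.join(parser.text).strip()
```

You would feed it the decoded page body, e.g. `first_paragraph(data.decode('utf-8'))` where `data` comes from `urllib.request.urlopen(...).read()`.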
The relatively new REST API has a summary method that is perfect for this use, and does a lot of the things mentioned in the other answers here (e.g. removing wikicode). It even includes an image and geocoordinates if applicable.

Using the lovely requests module and Python 3:

import requests
r = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Amsterdam")
page = r.json()
print(page["extract"]) # Returns 'Amsterdam is the capital and...'
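One caveat (my addition, not from the answer): the title goes into the URL path, so titles containing spaces or slashes need percent-encoding first, e.g. with urllib.parse.quote:

```python
# Percent-encode a title before putting it into the REST summary URL.
from urllib.parse import quote

def summary_url(title):
    # safe='' also encodes '/' characters, which appear in some titles
    return 'https://en.wikipedia.org/api/rest_v1/page/summary/' + quote(title, safe='')

print(summary_url('Albert Einstein'))
# https://en.wikipedia.org/api/rest_v1/page/summary/Albert%20Einstein
```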
Herrah answered 7/6, 2018 at 12:43 Comment(0)
If you want library suggestions, BeautifulSoup and urllib2 come to mind. This has been answered on SO before: Web scraping with Python.

I tried urllib2 to get a page from Wikipedia, but it was 403 (forbidden). MediaWiki provides an API for Wikipedia, supporting various output formats. I haven't used python-wikitools, but it may be worth a try: http://code.google.com/p/python-wikitools/
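The 403 here is typically a User-Agent issue: Wikimedia asks clients to send a descriptive one, and the default Python-urllib agent is often rejected. A minimal Python 3 sketch (the tool name and contact address are placeholders; the request is built here but not sent):

```python
# Build a request carrying a descriptive User-Agent, as Wikimedia's
# user-agent policy asks for. The agent string below is a placeholder.
import urllib.request

req = urllib.request.Request(
    'https://en.wikipedia.org/w/api.php?action=query&titles=Uruguay&format=json',
    headers={'User-Agent': 'MyWikiTool/0.1 (contact@example.com)'},
)
# urllib.request.urlopen(req) would then perform the actual call.
```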

Koby answered 16/12, 2010 at 12:56 Comment(1)
Probably Wikipedia is blocking some user agents :)Puglia
First, I promise I am not being snarky.

Here's a previous question that might be of use: Fetch a Wikipedia article with Python

In this someone suggests using the wikipedia high level API, which leads to this question:

Is there a Wikipedia API?

Annuity answered 16/12, 2010 at 13:2 Comment(0)
As others have said, one approach is to use the MediaWiki API and urllib or urllib2. The code fragments below are part of what I used to extract the "lead" section, which has the article abstract and the infobox. This checks whether the returned text is a redirect instead of actual content, and also lets you skip the infobox if present (in my case I used different code to pull out and format the infobox).

import urllib

contentBaseURL = 'http://en.wikipedia.org/w/index.php?title='

def getContent(title):
    URL = contentBaseURL + title + '&action=raw&section=0'
    f = urllib.urlopen(URL)
    rawContent = f.read()
    f.close()
    return rawContent

def getLeadSection(title):
    rawContent = getContent(title)

    # Check if a redirect was returned.  If so, go to the redirection target
    if rawContent.find('#REDIRECT') == 0:
        # extract the redirection title from between [[ and ]]
        redirectStart = rawContent.find('[[') + 2
        redirectEnd = rawContent.find(']]', redirectStart)
        redirectTitle = rawContent[redirectStart:redirectEnd]
        print 'redirectTitle is: ', redirectTitle
        rawContent = getContent(redirectTitle)

    # Skip the Infobox, counting braces to find where the template ends
    infoboxStart = rawContent.find("{{Infobox")  # starts at the double {'s
    if infoboxStart != -1:
        count = 0
        infoboxEnd = 0
        for i, char in enumerate(rawContent[infoboxStart:]):
            if char == "{": count += 1
            if char == "}":
                count -= 1
                if count == 0:
                    infoboxEnd = i + infoboxStart + 1
                    break
        if infoboxEnd != 0:
            rawContent = rawContent[infoboxEnd:]

    return rawContent

You'll be getting back the raw text including wiki markup, so you'll need to do some clean up. If you just want the first paragraph, not the whole first section, look for the first new line character.
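The last step, picking out just the first paragraph from the cleaned-up section text, can be sketched like this (assuming the wiki markup has already been stripped):

```python
# After cleanup, the first paragraph is the text up to the first blank line
# (wikitext separates paragraphs with blank lines, not single newlines).
def first_para(text):
    text = text.strip()
    head = text.split('\n\n', 1)[0]
    # rejoin any soft line breaks inside the paragraph
    return head.replace('\n', ' ')
```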

Myotonia answered 21/12, 2010 at 2:29 Comment(0)
Try a combination of urllib to fetch the site and BeautifulSoup or lxml to parse the data.

Precipitant answered 16/12, 2010 at 12:55 Comment(1)
I'm very happy to parse html by hand. hoooo yeahhhPuglia
Try pattern.

pip install pattern

from pattern.web import Wikipedia
article = Wikipedia(language="af").search('Kaapstad', throttle=10)
print article.string
Olympia answered 21/7, 2014 at 20:50 Comment(2)
Cannot 'pip3 install pattern' for python3.6 ... SyntaxError: Missing parentheses in call to 'print'Biddable
Sadly it seems pattern is Python 2 only at the momentOlympia

© 2022 - 2024 — McMap. All rights reserved.