How to get plain text out of Wikipedia

I'd like to write a script that gets the Wikipedia description section only. That is, when I say

/wiki bla bla bla

it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

How can I do this?

Northern asked 15/12, 2010 at 16:09 Comment(4)
So you want to extract the first paragraph? – Melindamelinde
My answer to this question may help you. The TextExtracts extension to the API allows for more or less plain text extraction from articles. – Indulge
Possible duplicate of Extract the first paragraph from a Wikipedia article (Python). – Cystic
Related: https://mcmap.net/q/375113/-wikipedia-text-download – Futility

Here are a few different possible approaches; use whichever works for you. All my code examples below use requests for HTTP requests to the API; you can install requests with pip install requests if you have pip. They also all use the MediaWiki API, and two use the query endpoint; follow those links if you want documentation.

1. Get a plain text representation of either the entire page or the page "extract" straight from the API with the extracts prop

Note that this approach only works on MediaWiki sites with the TextExtracts extension. This notably includes Wikipedia, but not some smaller MediaWiki sites like, say, http://www.wikia.com/

You want to hit a URL like

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext

Breaking that down, we've got the following parameters in there (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts):

  • action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
  • prop=extracts makes us use the TextExtracts extension
  • exintro limits the response to content before the first section heading
  • explaintext makes the extract in the response be plain text instead of HTML

Then parse the JSON response and extract the extract:

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'extracts',
...         'exintro': True,
...         'explaintext': True,
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

2. Get the full HTML of the page using the parse endpoint, parse it, and extract the first paragraph

MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser like lxml (install it first with pip install lxml) to extract the first paragraph.

For example:

>>> import requests
>>> from lxml import html
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'parse',
...         'page': 'Bla Bla Bla',
...         'format': 'json',
...     }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

3. Parse wikitext yourself

You can use the query API to get the page's wikitext, parse it using mwparserfromhell (install it first using pip install mwparserfromhell), then reduce it down to human-readable text using strip_code. strip_code doesn't work perfectly at the time of writing (as shown clearly in the example below) but will hopefully improve.

>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'revisions',
...         'rvprop': 'content',
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

References

External links


Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino
Forespent answered 1/7, 2016 at 20:47 Comment(3)
Just make sure your "titles" are not capitalized. For example, "Data Integration" returns an error, but "data integration" works. – Odaniel
@Odaniel More precisely, make sure they're capitalized in the same way that Wikipedia capitalizes them. For proper nouns, that usually means capitalizing every word, but for titles that aren't proper nouns it often means only capitalizing the first word. – Forespent
This answer blows away the accepted answer. Bravo. It's irritating to me how difficult the Wikipedia API description actually is. It never once mentions the prop parameter, for instance. – Ciliolate

Use the MediaWiki API, which runs on Wikipedia. You will have to do some parsing of the data yourself.

For instance:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Bla%20Bla%20Bla

means

fetch (action=query) the content (rvprop=content) of the most recent revision (prop=revisions) of Bla Bla Bla (titles=Bla%20Bla%20Bla) in JSON format (format=json).

You will probably want to search for the query and use the first result, to handle spelling errors and the like.
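
For example, a minimal sketch of that call using only the standard library's urllib and json, as the comment below suggests (the pages object is keyed by page ID, so we take its first value; you'll still need to strip the wikitext markup yourself):

>>> import json
>>> from urllib.parse import urlencode
>>> from urllib.request import urlopen
>>> params = urlencode({
...     'action': 'query',
...     'prop': 'revisions',
...     'rvprop': 'content',
...     'format': 'json',
...     'titles': 'Bla Bla Bla',
... })
>>> with urlopen('https://en.wikipedia.org/w/api.php?' + params) as f:
...     response = json.load(f)
...
>>> page = next(iter(response['query']['pages'].values()))
>>> wikitext = page['revisions'][0]['*']  # raw wikitext of the latest revision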

Seeker answered 15/12, 2010 at 16:12 Comment(2)
That is helpful. But I still don't get how exactly I should do it. – Northern
@Wifi: I'm not going to write the code for you! You need to use urllib to go to a special URL, like the one above, and then use json to parse the result. You can work out which URL you need to access using the docs I linked to above. – Seeker

You can get wiki data in text format. If you need information for many titles, you can get it all in a single call: use the pipe character (|) to separate the titles.

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=

This API call returns data for both Google and Yahoo.

explaintext => Return extracts as plain text instead of limited HTML.

exlimit=max (currently 20) => Without this, only one extract is returned.

exintro => Return only content before the first section. If you want the full text, remove this parameter.

redirects= => Resolve redirects.
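
For example, a minimal sketch of that multi-title call in Python with requests (the same pattern as the requests examples above, just with pipe-separated titles):

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Yahoo|Google',
...         'prop': 'extracts',
...         'exlimit': 'max',
...         'exintro': True,
...         'explaintext': True,
...         'redirects': True,
...     }
... ).json()
>>> for page in response['query']['pages'].values():
...     print(page['title'])
...     print(page['extract'])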

Nationalist answered 10/6, 2015 at 18:36 Comment(0)

You can fetch just the first section using the API:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvsection=0&titles=Bla%20Bla%20Bla&rvprop=content

This will give you raw wikitext; you'll have to deal with templates and markup yourself.

Or you can fetch the whole page rendered into HTML, which has its own pros and cons as far as parsing goes:

http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Bla_Bla_Bla

I can't see an easy way to get parsed HTML of the first section in a single call, but you can do it with two calls: pass the wikitext you receive from the first URL back with text= in place of the page= in the second URL.

UPDATE

Sorry I neglected the "plain text" part of your question. Get the part of the article you want as HTML. It's much easier to strip HTML than to strip wikitext!
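
A rough sketch of that two-call approach in Python with requests (fetch the wikitext of section 0, then feed it back through action=parse; the contentmodel parameter just tells the parser the snippet is wikitext):

>>> import requests
>>> API = 'https://en.wikipedia.org/w/api.php'
>>> r1 = requests.get(API, params={
...     'action': 'query',
...     'prop': 'revisions',
...     'rvprop': 'content',
...     'rvsection': 0,
...     'titles': 'Bla Bla Bla',
...     'format': 'json',
... }).json()
>>> page = next(iter(r1['query']['pages'].values()))
>>> wikitext = page['revisions'][0]['*']
>>> r2 = requests.post(API, data={
...     'action': 'parse',
...     'text': wikitext,
...     'contentmodel': 'wikitext',
...     'prop': 'text',
...     'format': 'json',
... }).json()
>>> intro_html = r2['parse']['text']['*']  # HTML of just the first section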

Frontward answered 15/12, 2010 at 16:34 Comment(0)

DBpedia is a perfect fit for this problem. Look at http://dbpedia.org/page/Metallica to see Wikipedia's data organised as RDF. You can query it at http://dbpedia.org/sparql using SPARQL, the query language for RDF. There's always a way to find the page ID and fetch the descriptive text that way, but the abstract should do for the most part.

There will be a learning curve for RDF and SPARQL before you can write any useful code, but this is the perfect solution.

For example, a query run for Metallica returns an HTML table with the abstract in several different languages:

<table class="sparql" border="1">
  <tr>
    <th>abstract</th>
  </tr>
  <tr>
    <td><pre>"Metallica is an American heavy metal band formed..."@en</pre></td>
  </tr>
  <tr>
    <td><pre>"Metallica es una banda de thrash metal estadounidense..."@es</pre></td>
... 

SPARQL query:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbres: <http://dbpedia.org/resource/>

SELECT ?abstract WHERE {
 dbres:Metallica dbpedia-owl:abstract ?abstract.
}

Change "Metallica" to any resource name (resource name as in wikipedia.org/resourcename) for queries pertaining to abstract.

Cableway answered 28/10, 2014 at 14:6 Comment(0)

Alternatively, you can load the raw text of any wiki page with a URL like https://bn.wikipedia.org/w/index.php?title=User:ShohagS&action=raw&ctype=text

where you change bn to your wiki's language code and User:ShohagS to the page name. In your case, use: https://en.wikipedia.org/w/index.php?title=Bla_bla_bla&action=raw&ctype=text

In a browser, this returns a plain text file containing the raw wikitext.
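
Fetching that from a script is then a one-liner; for example, with requests (what comes back is raw wikitext, not rendered text):

>>> import requests
>>> wikitext = requests.get(
...     'https://en.wikipedia.org/w/index.php',
...     params={'title': 'Bla bla bla', 'action': 'raw', 'ctype': 'text'},
... ).text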

Saragossa answered 26/9, 2020 at 3:54 Comment(0)

You can use Python's wikipedia package, specifically the summary() function or a page's content attribute.

From the documentation:

>>> import wikipedia
>>> print(wikipedia.summary("Wikipedia"))
# Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia supported by the non-profit Wikimedia Foundation...

>>> wikipedia.search("Barack")
# [u'Barak (given name)', u'Barack Obama', u'Barack (brandy)', u'Presidency of Barack Obama', u'Family of Barack Obama', u'First inauguration of Barack Obama', u'Barack Obama presidential campaign, 2008', u'Barack Obama, Sr.', u'Barack Obama citizenship conspiracy theories', u'Presidential transition of Barack Obama']
>>> ny = wikipedia.page("New York")
>>> ny.title
# u'New York'
>>> ny.url
# u'http://en.wikipedia.org/wiki/New_York'
>>> ny.content
# u'New York is a state in the Northeastern region of the United States. New York is the 27th-most exten'...
Pomatum answered 17/1, 2022 at 19:14 Comment(0)

I think the better option is to use the extracts prop that the MediaWiki API provides. It returns only a few tags (b, i, h#, span, ul, li) and removes tables, infoboxes, references, etc.

http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Bla%20Bla%20Bla&format=xml gives you something very simple:

<api><query><pages><page pageid="4456737" ns="0" title="Bla Bla Bla"><extract xml:space="preserve">
<p>"<b>Bla Bla Bla</b>" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, <i>L'Amour Toujours</i>. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with <i>L'Amour Toujours (I'll Fly With You)</i> in its US radio version.</p> <p></p> <h2><span id="Background_and_writing">Background and writing</span></h2> <p>He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song <i>"Why Did You Do It"</i>.</p> <h2><span id="Music_video">Music video</span></h2> <p>The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.</p> <h2><span id="Chart_performance">Chart performance</span></h2> <h2><span id="References">References</span></h2> <h2><span id="External_links">External links</span></h2> <ul><li>Full lyrics of this song at MetroLyrics</li> </ul>
</extract></page></pages></query></api>

You can then run it through a regular expression. In JavaScript it would be something like this (you may have to make some minor modifications):

/^.*<\s*extract[^>]*\s*>\s*((?:[^<]*|<\s*\/?\s*[^>hH][^>]*\s*>)*).*<\s*(?:h|H).*$/.exec(data)

Which gives you (only paragraphs, bold, and italic):

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Attemper answered 22/4, 2015 at 13:58 Comment(0)

One option for turning an entire Wikipedia into text is downloading an HTML dump from Wikimedia: https://dumps.wikimedia.org/other/enterprise_html/ (Warning: You will need a lot of space.)

Then you can use, for example, Python to turn this into text using beautifulsoup. (In this example I also remove tables, because they don't turn into readable text easily.)

import json
import os

from bs4 import BeautifulSoup

FOLDER = r"D:\CzechWiki\cswiki-NS0-20240301-ENTERPRISE-HTML.json"  # Where you unpacked the files from the HTML dump

for file in os.listdir(FOLDER):
    if file.endswith(".ndjson"):
        with open(os.path.join(FOLDER, file), "r", encoding="utf-8") as f:
            for line in f:
                data = json.loads(line)
                soup = BeautifulSoup(data["article_body"]["html"], "lxml")
                for table in soup.find_all("table"):
                    table.decompose()
                text = soup.get_text()
                print(text)

The text will generally have no traces of wikitext and is mostly clean and pretty. One could still do more work removing things like category listings and Wikipedia maintenance notices ("this page needs to be expanded"), but it is a good starting point. (Parsing wikitext yourself is super difficult, and here you don't have to.)

Zippel answered 16/3 at 21:35 Comment(0)

"...a script that gets the Wikipedia description section only..."

For your application you might want to look at the dumps, e.g.: http://dumps.wikimedia.org/enwiki/20120702/

The particular files you need are 'abstract' XML files, e.g., this small one (22.7MB):

http://dumps.wikimedia.org/enwiki/20120702/enwiki-20120702-abstract19.xml

The XML has a tag called 'abstract' which contains the first part of each article.

Otherwise wikipedia2text uses, e.g., w3m to download the page with templates expanded and formatted to text. From that you might be able to pick out the abstract via a regular expression.
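
A sketch of pulling titles and abstracts out of one of those files with the standard library's ElementTree (I'm assuming the dump's <doc> elements each contain <title> and <abstract> children; iterparse streams the file so memory use stays flat):

import xml.etree.ElementTree as ET

# iterparse fires an event as each element finishes, so we never hold
# the whole dump in memory at once.
for event, elem in ET.iterparse('enwiki-20120702-abstract19.xml'):
    if elem.tag == 'doc':
        print(elem.findtext('title'), '->', elem.findtext('abstract'))
        elem.clear()  # free the element we've finished with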

Torruella answered 9/7, 2012 at 12:37 Comment(0)

First check here.

There is a lot of invalid syntax in MediaWiki's text markup (mistakes made by users...). Only MediaWiki itself can parse this hellish text. But there are still some alternatives to try in the link above. Not perfect, but better than nothing!

Stearn answered 16/5, 2018 at 9:23 Comment(0)

You can try the BeautifulSoup HTML parsing library for Python, but you'll have to write a simple parser.
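
For example, a minimal sketch of such a parser (fetching the rendered page and printing the first non-empty paragraph; the div.mw-parser-output selector is an assumption about Wikipedia's current page layout):

>>> import requests
>>> from bs4 import BeautifulSoup
>>> html = requests.get('https://en.wikipedia.org/wiki/Bla_Bla_Bla').text
>>> soup = BeautifulSoup(html, 'html.parser')
>>> for p in soup.select('div.mw-parser-output > p'):
...     if p.get_text(strip=True):  # skip empty placeholder paragraphs
...         print(p.get_text())
...         break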

Greece answered 15/12, 2010 at 16:17 Comment(0)

You can try WikiExtractor: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

It's for Python 2.7 and 3.3+.

Denticle answered 20/12, 2012 at 20:56 Comment(0)
