Parsing a Wikipedia dump

For example, using this Wikipedia dump:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm

Is there an existing library for Python that I can use to create a mapping of subjects to values?

For example:

{height_ft,6},{nationality, American}
Rainey answered 11/8, 2010 at 22:44 Comment(0)

It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD licensed.
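For reference, here is a minimal sketch of that pipeline, assuming a current mwlib and its simpleparse helper (mentioned in the comments below); the element names are what the API's XML format normally returns, but verify them against the actual response you get back:

import urllib.request
import xml.etree.ElementTree as ET
from mwlib.uparser import simpleparse

# Ask the API for the raw wikitext of the article, as an XML response.
url = ("https://en.wikipedia.org/w/api.php?action=query&prop=revisions"
       "&titles=LeBron%20James&rvprop=content&redirects=true&format=xml")
req = urllib.request.Request(url, headers={"User-Agent": "example-script/0.1"})
with urllib.request.urlopen(req) as resp:
    tree = ET.parse(resp)

# The wikitext is the text content of the <rev> element.
wikitext = tree.find(".//rev").text

# Hand the markup to mwlib; simpleparse returns its parse-tree (Article) object.
article = simpleparse(wikitext)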

Krissy answered 12/8, 2010 at 1:26 Comment(3)
Thanks for the help. I tried the mwlib tutorial in the link you gave me; however, I am not sure how to manipulate the Article object that's returned by simpleparse. For example, how would I rebuild all of the data into XML format with the appropriate titles?Rainey
@quantCode I haven't honestly looked at the state of these tools in recent years, but a quick check on the project's GitHub repo shows that mwlib still gets regular, if infrequent, updates. It's probably still worth a look if you're planning on doing something in this space.Krissy
The module seems up to date; however, there seems to be no documentation at all for the Python bindings. The standard docs on Read the Docs are mainly about installation and the command-line utility, which seems to be the main use case for the maintainers, but again, as I said, nothing about the Python API.Kaenel

I described how to do this using a combination of pywikibot and mwparserfromhell in this post (don't have enough reputation yet to flag as a duplicate).

In [1]: import mwparserfromhell

In [2]: import pywikibot

In [3]: enwp = pywikibot.Site('en','wikipedia')

In [4]: page = pywikibot.Page(enwp, 'Waking Life')            

In [5]: wikitext = page.get()               

In [6]: wikicode = mwparserfromhell.parse(wikitext)

In [7]: templates = wikicode.filter_templates()

In [8]: templates?
Type:       list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name           = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length:     31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items

In [10]: templates[:2]
Out[10]: 
[u'{{Use mdy dates|date=September 2012}}',
 u"{{Infobox film\n| name           = Waking Life\n| image          = Waking-Life-Poster.jpg\n| image_size     = 220px\n| alt            =\n| caption        = Theatrical release poster\n| director       = [[Richard Linklater]]\n| producer       = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer         = Richard Linklater\n| starring       = [[Wiley Wiggins]]\n| music          = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing        = Sandra Adair\n| studio         = [[Thousand Words]]\n| distributor    = [[Fox Searchlight Pictures]]\n| released       = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime        = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country        = United States\n| language       = English\n| budget         =\n| gross          = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]

In [11]: infobox_film = templates[1]

In [12]: for param in infobox_film.params:
             print param.name, param.value

 name             Waking Life

 image            Waking-Life-Poster.jpg

 image_size       220px

 alt             

 caption          Theatrical release poster

 director         [[Richard Linklater]]

 producer         [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West

 writer           Richard Linklater

 starring         [[Wiley Wiggins]]

 music            Glover Gill

 cinematography   Richard Linklater<br />[[Tommy Pallotta]]

 editing          Sandra Adair

 studio           [[Thousand Words]]

 distributor      [[Fox Searchlight Pictures]]

 released         {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}

 runtime          101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>

 country          United States

 language         English

 budget          

 gross            $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>

Don't forget that params are mwparserfromhell objects too!
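To get something close to the mapping asked for in the question, the parameters from the session above can be flattened into a plain dict (a small sketch; strip_code() drops the link and template markup from the values):

# Map each infobox parameter name to its plain-text value.
fields = {str(p.name).strip(): p.value.strip_code().strip()
          for p in infobox_film.params}

print(fields["director"])   # -> Richard Linklater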

Veneering answered 16/1, 2014 at 19:35 Comment(0)

Just stumbled over a library on PyPI, wikidump, that claims to provide

Tools to manipulate and extract data from wikipedia dumps

I haven't used it yet, so you are on your own to try it...

Hippomenes answered 12/8, 2010 at 16:32 Comment(1)
Doesn't seem to work well; it requires Python 2 and looks antiquated. The last update was a couple of years ago.Kaenel

I know the question is old, but I was also searching for a library that parses the Wikipedia XML dump. The suggested libraries, wikidump and mwlib, don't offer much code documentation. I then found MediaWiki-utilities, which has some code documentation at http://pythonhosted.org/mediawiki-utilities/.
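Its dump-reading API looks roughly like the following; this is an untested sketch based on the linked documentation, so treat the names as assumptions to double-check there:

from mw.xml_dump import Iterator

# Build an iterator over a dump file without loading it all into memory.
dump = Iterator.from_file(open("enwiki-pages-articles.xml"))

for page in dump:            # pages in the dump
    for revision in page:    # revisions of each page
        print(page.title, revision.id)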

Eidolon answered 12/3, 2015 at 20:25 Comment(1)
Well, Python 3 is the standard now anyway.Kaenel

WikiExtractor appears to be a clean, simple, and efficient way to do this in Python today: https://github.com/attardi/wikiextractor

It provides an easy way to parse a Wikipedia dump into a simple file structure like so:

<doc>...</doc>
<doc>...</doc>
...
<doc>...</doc>

...where each doc looks like:

<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>
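Because that output is a stream of <doc> blocks rather than one well-formed XML document, a simple regex pass is usually enough to read it back in Python. A sketch, assuming a default-style output directory (the exact <doc> attributes and layout depend on the options you run WikiExtractor with):

import re
from pathlib import Path

DOC_RE = re.compile(r"<doc\b[^>]*>\n(.*?)\n</doc>", re.S)

for part in Path("text").rglob("wiki_*"):           # hypothetical output dir
    for body in DOC_RE.findall(part.read_text(encoding="utf-8")):
        title, _, text = body.partition("\n")       # first line is the title
        print(title, len(text), "chars")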
Crepe answered 26/10, 2016 at 3:37 Comment(0)

I know this is an old question, but here is a great script that reads the wiki dump XML and outputs a very nice CSV:

PyPI: https://pypi.org/project/wiki-dump-parser/

GitHub: https://github.com/Grasia/wiki-scripts/tree/master/wiki_dump_parser
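If I recall the project README correctly, usage from Python is just a couple of lines; the xml_to_csv name below is from memory of that page, so confirm it against the links above:

import wiki_dump_parser as parser

# Convert the dump into a CSV file (function name per the README).
parser.xml_to_csv("dump.xml")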

Abbe answered 9/6, 2020 at 17:54 Comment(0)

There's some information on Python and XML libraries here.

If you're asking whether there's an existing library designed to parse Wiki(pedia) XML specifically and match your requirements, that's doubtful. However, you can use one of the existing libraries to traverse the DOM and pull out the data you need.

Another option is to write an XSLT stylesheet that does something similar and call it using lxml. This also lets you make calls to Python functions from inside the XSLT, so you get the best of both worlds.
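A bare-bones sketch of that route: the stylesheet below is a stand-in that only pulls the revision text out of a saved API response (the file name is hypothetical); a real stylesheet would select the individual infobox fields you need:

from lxml import etree

xslt = etree.XML("""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="//rev"/>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)              # compile the stylesheet
doc = etree.parse("lebron_james.xml")     # hypothetical saved API response
print(str(transform(doc)))                # text produced by the transform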

Foggia answered 11/8, 2010 at 23:19 Comment(1)
Sheesh, why the downvote. If your answer is better, let it stand for itself - mine wasn't flat-out wrong.Foggia

You're probably looking for Pywikipediabot (now Pywikibot) for working with the Wikipedia API.

Equidistance answered 11/9, 2010 at 17:44 Comment(0)

I would say look at using Beautiful Soup and just get the Wikipedia page in HTML instead of using the API.

I'll try and post an example.

Cohbert answered 11/8, 2010 at 23:23 Comment(2)
I know this is an old question, but to anyone that stumbles across this, absolutely do NOT do this. The entire reason Wikipedia offers an API is so that they can efficiently return the raw data users need. Scraping causes completely unnecessary stress on servers by invoking the rendering engines and by returning all article content. APIs bypass rendering and can be used to pull just the subset of data that the user actually needs (e.g., just a single section). Scraping should always be used as a last resort (i.e., if a site doesn't offer an API).Fewer
And even if the HTML would reveal the underlying structure perfectly, you would still have to understand the concept of templates, disambiguation pages, redirects, etc. Better to process the source where this is in plain sight with a reasonably semantically based markup.Archegonium
