Using the MediaWiki API to pull text from a Wikia page returns a big mess. Is there a better way to pull the text from each section?

3

5

I am developing an Android app that pulls information from a Wikia page and displays it in the app. I currently pull all categories for navigation and display each page in a WebView, but I would like to pull just the content and format it myself instead of cheapening it by passing it to a WebView.

What I am using to get the text is: http://scottlandminecraft.wikia.com/api.php?format=xml&action=query&titles=ZackScott&prop=revisions&rvprop=content

My problem is that the text comes back in a big clump. Does anyone have any ideas on how to get it in a more structured form so I could parse it by tags, or am I wasting my time trying to find that? If so, would it be better to parse the text I need using identifiers in the text this returns, or is there a better way?

Thank you for your input and time.

Kaitlin answered 28/3, 2013 at 13:9 Comment(3)
I don't see what you call a "big clump". It's an XML document containing the wikitext of the page, which is just what your API call requests. What data are you after, the rendered HTML? – Deplore
The "big clump" I was referring to is the mass of text that I get when I do this. It contains all of the text that I want from the page, but it's not very organized. I wasn't sure whether there is a better way to pull the text that would make it easier to parse as XML, or whether I should go with another format and parse that instead; the others who have posted here have given me excellent options for parsing from HTML. – Kaitlin
Do you want the wikisyntax parse tree? Do you want the plain wikitext, not wrapped in XML? – Deplore
11

The easiest way, if you don't want to parse the wiki markup yourself, is to retrieve the parsed HTML version of the page and then process it using an HTML parser (like jsoup, as recommended by Hasham).

Besides just scraping the normal wiki user interface (which will give you the page HTML wrapped in the navigation skin), there are two ways of getting the HTML text of a MediaWiki page:

  1. use the API with action=parse, which will return the page HTML wrapped in a MediaWiki API XML (or JSON / YAML / etc.) response, like this:

  2. or use the main index.php script with action=render, which will return just the page HTML:
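
As a rough sketch, the two request URLs could be built in Java like this. The base URL and page title are taken from the question; the `WikiUrls` class and its method names are made up for illustration, and the sketch assumes `index.php` sits next to `api.php` at the wiki root, which may not hold for every wiki.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class WikiUrls {
    // Base URL from the question; change it for another wiki.
    static final String BASE = "http://scottlandminecraft.wikia.com";

    /** Percent-encodes a page title for use as a query parameter value. */
    static String encode(String title) {
        try {
            return URLEncoder.encode(title, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    /** Option 1: API response (XML here) that wraps the rendered page HTML. */
    static String parseUrl(String title) {
        return BASE + "/api.php?format=xml&action=parse&page=" + encode(title);
    }

    /** Option 2: just the rendered page HTML (assumes index.php is at the root). */
    static String renderUrl(String title) {
        return BASE + "/index.php?action=render&title=" + encode(title);
    }

    public static void main(String[] args) {
        System.out.println(parseUrl("ZackScott"));
        System.out.println(renderUrl("ZackScott"));
    }
}
```

Either URL can then be fetched with whatever HTTP client the app already uses, and the returned HTML handed to an HTML parser.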

P.S. Since you mention sections in your question, let me note that the action=parse API module can return information about the sections on the page using prop=sections (or even prop=sections|text). For an example, see this API query:
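
A minimal sketch of reading the section list from such a response with the XML parser that ships with Java (and Android). It assumes that with format=xml each section arrives as an `<s>` element carrying `index` and `line` attributes; check that against a real response from your wiki. The `SectionList` class is hypothetical and the sample XML is hand-written for illustration, not actual output from the wiki.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SectionList {
    /** Returns "index: heading" for every <s> element found in the response. */
    static List<String> headings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Assumed shape: one <s> element per section.
            NodeList nodes = doc.getElementsByTagName("s");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element s = (Element) nodes.item(i);
                result.add(s.getAttribute("index") + ": " + s.getAttribute("line"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse API response", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample, not real output from the wiki.
        String sample = "<api><parse title=\"ZackScott\"><sections>"
                + "<s toclevel=\"1\" level=\"2\" line=\"History\" number=\"1\" index=\"1\"/>"
                + "<s toclevel=\"1\" level=\"2\" line=\"Trivia\" number=\"2\" index=\"2\"/>"
                + "</sections></parse></api>";
        System.out.println(headings(sample));
    }
}
```

The `index` value should be usable as the `section` parameter of a later action=parse request, so that only one section's HTML is fetched at a time.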

Voroshilov answered 28/3, 2013 at 18:11 Comment(1)
Your solution is better than mine. – Catamenia
3

The content is formatted using wiki syntax. You can render it in HTML using a Java engine called Bliki.

http://code.google.com/p/gwtwiki/

http://code.google.com/p/gwtwiki/wiki/Mediawiki2HTML

Bliki is not designed for Android, so you would need to compile it for Android yourself. It seems it can be done:

https://groups.google.com/forum/?fromgroups=#!topic/bliki/LNsmnEEZEV4

Catamenia answered 28/3, 2013 at 13:18 Comment(0)
1

If you want to parse the HTML document, then jsoup is the way to go.

Markland answered 28/3, 2013 at 13:35 Comment(4)
There is no HTML document at scottlandminecraft.wikia.com/… – Deplore
It's XML; you can parse it with jsoup. – Markland
No. For XML you do not use an HTML parser. – Deplore
@Bergi: jsoup is actually more of a toolbox than a parser, and some of its tools are really useful for processing XML. – Catamenia
