I downloaded a Wikipedia dump and want to convert the wiki markup into my own object format. Is there a wiki parser available that converts the markup into XML?
See java-wikipedia-parser. I have never used it, but according to the docs:
The parser comes with an HTML generator. You can however control the output that is being generated by passing your own implementation of the
be.devijver.wikipedia.Visitor
interface.
I do not know exactly what the XML format of a Wikipedia dump looks like. But if part of the text is in Wikipedia markup, I suggest looking at http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html. This is one of the classes in a Wikipedia package for Apache Lucene. I haven't used it, but Apache Lucene is quite a mature project, so its package (experimental, in this case) is worth a try.
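For reference, a dump file is one large XML document in which each <page> element carries a <title> and, inside <revision>, a <text> element holding the raw wiki markup. Below is a minimal sketch, using only the JDK's StAX API, of streaming through such a file and handing each title and its markup to a callback, which is where one of the parsers mentioned in this thread could be plugged in. The class name DumpReader and the method readPages are made up for illustration and are not part of any library; the sketch also assumes one revision per page (with a full-history dump, the last revision seen wins).

```java
import java.io.Reader;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library mentioned in this thread.
final class DumpReader {

    private DumpReader() {
    }

    /**
     * Streams through a MediaWiki XML dump and calls handler(title, wikitext)
     * once per <page>. Pages are processed one at a time, so the whole dump
     * is never held in memory.
     */
    static void readPages(Reader in, BiConsumer<String, String> handler)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Dumps carry no DTD; switching DTD support off is a defensive default.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader xml = factory.createXMLStreamReader(in);
        try {
            String title = null;
            String wikitext = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("title".equals(name)) {
                        // getElementText() reads up to the matching end tag.
                        title = xml.getElementText();
                    } else if ("text".equals(name)) {
                        wikitext = xml.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(xml.getLocalName())) {
                    // End of a page: hand over what was collected and reset.
                    handler.accept(title, wikitext);
                    title = null;
                    wikitext = null;
                }
            }
        } finally {
            xml.close();
        }
    }
}
```

To use it, wrap the dump file in a Reader (for example via java.nio.file.Files.newBufferedReader) and pass a lambda that feeds the markup to whichever wiki parser you choose.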
The JWPL parser analyzes the structure of a text with MediaWiki markup and represents it as a Java object. This allows for structured access to the contents of e.g. Wikipedia or Wiktionary. There is no standalone release of the parser, as it is part of the JWPL Wikipedia API release. However, it can be used perfectly without accessing Wikipedia with JWPL.
This might help: a page with converters from MediaWiki to other formats, including DocBook. DocBook is a standard XML-based format that might fit your needs (an XML representation of MediaWiki content).
You can use a wide range of tools to parse your content. Most scripting languages have modules for this. For example, Perl has Text::Markup::Trac, which is the Trac wiki syntax parser for Text::Markup. It generates an HTML file.
You could try wikiprep. It's a Perl Wikipedia parser; check its page.
It outputs many files, including:
1- Wikipedia parsed into XML
2- a cat-hier file, which contains the Wikipedia category hierarchy
I've tried it and it's very useful. Its only problem is that it needs a lot of memory for processing, most probably more than 4 GB of RAM. You can also download a preprocessed XML version from here, which is also available on the page.