Extract related articles in different languages using Wikidata Toolkit
I'm trying to extract interlanguage related articles from the Wikidata dump. After searching on the internet, I found a tool named Wikidata Toolkit that helps to work with this type of data, but there is no information about how to find related articles in different languages. For example, the English article "Dresden" is related to the Italian article "Dresda": the second one is the translated version of the first. I tried to use the toolkit, but I couldn't find a solution. Please give an example of how to find these related articles.

Hornmad answered 22/1, 2018 at 17:7 Comment(5)
Ideas: #48333327 – Ideography
Thank you Stanislav. I need to investigate the full set of English Wikipedia articles (with their content) and their Spanish translated versions. Do you know how to extract these articles and their translated versions using Wikidata Toolkit? Could you please point me to the Wikidata Toolkit methods related to extracting these interlingual related articles? – Hornmad
See the example file SitelinksExample.java. – Klein
Thanks @Tgr. But this example doesn't extract interlanguage articles :( – Hornmad
Well, no, it's the Wikidata Toolkit. Wikidata does not contain those articles. But the toolkit tells you what the articles are. – Klein

You can use the Wikidata dump [1] to get a mapping of articles among Wikipedias in multiple languages.

For example, if you look at the Wikidata entry for Respiratory system [2], at the bottom you can see all the articles referring to the same topic in other languages.

That mapping is available in the Wikidata dump. Just download the Wikidata dump, extract the mapping, and then get the corresponding text from the Wikipedia dump. You might encounter some other issues, like resolving Wikipedia redirects.

[1] https://dumps.wikimedia.org/wikidatawiki/entities/
[2] https://www.wikidata.org/wiki/Q7891
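The mapping step described above can be sketched in Python. In the Wikidata JSON dump, each line is one entity, and its `sitelinks` object lists the article title on every wiki. The entity below is a trimmed, hypothetical excerpt modeled on that format (the real Dresden item, Q1731, has many more sitelinks):

```python
import json

# One line of the Wikidata JSON dump (one entity per line).
# Trimmed sample modeled on the real dump structure.
entity_line = '''
{"id": "Q1731", "type": "item",
 "sitelinks": {
   "enwiki": {"site": "enwiki", "title": "Dresden"},
   "itwiki": {"site": "itwiki", "title": "Dresda"},
   "eswiki": {"site": "eswiki", "title": "Dresde"}
 }}
'''

def sitelink_map(entity_json: str) -> dict:
    """Map each wiki id (e.g. 'enwiki') to its article title for one entity."""
    entity = json.loads(entity_json)
    return {site: link["title"]
            for site, link in entity.get("sitelinks", {}).items()}

links = sitelink_map(entity_line)
print(links["enwiki"], "->", links["itwiki"])  # Dresden -> Dresda
```

Running this over every line of the dump yields the full cross-language article mapping; the titles can then be used to look up the article text in the per-language Wikipedia dumps.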

Ruthieruthless answered 5/2, 2018 at 4:59 Comment(7)
Thanks @David. Does Wikidata Toolkit give me the content (text) of each related article, or should I write code myself to extract those? The dump file is huge and it's difficult for me to download and analyze it. – Hornmad
Could you please give me the address of the Wikipedia dump? I can't find any Wikipedia dump. Apparently it's part of the Wikimedia project and I don't know which file I should download. Thank you. – Hornmad
I think Wikidata might contain the English abstract, but definitely not the text for all languages. – Ruthieruthless
@Hornmad What you can do is use a project like github.com/idio/json-wikipedia to generate a JSON Wikipedia for the languages that you need. – Ruthieruthless
@Hornmad With respect to the dumps: you can find them at dumps.wikimedia.org/backup-index.html. For example, enwiki-20180201-pages-articles-multistream.xml.bz2 is the English Wikipedia; eswiki: pages-articles would be the Spanish Wikipedia, and so on. – Ruthieruthless
Thank you @David. The GitHub link might be useful. Another question: what is the difference between a standard article and a redirect? – Hornmad
@Hornmad A Wikipedia article can have many aliases, for example en.wikipedia.org/wiki/New_York_City and en.wikipedia.org/wiki/NYC. A redirect maps aliases to their canonical name. The canonical name can change between Wikipedia dumps; that is one of the motivations for Wikidata to use more abstract names for topics, like Q123. – Ruthieruthless
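The redirect resolution discussed in the comments can be sketched as a lookup-table walk. The titles below are the real Wikipedia examples from the comment, but the dict itself is an illustrative stand-in for a table you would parse out of the dump (redirect pages in the pages-articles XML carry a redirect target):

```python
# Hypothetical redirect table extracted from a Wikipedia dump.
# Maps alias titles to the page they redirect to.
redirects = {
    "NYC": "New York City",
    "New York, New York": "New York City",
}

def resolve(title: str, redirects: dict, max_hops: int = 10) -> str:
    """Follow redirects until a canonical (non-redirect) title is reached.

    Tracks visited titles so a redirect cycle cannot loop forever.
    """
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title

print(resolve("NYC", redirects))  # New York City
```

Normalizing every sitelink title through such a table before matching it against the Wikipedia dump avoids missing articles that are stored under their canonical name.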
