Parsing Wikipedia countries, regions, cities
Is it possible to get a list of all Wikipedia countries, regions and cities, with the relations between them? I couldn't find any API appropriate for this task. What would be the easiest way to parse all the information I need? PS: I know that there are other data sources I could get this information from, but I am interested in Wikipedia...

Unbelievable answered 11/7, 2014 at 11:16 Comment(2)
You should have a look at dbpedia.org. Parsing Wikipedia is anything but trivial. – Dudek
This is a good task for either WikiData or DBPedia. Parsing infoboxes or categories would be a terribly complicated way to reinvent the wheel. – Staceystaci

[2020 update] This is now best done with the Wikidata Query Service: you can run very specific queries with a bit of SPARQL, for example "Find all countries and their label". See Wikidata Query Help.
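As a minimal sketch of how such a SPARQL query would be sent, the snippet below embeds the "all countries with their label" query and builds a GET URL for the public SPARQL endpoint at query.wikidata.org. It only constructs the URL (it does not call the live service), using just Python's standard library.

```python
from urllib.parse import urlencode

# The classic "all countries and their label" query: every entity whose
# "instance of" (P31) is "country" (Q6256), with an English label.
SPARQL = """
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def wdqs_url(query):
    """Build a GET URL for the Wikidata Query Service SPARQL endpoint."""
    return "https://query.wikidata.org/sparql?" + urlencode(
        {"query": query, "format": "json"}
    )

url = wdqs_url(SPARQL)
```

Fetching that URL (e.g. with `urllib.request.urlopen`) returns the result set as JSON; the same query can also be pasted directly into the query.wikidata.org web editor.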


It might be a bit tedious to get the whole graph but you can get most of the data from the experimental/non-official Wikidata Query API.

I suggest the following workflow:

  • Go to an instance of the kind of entities you want to work with, say Estonia (Q191), and look at its instance of (P31) properties; you will find: country, sovereign state, member of the UN, member of the EU, etc.

  • Use the Wikidata Query API claim command to output every entity that has the chosen P31 property. Let's try with country (Q6256):

    http://wdq.wmflabs.org/api?q=claim[31:6256]

It outputs an array of numeric ids: those are your countries! (Notice that the result is still incomplete, as only 141 items are found: either countries are missing from Wikidata or, as suggested by Nemo in the comments, some countries are to be found in subclasses (P279) of country (Q6256).)

  • You may want more than ids, though, so you can ask the official Wikidata API for the entities' data:

    https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr

    (here Canada (Q16) data, in JSON, with only claims and labels, in English and French. Look at the documentation to adapt the parameters to your needs)

You can query multiple entities at a time, with a limit of 50 per request, as follows:

https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr
  • From every country's data, you could look for entities registered as administrative subdivisions (P150) and repeat the process on those new entities.

  • Alternatively, you can get the whole tree of administrative subdivisions with the tree command. For instance, for France (Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that's much harder to refine, given the different kinds of subdivisions you can encounter from one country to another. And avoid running this kind of query from a browser; it might crash.

  • You now just have to find the cities of each country by refining this last query with the claim command and the appropriate subclass (P279) of municipality (Q15284) (all available here): for France, that's commune (Q484170), so your request looks like

    http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]

    then repeat for all the countries: have fun!
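The wdq.wmflabs.org service used above has since been retired; on the current Query Service the same refinement can be expressed in SPARQL. A sketch, where I substitute the defunct tree walk with the country (P17) property (an assumption of mine; a P279-based tree walk via property paths is also possible), and where the country-to-municipality-class table only has the French example from the answer:

```python
# Per-country municipality-level class, as in the answer (France -> commune).
# Extend this table for each country you handle.
CITY_CLASS = {"Q142": "Q484170"}

# Double braces are literal braces once .format() is applied.
TEMPLATE = """
SELECT ?city ?cityLabel WHERE {{
  ?city wdt:P31 wd:{city_class} ;
        wdt:P17 wd:{country} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""

def cities_query(country):
    """SPARQL for every municipality (by P31) located in one country (by P17)."""
    return TEMPLATE.format(country=country, city_class=CITY_CLASS[country])

query = cities_query("Q142")
```

The resulting string can be sent to https://query.wikidata.org/sparql or pasted into the web editor; repeating per country just means looping over the table's keys.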

Calliope answered 9/9, 2014 at 17:13 Comment(2)
Finally I read a well-researched answer on Wikidata. :) On «the data is still incomplete as there are only 141 items found»: I wouldn't say so, because you didn't consider subclasses, i.e. terms more specific than "country" that entities are using. – Spinozism
Indeed! Amending my statement. – Calliope

You should go with Wikidata and/or dbpedia.

Personally, I'd start with Wikidata, as it runs directly on MediaWiki with the same API, so you can reuse similar code. I would use pywikibot to get started. That way you can still request pages from Wikipedia where that makes sense (e.g. list pages or categories).

Here's a nice overview of ways to access Wikidata

Zeta answered 18/8, 2014 at 11:23 Comment(0)
