How to parse a Wiktionary API response?

I can't find many online resources that demonstrate how to parse a Wiktionary API response that looks like this:

{
    "query": {
        "pages": {
            "40915": {
                "pageid": 40915,
                "ns": 0,
                "title": "reluctant",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "==English==\n\n===Etymology===\nFrom {{etyl|la|en}} {{term|lang=la|reluctans}}, present participle of {{term|reluctare}}, {{term|reluctari||to struggle against, oppose, resist}}, from {{term|re-||back}} + {{term|luctari||to struggle}}.\n\n===Pronunciation===\n* {{IPA|/ɹɪˈlʌktənt/}}\n* {{audio|en-us-reluctant.ogg|Audio (US)}}\n\n===Adjective===\n{{en-adj}}\n\n# {{context|now|_|rare|lang=en}} [[opposing|Opposing]]; offering [[resistance]] (to).\n#* '''1819''', Lord Byron, ''Don Juan'', II.108:\n#*: There, breathless, with his digging nails he clung / Fast to the sand, lest the returning wave, / From whose '''reluctant''' roar his life he wrung, / Should suck him back to her insatiate grave [...].\n#* '''2008''', Kern Alexander et al., ''The World Trade Organization and Trade in Services'', p. 222:\n#*: They are '''reluctant''' to the inclusion of a necessity test, especially of a horizontal nature, and emphasize, instead, the importance of procedural disciplines [...].\n# Not [[wanting]] to take some [[action]]; [[unwilling]].\n#: ''She was '''reluctant''' to lend him the money''\n\n====Synonyms====\n* [[unwilling]], [[disinclined]]\n\n====Translations====\n{{trans-top|not wanting to take some action}}\n* Chinese: \n*: Mandarin: {{t|cmn|不情願|sc=Hani}}, {{t+|cmn|不情愿|tr=bùqíngyuàn|sc=Hani}}\n* Czech: {{t|cs|neochotný}}, {{t|cs|zdráhající}} se\n* Dutch: {{t+|nl|aarzelend}}\n* Finnish: {{t+|fi|haluton}}, {{t+|fi|vastahakoinen}}\n* French: {{t+|fr|réservé}},  {{t+|fr|réfractaire}},  {{t+|fr|rétif}}\n* German: {{t|de|zögernd}}\n* Hungarian: {{t|hu|kelletlen}}\n* Indonesian: {{t+|id|enggan}}\n* Interlingua: [[reluctante]]\n* Italian: {{t+|it|riluttante}}\n{{trans-mid}}\n* Latin: {{t|la|invītus}}\n* Manx: {{t|gv|neuarryltagh}}, {{t|gv|neuwooiagh}}\n* Maori: {{t|mi|whakawhēuaua}}, {{t|mi|manauhea}}\n* Polish: [[niechętny]]\n* Romanian: reticent, precaut, {{t|ro|prevăzător}}\n* Russian: {{t+|ru|неохотный|tr=neoxótnyj}}\n* Scots: {{t|sco|sweer}}, 
{{t|sco|sweirt}}, {{t|sco|laith}}\n* Scottish Gaelic: {{t|gd|aindeònach}}, {{t|gd|leisg}}\n* Spanish: {{t+|es|renuente}}, {{t|es|reacio}}\n* Swedish: {{t|sv|motvillig}}\n{{trans-bottom}}\n\n====Related terms====\n* [[reluctance]]\n* [[reluctantly]]\n\n===External links===\n* {{R:Webster 1913}}\n* {{R:Century 1911}}\n* {{R:OneLook}}\n\n[[ca:reluctant]]\n[[cy:reluctant]]\n[[et:reluctant]]\n[[el:reluctant]]\n[[es:reluctant]]\n[[fr:reluctant]]\n[[ko:reluctant]]\n[[io:reluctant]]\n[[kn:reluctant]]\n[[ku:reluctant]]\n[[hu:reluctant]]\n[[mg:reluctant]]\n[[ml:reluctant]]\n[[my:reluctant]]\n[[nl:reluctant]]\n[[pl:reluctant]]\n[[pt:reluctant]]\n[[simple:reluctant]]\n[[fi:reluctant]]\n[[sv:reluctant]]\n[[ta:reluctant]]\n[[te:reluctant]]\n[[th:reluctant]]\n[[vi:reluctant]]\n[[zh:reluctant]]"
                    }
                ]
            }
        }
    }
}

All I want is the English definition, but the response format is odd: everything about the word is jumbled into one large wikitext blob.

  1. Is there a way to have the API return structured JSON, where the English definition is simply the value of a JSON key?
  2. Would I have to resort to a regex pattern to do this, and what might that look like?
  3. Lastly, why would the API designers return data like this? I'm tempted to say they have no idea what they're doing, but surely there must be a reason.
Inhabit answered 2/12, 2013 at 20:39 Comment(5)
The obvious answer to why the API doesn't break the page down into definitions is that it's a generic MediaWiki API, not a Wiktionary API. It doesn't know anything about the structure of the page, which is just a set of conventions followed by Wiktionary contributors, not a formally specified, machine-parseable standard. - Grudge
I'm not affiliated with Wiktionary (though we have parsed their data in our project), so I can only assume the reason for the structure is that they use a normal MediaWiki as the foundation, which does not provide a "dictionary style" structure. In our project we parsed the database dump using a combination of String#indexOf, #substring, etc. and a bunch of regular expressions. Terrible code and a maintenance nightmare. - Streamway
mediawiki.org/wiki/Alternative_parsers looks like a good place to start for parsing the wikitext. The final step of deciding how the wiki syntax tree maps onto dictionary definitions will be up to you, though. - Grudge
The Wiktionary-l mailing list has the experience of many people on how they did this. - Radke
Possible duplicate of "Has anyone parsed Wiktionary?" - Radke
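
The regex-based approach described in the comments above can be sketched roughly as follows. This is a minimal illustration, not a robust parser: it assumes the layout conventions visible in the sample response (an ==English== level-2 heading, definitions on lines starting with "# ", quotations and usage examples on "#*" and "#:" lines). The function name english_definitions is made up for this example, and since these are only contributor conventions, real pages may break it.

```python
import re


def english_definitions(wikitext: str) -> list:
    """Return the top-level definition lines of the ==English== section.

    Assumes Wiktionary's usual (but not formally specified) conventions.
    """
    # Isolate the English section: everything after "==English==" up to
    # the next level-2 heading ("==Language==") or the end of the text.
    section = re.search(
        r"^==English==[ \t]*$(.*?)(?=^==[^=]|\Z)", wikitext, re.M | re.S
    )
    if section is None:
        return []

    definitions = []
    for line in section.group(1).splitlines():
        # Keep "# ..." lines only; "#*", "#:" and "##" lines are
        # quotations, usage examples, and sub-senses.
        if not re.match(r"#\s*[^*:#\s]", line):
            continue
        text = line.lstrip("#").strip()
        # Drop (non-nested) templates such as {{context|...}}.
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
        # Turn [[target|label]] into "label" and [[target]] into "target".
        text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
        # Strip bold/italic quote markup ('' and ''').
        text = re.sub(r"'{2,}", "", text)
        definitions.append(text.strip())
    return definitions
```

Given the wikitext from the question (the "*" field of the first revision), this should return the two adjective senses as plain strings. Dropping templates also discards labels such as {{context|now|rare}}, so treat the output as a rough approximation.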

Use the extracts property (prop=extracts) to get an HTML version of the page:

https://en.wiktionary.org/w/api.php?titles=cloud&action=query&prop=extracts&format=json
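
A minimal sketch of calling that URL from Python and pulling the HTML out of the response. It assumes the response keeps the query/pages nesting shown in the question, with the HTML under an "extract" key for each page. The helper names and the User-Agent string are placeholders invented for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build the same query as the URL above for an arbitrary page title."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extracts_from_response(data: dict) -> dict:
    """Map each page title to its HTML extract ("" if none was returned)."""
    # Assumption: pages are keyed by page id under query.pages.
    pages = data.get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}


def fetch_extracts(title: str) -> dict:
    """Perform the HTTP request (needs network access)."""
    request = urllib.request.Request(
        build_extract_url(title),
        # Placeholder User-Agent; replace with something that identifies you.
        headers={"User-Agent": "wiktionary-example/0.1 (example script)"},
    )
    with urllib.request.urlopen(request) as response:
        return extracts_from_response(json.load(response))
```

Calling fetch_extracts("cloud") should give a dict with the HTML for "cloud". You still have to parse that HTML yourself to isolate the English definitions.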

Bozen answered 11/8, 2016 at 8:8 Comment(1)
This used to be the best way, but the responses recently changed and no longer include headers and subheaders. For example, if a word has meanings in several languages, those used to be separated by headers; the current response omits them and returns a single block of text for all the languages. - Fortyfive
