How may Wiktionary's API be used to determine whether or not a word exists?
The Wiktionary API can be used to query whether or not a word exists.
Examples for existing and non-existing pages:
http://en.wiktionary.org/w/api.php?action=query&titles=test
http://en.wiktionary.org/w/api.php?action=query&titles=testx
The first link also shows examples of other output formats that might be easier to parse.
To retrieve the word's data in a small XHTML format (should more than existence be required), request the printable version of the page:
http://en.wiktionary.org/w/index.php?title=test&printable=yes
http://en.wiktionary.org/w/index.php?title=testx&printable=yes
These can then be parsed with any standard XML parser.
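A minimal Python sketch of the existence check, assuming the JSON output (`format=json` with `formatversion=2`), in which a page that does not exist carries a `missing` flag. The function names are illustrative and not part of any library.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_query_url(title):
    """Build an action=query URL for one title, asking for JSON output."""
    params = {
        "action": "query",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def page_exists(response):
    """Return True if the decoded query response describes an existing page.

    Assumes formatversion=2, where "pages" is a list and a page that does
    not exist carries "missing": true.
    """
    pages = response.get("query", {}).get("pages", [])
    return bool(pages) and not any(
        page.get("missing") or page.get("invalid") for page in pages
    )


def fetch_exists(title):
    """Query the live API (network access required)."""
    request = urllib.request.Request(
        build_query_url(title),
        # Hypothetical User-Agent string; replace it with your own.
        headers={"User-Agent": "wiktionary-existence-check-example/0.1"},
    )
    with urllib.request.urlopen(request) as reply:
        return page_exists(json.load(reply))
```

Calling `fetch_exists("test")` and `fetch_exists("testx")` corresponds to the two example links above. Note that this only says a page with that title exists, not that it describes an English word.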
https://en.wiktionary.org/w/?curid=[page_id]&printable=yes redirects to the XHTML page using pageid. – Alcibiades
https://en.wiktionary.org/w/api.php?format=json&action=query&origin=*&export&exportnowrap&titles=test to avoid CORS-related problems. – Margalo
https://en.wiktionary.org/w/api.php?action=query&format=xml&prop=categories&titles=WORDS%7CTO%7CCHECK&clcategories=Category%3AEnglish%20lemmas%7CCategory%3AEnglish%20non-lemma%20forms%7CCategory%3AEnglish%20eye%20dialect. Then, "valid in English" means a result that has the category "English lemmas" or "English non-lemma forms" but doesn't have the category "English eye dialect". However, the set of words meeting these criteria may still be overly broad for many uses. – Golf
2024 UPDATE!
It seems that a new MediaWiki REST API has appeared since I last played with this stuff. And the biggest news is that it includes a method to get definitions from the English Wiktionary!
/page/definition/{term}
Get term definitions based on Wiktionary content. Experimental end point providing term definitions extracted from Wiktionary content. Currently, only English Wiktionary is supported. See this wiki page for background and considerations for further development.
Stability: stable
Please follow wikitech-l or mediawiki-api-announce for announcements of breaking changes.
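A rough Python sketch of how the endpoint might be used. The base URL `https://en.wiktionary.org/api/rest_v1` and the reply shape (an object keyed by language code, each value a list of entries with a `partOfSpeech` field) are assumptions based on my reading of the REST documentation, so check them against a live reply before relying on them. The helper names are illustrative.

```python
import urllib.parse

# Assumed base URL of the REST API on the English Wiktionary.
REST_BASE = "https://en.wiktionary.org/api/rest_v1"


def definition_url(term):
    """Build the URL for /page/definition/{term}, percent-encoding the term."""
    return REST_BASE + "/page/definition/" + urllib.parse.quote(term, safe="")


def has_english_entry(payload):
    """Return True if the decoded JSON reply has at least one entry under "en".

    Assumption: the reply is an object keyed by language code.
    """
    return bool(payload.get("en"))


def parts_of_speech(payload, lang="en"):
    """List the "partOfSpeech" values found for one language code."""
    return [entry.get("partOfSpeech") for entry in payload.get(lang, [])]
```

Fetch `definition_url("test")` with any HTTP client, decode the JSON, and pass it to `has_english_entry`. A term without a page is expected to produce an HTTP error rather than an empty object, so handle that case in the caller.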
Old answer
There are a few caveats in just checking that Wiktionary has a page with the name you are looking for:
Caveat #1: All Wiktionaries, including the English Wiktionary, actually have the goal of including every word in every language. So if you simply use the above API call, you will know that the word you are asking about is a word in at least one language, but not necessarily English: http://en.wiktionary.org/w/api.php?action=query&titles=dicare
Caveat #2: Perhaps a redirect exists from one word to another word. It might be from an alternative spelling, but it might be from an error of some kind. The API call above will not differentiate between a redirect and an article: http://en.wiktionary.org/w/api.php?action=query&titles=profilemetry
Caveat #3: Some Wiktionaries including the English Wiktionary include "common misspellings": http://en.wiktionary.org/w/api.php?action=query&titles=fourty
Caveat #4: Some Wiktionaries allow stub entries which have little or no information about the term. This used to be common on several Wiktionaries but not the English Wiktionary. But it seems to have now spread also to the English Wiktionary: https://en.wiktionary.org/wiki/%E6%99%B6%E7%90%83 (permalink for when the stub is filled so you can still see what a stub looks like: https://en.wiktionary.org/w/index.php?title=%E6%99%B6%E7%90%83&oldid=39757161)
If these are not included in what you want, you will have to load and parse the wikitext itself, which is not a trivial task.
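Caveat #2 is the one case that can be handled without parsing wikitext, because the query API marks redirects when `prop=info` is requested. A small sketch, assuming the reply was requested with `format=json&formatversion=2` (for example `action=query&prop=info&titles=profilemetry&format=json&formatversion=2`); `classify_page` is an illustrative name.

```python
def classify_page(response):
    """Classify the first page of a decoded action=query&prop=info reply.

    Returns "missing", "redirect" or "article". Assumes formatversion=2,
    where the flags appear as "missing": true and "redirect": true.
    """
    pages = response.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing"):
        return "missing"
    if pages[0].get("redirect"):
        return "redirect"
    return "article"
```

The other caveats (non-English entries, misspellings, stubs) are not visible in this reply and still require looking at the page content.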
Add &prop=info to the query and check the response for the redirect attribute. – Brion
You can download a dump of Wiktionary data. There's more information in the FAQ. For your purposes, the definitions dump is probably a better choice than the XML dump.
To keep it really simple, extract the words from the dump like this:
bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
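The same extraction can be sketched in Python, which avoids depending on the shell tools. This is an approximation of the pipeline above, not an exact equivalent: it treats ASCII whitespace and punctuation as the excluded characters, which is roughly what `[[:space:][:punct:]]` covers in the C locale. The function names and file paths are illustrative.

```python
import bz2
import re
import string

TITLE_RE = re.compile(r"<title>(.*)</title>")
# ASCII whitespace and punctuation only; other locales may differ.
EXCLUDED = set(string.whitespace + string.punctuation)


def iter_words(lines):
    """Yield page titles that contain no whitespace or ASCII punctuation."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match is None:
            continue
        title = match.group(1)
        if not EXCLUDED.intersection(title):
            yield title


def extract_words(dump_path, out_path):
    """Stream the compressed dump and write one title per line."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for word in iter_words(dump):
            out.write(word + "\n")
```

For example, `extract_words("pages-articles.xml.bz2", "words")` writes the same kind of word list as the shell command. Titles with a namespace prefix or spaces are skipped because they contain `:` or whitespace.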
The dump file is LANGwiktionary-DATE-pages-articles.xml.bz2. Go to link, then click LANGwiktionary (LANG e.g. 'en', 'de'...). – Crissie
bzcat pages-articles.xml.bz2 | grep '<title>\(.*\)</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words – Rosenstein
If you are using Python, you can use WiktionaryParser by Suyash Behera.
You can install it with:
pip install wiktionaryparser
Example usage:
from pprint import pprint
from wiktionaryparser import WiktionaryParser
parser = WiktionaryParser()
word = parser.fetch('test')
pprint(word)
another_word = parser.fetch('test', 'french')
pprint(another_word)
# features
parser.set_default_language('french')
parser.exclude_part_of_speech('noun')
parser.include_relation('alternative forms')
You could use the revisions API:
Or the parse API:
https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2
More examples are provided in the documentation.
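A short Python sketch around the parse API URL above, assuming `format=json` is added so that the reply is JSON. With `formatversion=2` the wikitext is expected as a plain string under `parse.wikitext`, and a missing page is reported through a top-level `error` object. The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def parse_url(page):
    """Build the action=parse URL that returns a page's wikitext as JSON."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "wikitext",
        "formatversion": "2",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def wikitext_from_reply(reply):
    """Return the wikitext string, or None if the API reported an error."""
    if "error" in reply:
        return None
    return reply["parse"]["wikitext"]


def fetch_wikitext(page):
    """Fetch the wikitext from the live API (network access required)."""
    with urllib.request.urlopen(parse_url(page)) as response:
        return wikitext_from_reply(json.load(response))
```

`fetch_wikitext("test")` should return the raw page source, which can then be inspected for language sections.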
Add &format=json to the URLs to have a formatted response. – Cleancut
As mentioned earlier, the problem with this approach is that Wiktionary provides information about all the words of all languages. So checking whether a page exists using the Wiktionary API won't work, because there are a lot of pages for non-English words. To overcome this, you need to parse each page to figure out whether there's a section describing the English word. Parsing wikitext isn't a trivial task, though in your case it's not that bad. To cover almost all the cases, you just need to check whether the wikitext contains the English heading. Depending on the programming language you use, you can find some tools to build an AST from wikitext. This will cover most of the cases, but not all of them, because Wiktionary includes some common misspellings.
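The heading check can be sketched in a few lines of Python. It assumes the English section is introduced by a level-2 heading written as `==English==` on its own line, which is the usual Wiktionary layout; `has_english_section` is an illustrative name.

```python
import re

# A level-2 heading: exactly two "=" on each side, alone on its line.
ENGLISH_HEADING = re.compile(r"^==[ \t]*English[ \t]*==[ \t]*$", re.MULTILINE)


def has_english_section(wikitext):
    """Return True if the wikitext contains a level-2 "English" heading."""
    return ENGLISH_HEADING.search(wikitext) is not None
```

Anchoring the pattern to the whole line keeps headings such as `==Old English==` or a level-3 `===English===` from matching. It does not filter out entries that are marked as misspellings inside the English section.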
As an alternative, you could try using Lingua Robot or something similar. Lingua Robot parses the Wiktionary content and provides it as a REST API. A non-empty response means that the word exists. Please note that, as opposed to Wiktionary, the API itself doesn't include any misspellings (at least at the moment of writing this answer). Please also note that Wiktionary contains not only single words but also multi-word expressions.
You might want to try JWKTL out. I just found out about it ;)
Here's a start to parsing etymology and pronunciation data:
function parsePronunciationLine(line) {
  // Example lines:
  // {{a|GA}} {{IPA|/ˈhæpi/|lang=en}}
  // * {{a|RP}} {{IPA|/pliːz/|lang=en}}
  // * {{a|GA}} {{enPR|plēz}}, {{IPA|/pliz/|[pʰliz]|lang=en}}
  let val
  let type
  // replace() is used only to run the callback on a match;
  // a later match overrides an earlier one.
  line.replace(/\{\{\s*a\s*\|UK\s*\}\}\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, (_, $1) => {
    val = $1
    type = 'uk'
  })
  line.replace(/\{\{\s*a\s*\|US\s*\}\}\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, (_, $1) => {
    val = $1
    type = 'us'
  })
  // The literal "|" after the template name is escaped so that it is not
  // treated as regex alternation.
  line.replace(/\{\{enPR\|[^\}]+\}\},?\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, (_, $1) => {
    val = $1
    type = 'us'
  })
  line.replace(/\{\{a\|GA\}\},?\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, (_, $1) => {
    val = $1
    type = 'ga'
  })
  line.replace(/\{\{a\|GA\}\},?.+\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, (_, $1) => {
    val = $1
    type = 'ga'
  })
  if (!val)
    return
  return { val, type }
}
// langs is a lookup table of language codes that is defined elsewhere.
function parseEtymologyPiece(piece) {
  // Example pieces:
  // {{inh|en|enm|poisoun}}
  // {{m|enm|poyson}}
  // {{der|en|la|pōtio|pōtio, pōtiōnis|t=drink, a draught, a poisonous draught, a potion}}
  // {{m|la|pōtō|t=I drink}}
  // {{der|en|enm|happy||fortunate, happy}}
  // {{cog|is|heppinn||lucky}}
  let parts = piece.split('|')
  parts.shift() // The first one (the template name) is ignored.
  let ls = []
  // Up to two language codes may follow the template name.
  if (langs[parts[0]]) {
    ls.push(parts.shift())
  }
  if (langs[parts[0]]) {
    ls.push(parts.shift())
  }
  let l = ls.pop() // Language of the term: the last code seen.
  let t = parts.shift() // The term itself.
  return [ l, t ]
}
Here is a gist with it more fleshed out.
langs? – Petiolate
langs is a few thousand lines, too big for SO. – Haplosis
I created my own open source Wiktionary API project. It is based on the wiktextract data and in general has much more information than the official API, for example IPAs, etymology information, canonical forms of words (describing stress in many languages, for example), and translation tables.
It contains information not only from the English Wiktionary but from 6 different Wiktionaries, and can translate any language pair, for example from French to Czech.
(I currently host it but can't make any guarantees about uptime etc.; anyone can easily self-host it if needed.)