I think you'll need to do one of the following:
- a) parse the already existing list of English words in the Wiktionary index pages, which were extracted from a database dump.
- b) download the database dump itself (not only the titles) and extract the terms yourself.
I only tried option a), because option b) would mean a download of several GB.
It's very simple. In fact, here is a quick JS implementation that you can use as a base for your own script in your preferred language:
var baseURL = "http://en.wiktionary.org/wiki/Index:English/";
var letters = "abcdefghijklmnopqrstuvwxyz".split("");
// Requires jQuery; run it from a page on en.wiktionary.org
// so the AJAX requests are same-origin.
for (var i = 0; i < letters.length; i++) {
    var letter = letters[i];
    console.log(letter);
    $.get(baseURL + letter, function (response) {
        $(response).find("ol li a").each(function (k, v) {
            console.log(v.text);
        });
    });
}
EDIT
I was quite curious about the subject myself, so I wrote a Python script. Just in case somebody finds it useful:
from lxml.cssselect import CSSSelector
from lxml.html import fromstring
import urllib2

url = 'http://en.wiktionary.org/wiki/Index:English/'
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
# Compile the selector once; it matches the word links in the index lists.
sel = CSSSelector("ol li a")
for l in letters:
    req = urllib2.Request(url + l, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    response = con.read()
    h = fromstring(response)
    for x in sel(h):
        print x.text.encode('utf-8')
I'd paste the results to Pastebin myself, but the 500 KB limit won't let me.
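Since `urllib2` is Python 2 only, a rough Python 3 equivalent of the script above might look like the sketch below. It uses only the standard library (`html.parser` in place of lxml's CSS selectors); note I haven't tested it against the live site, since the Index pages have since been removed:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class IndexWordParser(HTMLParser):
    """Collects the text of <a> tags that appear inside <ol><li> lists."""

    def __init__(self):
        super().__init__()
        self.ol_depth = 0   # how many <ol> elements we are inside
        self.in_li = False
        self.in_a = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == 'ol':
            self.ol_depth += 1
        elif tag == 'li' and self.ol_depth:
            self.in_li = True
        elif tag == 'a' and self.in_li:
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == 'ol':
            self.ol_depth -= 1
        elif tag == 'li':
            self.in_li = False
        elif tag == 'a':
            self.in_a = False

    def handle_data(self, data):
        if self.in_a and data.strip():
            self.words.append(data.strip())


def words_for_letter(letter):
    """Fetch one index page and return the words listed on it."""
    url = 'http://en.wiktionary.org/wiki/Index:English/' + letter
    req = Request(url, headers={'User-Agent': 'Magic Browser'})
    html = urlopen(req).read().decode('utf-8')
    parser = IndexWordParser()
    parser.feed(html)
    return parser.words
```

Splitting the parsing into its own class means you can test it on a saved HTML file without hitting the network at all.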
(Comment) The page (Index:English) is gone now. Similar data remains available at en.wiktionary.org/wiki/Category:English_lemmas. That page is paginated, though, so scraping it will require code a little more complex than the scripts in this answer. – Nonperformance
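For the paginated category page mentioned in that comment, the MediaWiki API is easier to work with than scraping the HTML: `action=query&list=categorymembers` returns members in batches and hands back a `continue` token for the next batch. A minimal sketch (the injectable `fetch` parameter is my own convenience, so the paging logic can be exercised without network access):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = 'https://en.wiktionary.org/w/api.php'


def fetch_json(params):
    """Perform one real API request and decode the JSON response."""
    req = Request(API + '?' + urlencode(params),
                  headers={'User-Agent': 'Magic Browser'})
    return json.load(urlopen(req))


def category_members(category, fetch=fetch_json):
    """Yield page titles from a category, following API continuation."""
    params = {'action': 'query', 'list': 'categorymembers',
              'cmtitle': 'Category:' + category,
              'cmlimit': '500', 'format': 'json'}
    while True:
        data = fetch(params)
        for member in data['query']['categorymembers']:
            yield member['title']
        cont = data.get('continue')
        if not cont:
            break
        # Merge the continuation tokens into the next request.
        params.update(cont)
```

Usage would be something like `for title in category_members('English_lemmas'): print(title)`.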