Easy way to get wiktionary titles only in one language?

Asked 18/3, 2013 at 12:45 Answered 14/6 at 18:15

I can easily get a dump with all the titles in Wiktionary, but this dump contains every word, even non-English ones.

For example, you find souris ("mouse" in French): https://en.wiktionary.org/wiki/souris

Is there an easy way or an existing script to get only the titles in one specific language? I would like to get all the English words from Wiktionary, excluding the titles that do not exist in English.

So far my only idea is to parse the text of every page and check whether it contains a ==English== line, but that is too slow to be usable.

Lilialiliaceous answered 18/3, 2013 at 12:45 Comment(0)

I think you'll need to either:

a) parse the existing list of English words in Wiktionary, which was extracted from a database dump, or
b) download the full database dump (not just the titles) and extract the terms yourself.

I only tried option a), because option b) would mean a download of several GB. It's very simple; here is a quick JS implementation that you can use as a base to write your own script in your preferred language.

// Requires jQuery.
var baseURL = "http://en.wiktionary.org/wiki/Index:English/";
var letters = "abcdefghijklmnopqrstuvwxyz".split("");

for (var i = 0; i < letters.length; i++) {
    var letter = letters[i];
    console.log(letter);
    // Fetch the index page for this letter and print the text of every listed link.
    $.get(baseURL + letter, function (response) {
        $(response).find('ol li a').each(function (k, v) { console.log(v.text); });
    });
}

EDIT: I was quite curious about the subject myself, so I wrote a Python script, in case somebody finds it useful:

from lxml.cssselect import CSSSelector
from lxml.html import fromstring
import urllib.request

url = 'http://en.wiktionary.org/wiki/Index:English/'
letters = 'abcdefghijklmnopqrstuvwxyz'
sel = CSSSelector("ol li a")  # links inside the ordered word lists

for l in letters:
    # Send a User-Agent header with the request
    req = urllib.request.Request(url + l, headers={'User-Agent': "Magic Browser"})
    with urllib.request.urlopen(req) as con:
        response = con.read()
    h = fromstring(response)

    for x in sel(h):
        if x.text:  # skip links that have no direct text
            print(x.text)

I'd paste the results to Pastebin myself, but the 500 KB limit won't let me.
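
For option b), a minimal sketch of extracting the terms from a full dump might look like the following. It assumes the usual MediaWiki XML export layout (page, title, ns and text elements) and that every English entry carries a level-2 ==English== heading on its own line; the names english_titles and local_name are made up for this example. The dump is streamed with iterparse, so the whole file is never loaded at once.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: an English entry has a level-2 "==English==" heading on its own line.
ENGLISH_HEADING = re.compile(r"^==\s*English\s*==\s*$", re.MULTILINE)


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def english_titles(xml_stream):
    """Yield titles of main-namespace pages whose wikitext has an English section."""
    title, ns, text = None, None, ""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            # Namespace 0 holds the dictionary entries themselves.
            if ns == "0" and title and ENGLISH_HEADING.search(text):
                yield title
            title, ns, text = None, None, ""
            elem.clear()  # drop the finished page's children and text
```

To use it, open the downloaded dump in binary mode (for a .xml.bz2 file, bz2.open(path, "rb") from the standard library works) and iterate over english_titles(...), printing or writing each title as it arrives.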

Loydloydie answered 18/3, 2013 at 13:42 Comment(1)

It looks like that page (Index:English) is gone now. Similar data remains available at en.wiktionary.org/wiki/Category:English_lemmas . That page is paginated, though, so scraping it will require code a little more complex than the scripts in this answer. – Nonperformance 18/3, 2019 at 5:18
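
One way to deal with the pagination is to go through the MediaWiki API rather than the HTML pages. The following is a rough sketch, assuming the list=categorymembers query module and its continue mechanism; category_titles, fetch_json and the User-Agent string are placeholders invented for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def fetch_json(params):
    """Perform one GET request against the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with something that identifies your script.
    req = urllib.request.Request(url, headers={"User-Agent": "title-lister-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def category_titles(category, fetch=fetch_json):
    """Yield every main-namespace page title in `category`, batch by batch."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmnamespace": 0,
        "cmlimit": "max",
        "format": "json",
    }
    cont = {}
    while True:
        data = fetch({**params, **cont})
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break  # no further batches
        cont = data["continue"]  # carries cmcontinue for the next request
```

Calling category_titles("Category:English_lemmas") would then yield the titles one at a time. The category is large, so expect many requests; for bulk use a dump is probably kinder to the servers.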

As Greg Price has pointed out in his comment, the page http://en.wiktionary.org/wiki/Index:English doesn't exist anymore.

Instead of manually scraping data from https://en.wiktionary.org/wiki/Category:English_lemmas though, one can use the extracted data from tools such as wiktextract. Simply go to https://kaikki.org/dictionary/rawdata.html and download one of the files for the respective language, such as de-extract.json.gz (this is for German, for example). After decompressing the file with gunzip /tmp/de-extract.json.gz, run the following Python script to extract German titles only and save them in /tmp/out.csv:

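import json

# Each line of the extract is one JSON object describing a single entry.
with open('/tmp/de-extract.json', encoding='utf-8') as f_in, \
        open('/tmp/out.csv', 'w', encoding='utf-8') as f_out:
    for line in f_in:
        data = json.loads(line)
        try:
            if data['lang_code'] == 'de':
                f_out.write(data['word'] + '\n')
        except KeyError:
            # Skip entries that lack a language code or a word.
            pass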

If you need the file sorted with duplicates removed, you can then run sort -u /tmp/out.csv > /tmp/out_sorted.csv.
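
If you would rather skip the separate gunzip and sort steps, the same filtering can be sketched in one pass over the compressed file. This assumes the file is gzipped JSON lines with word and lang_code keys, as in the script above; titles_for_language is a name invented for this example.

```python
import gzip
import json


def titles_for_language(path, lang_code):
    """Return the sorted, de-duplicated words for one language from a gzipped extract."""
    words = set()
    # gzip.open in text mode decompresses on the fly, one line at a time.
    with gzip.open(path, "rt", encoding="utf-8") as f_in:
        for line in f_in:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            if entry.get("lang_code") == lang_code and "word" in entry:
                words.add(entry["word"])
    return sorted(words)
```

For example, titles_for_language("/tmp/de-extract.json.gz", "de") would return the German titles as a list. Note that sorted() orders by Unicode code point, which may differ from the locale-aware order that sort -u produces.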

Writhen answered 14/6 at 18:15 Comment(0)

The solution and code samples serans posted were great, but I had trouble getting his Python code to run.

I followed his example and wrote a Ruby version:

#!/usr/bin/env ruby

require 'net/http'
require "rexml/document"

url = 'http://en.wiktionary.org/wiki/Index:English/'

('a'..'z').each do |letter|
  response = Net::HTTP.get(URI(url + letter))
  doc = REXML::Document.new(response)
  REXML::XPath.each(doc, "//ol/li/a") do |element|
    puts element.text
  end
end

Directoire answered 20/11, 2013 at 5:14 Comment(0)

Following on from @serans' answer, I've created a GitHub Gist that does the same in Swift:

https://gist.github.com/ashleymills/549ab8aff05ec90f4350#file-wiktionaryfetcher-swift

Chemotherapy answered 3/2, 2015 at 14:38 Comment(0)
