mediawiki-api wiktionary

Easy way to get wiktionary titles only in one language?


I can easily get a dump of all the titles in Wiktionary, but this dump contains every word, including non-English ones.

For example, it contains souris ("mouse" in French): https://en.wiktionary.org/wiki/souris

Is there an easy way, or an existing script, to get only the titles in one specific language? I would like to get all the English words from Wiktionary, excluding entries that do not exist in English.

So far my only idea is to parse the text and check if there is a ==English== line, but it is too slow to be usable.


Solution

  • I think you'll need to either:

    a) scrape the English index pages (Index:English/a through Index:English/z), or
    b) download a full dump and filter it yourself.

    I only tried option a), because option b) would mean a download of several GB. Option a) is very simple; here is a quick JS implementation that you can use as a base for your own script in your preferred language.

    // Requires jQuery ($) to be loaded on the page.
    var baseURL = "http://en.wiktionary.org/wiki/Index:English/";
    var letters = "abcdefghijklmnopqrstuvwxyz".split("");

    for (var i = 0; i < letters.length; i++) {
        var letter = letters[i];
        console.log(letter);
        // Fetch the index page for this letter and log the text of every entry link.
        $.get(baseURL + letter, function (response) {
            $(response).find("ol li a").each(function (k, v) {
                console.log(v.text);
            });
        });
    }
    

    EDIT: I was curious about this myself, so I wrote a Python script, in case somebody finds it useful:

    # Python 3; needs the lxml and cssselect packages.
    from string import ascii_lowercase
    from urllib.request import Request, urlopen

    from lxml.cssselect import CSSSelector
    from lxml.html import fromstring

    url = 'http://en.wiktionary.org/wiki/Index:English/'
    sel = CSSSelector("ol li a")  # every entry link in the index lists

    for letter in ascii_lowercase:
        # Send a User-Agent header with each request.
        req = Request(url + letter, headers={'User-Agent': "Magic Browser"})
        with urlopen(req) as con:
            h = fromstring(con.read())

        for x in sel(h):
            # text_content() also covers links whose text sits in a child element
            print(x.text_content())
    

    I'd paste the results to pastebin myself, but the 500 KB limit won't let me.