python nlp word-list

Matching two word variants with each other if they don't match alphabetically


I'm doing an NLP project with my university, collecting Icelandic words that exist spelled both with an i and with a y (the two sound the same in Icelandic, FYI), where both variants are actual words but do not mean the same thing. Examples include leyti (an approximation in time) and leiti (a grassy hill), or kirkja (church) and kyrkja (choke). I have a dataset of 2 million words. I have already collected two wordlists: one contains the words spelled with a y, and the other contains the same words spelled with an i (the lists don't match up completely, as the y-list is a bit longer, but that's a separate issue).

My problem is that I want to end up with pairs of words like leyti - leiti, kyrkja - kirkja, etc. Because y comes much later in the alphabet than i, it's no good just sorting the lists and pairing them up by position. I also tried zipping the lists while checking the first few letters to see if I could find a match, but that leaves out all words that have y or i as the first letter. Do you have a suggestion on how I might implement this?


Solution

  • This accomplishes my task. It's an easy, not-that-pretty solution, I suppose, but it works:

    # Read every word once, lowercased and without the trailing newline.
    # A set makes the membership test further down fast on 2 million words.
    with open("data.txt", "r", encoding="utf-8") as wordlist:
        all_words = {line.strip().lower() for line in wordlist if line.strip()}
    
    # Only words containing a "y" can have an i-spelled counterpart.
    y_words = sorted(word for word in all_words if "y" in word)
    
    word_dict = {}
    
    for word in y_words:
        newwith1y = word.replace("y", "i", 1)             # first y only
        newwith2y = word.replace("y", "i", 2)             # first two y's
        newyback = word[::-1].replace("y", "i", 1)[::-1]  # last y only
        # Store all three candidates so that none of them overwrites another.
        word_dict[word] = {newwith1y, newwith2y, newyback}
    
    # Write one "y-word - i-word" line for every candidate that is a real word.
    with open("y_wordlist.txt", "w", encoding="utf-8") as y_wordlist:
        for key, values in word_dict.items():
            for value in sorted(values):
                if value in all_words:
                    y_wordlist.write(f"{key} - {value}\n")
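
  • A possible generalisation of the approach above, offered as a sketch rather than a definitive implementation: instead of trying only the first, first two, and last y, it tries every combination of y-to-i replacements and looks each candidate up in a set. The function names `i_variants` and `find_pairs` and the `sample` list are made up for illustration; in practice the words would come from the full dataset.

```python
from itertools import product


def i_variants(word):
    """Yield every spelling obtained by replacing one or more y's with i."""
    positions = [idx for idx, ch in enumerate(word) if ch == "y"]
    # Each y is either kept (False) or swapped for i (True).
    for choice in product((False, True), repeat=len(positions)):
        if not any(choice):
            continue  # skip the unchanged word
        chars = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "i"
        yield "".join(chars)


def find_pairs(words):
    """Return sorted (y_word, i_word) pairs where both spellings occur in words."""
    vocab = {w.strip().lower() for w in words if w.strip()}
    pairs = set()
    for word in vocab:
        if "y" not in word:
            continue
        for variant in i_variants(word):
            if variant in vocab:  # set lookup, so this stays fast on large lists
                pairs.add((word, variant))
    return sorted(pairs)


# Hypothetical sample data standing in for the real wordlist.
sample = ["leyti", "leiti", "kirkja", "kyrkja", "Yfir\n", "hestur"]
for y_word, i_word in find_pairs(sample):
    print(f"{y_word} - {i_word}")
```

    Because candidates are generated from each word and checked against the set, words that start with y or i are handled like any others, and no sorting or alignment of the two lists is needed. The number of candidates doubles with each y in a word, which should be harmless as long as words contain only a few of them.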