python, regex, unicode, tamil, indic

Regex to get list of all words with specific letters (unicode graphemes)


I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).

Given a list of letters, I need to pull out the subset of those words that can be spelled using just those letters.

An English example:

words = ["cat", "dog", "tack", "coat"] 

get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]

A Tamil example:

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]

get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]

Neither the order in which the words are returned nor the order in which the letters are entered should make a difference.

Although I understand the difference between Unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.

In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).
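
For illustration (assuming Python 3 here), the visible letter ம் is actually two codepoints, a base consonant plus the combining pulli (virama), which is exactly the kind of marking I don't want detached from its letter:

import unicodedata

word = u"மரம்"
print(len(word))   # 4 codepoints, but only 3 visible letters
for ch in word:
    print(hex(ord(ch)), unicodedata.name(ch))
# 0xbae TAMIL LETTER MA
# 0xbb0 TAMIL LETTER RA
# 0xbae TAMIL LETTER MA
# 0xbcd TAMIL SIGN VIRAMA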


Solution

  • To support characters that can span several Unicode codepoints:

    # -*- coding: utf-8 -*-
    import re
    import unicodedata
    from functools import partial
    
    NFKD = partial(unicodedata.normalize, 'NFKD')
    
    def match(word, letters):
        word, letters = NFKD(word), map(NFKD, letters) # normalize
        return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)
    
    words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
    get_words = lambda letters: [w for w in words if match(w, letters)]
    
    print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
    # -> மடம் படம்
    print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
    # -> படம்
    

    It assumes that each of the given letters can be used zero or more times in a word.
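
    For example, with the Tamil letters above the pattern contains both ம and the two-codepoint ம் as alternatives; re.match() backtracks as needed, so the pulli can only ever be matched as part of ம், and a word containing any other mark is rejected. A sketch (normalization omitted, since NFKD leaves these particular letters unchanged):

    import re

    letters = [u'ம', u'ப', u'ட', u'ம்']
    pattern = r"(?:%s)+$" % "|".join(map(re.escape, letters))

    print(bool(re.match(pattern, u"படம்")))   # True: ப, ட, ம் are all allowed
    print(bool(re.match(pattern, u"பாடம்")))  # False: the vowel sign ா is not in the list
    print(bool(re.match(pattern, u"மரம்")))   # False: ர is not in the list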

    If you want only words that use exactly the given characters (all of them, and nothing else):

    import regex # $ pip install regex
    
    chars = regex.compile(r"\X").findall # split a word into grapheme clusters
    
    def match(word, letters):
        return sorted(chars(word)) == sorted(letters)
    
    words = ["cat", "dog", "tack", "coat"]
    
    print(" ".join(get_words(['o', 'c', 'a', 't'])))
    # -> coat
    print(" ".join(get_words(['k', 'c', 't', 'a'])))
    # -> tack
    

    Note: "cat" is not in the output in this case because it doesn't use all of the given characters.
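
    Applied to the Tamil data, the point of \X is that it keeps a base letter and its combining marks together as a single grapheme cluster, so ம் counts as one character. A small sketch using the same regex module (Python 3 output shown):

    import regex

    chars = regex.compile(r"\X").findall

    print(chars(u"படம்"))
    # -> ['ப', 'ட', 'ம்']  (the pulli stays attached to its letter)
    print(sorted(chars(u"படம்")) == sorted([u'ப', u'ம்', u'ட']))
    # -> True
    print(sorted(chars(u"மடம்")) == sorted([u'ம', u'ப', u'ட', u'ம்']))
    # -> False: ப is unused, so the exact-characters test rejects it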


    What does it mean to normalize? And could you please explain the syntax of the re.match() regex?

    >>> import re
    >>> re.escape('.')
    '\\.'
    >>> c = u'\u00c7'
    >>> cc = u'\u0043\u0327'
    >>> cc == c
    False
    >>> re.match(r'%s$' % (c,), cc) # does not match (returns None)
    >>> import unicodedata
    >>> norm = lambda s: unicodedata.normalize('NFKD', s)
    >>> re.match(r'%s$' % (norm(c),), norm(cc)) # matches
    <_sre.SRE_Match object at 0x1364648>
    >>> print c, cc
    Ç Ç
    

    Without normalization, c and cc do not match even though they render identically. The characters are taken from the unicodedata.normalize() docs.
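
    As for the re.match() pattern: re.escape() makes each letter safe to embed literally, "|".join() turns the letters into alternatives, (?:...) groups them without capturing, + repeats the group one or more times, and $ forces the repetition to cover the whole word (re.match() itself anchors at the start). With the English letters it looks like this:

    >>> import re
    >>> pattern = r"(?:%s)+$" % "|".join(map(re.escape, ['o', 'c', 'a', 't']))
    >>> pattern
    '(?:o|c|a|t)+$'
    >>> bool(re.match(pattern, "coat")), bool(re.match(pattern, "dog"))
    (True, False)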