
Print all the tokens in the file that are labelled with the morphological tag


I want to print all the tokens in a file that are labelled with a given morphological tag. So far I have written the code shown below.

def index(filepath, string):

    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)

    StringList.clear()



index('deneme.txt', '+Noun')

The output looks like this. I can find the tag (Noun) and the line number, but I can't print the part I want: only the word that comes before the + sign.

Noun            1
Noun            2
Noun            3
Noun            4
Noun            5
Noun            6
Noun            7

The lines in my file look like this:

Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc 
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc 
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc 
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl 
club+Noun toplantı+Noun+A3pl+P3sg 
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc 
club+Noun toplantı+Noun+A3pl+P3sg 
nispi+Adj 
nisbi+Adj 
görece+Adj+With 
izafi+Adj 
obur+Adj 

I want to get the tokens for a tag that I pass in. For example, when I write +Adj, I want to get all the tokens that include +Adj (e.g. nispi, izafi, ...).


Solution

  • I think your approach to using regexes needs some improvement.

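    Note that each input line contains a number of "tokens", e.g. terörizm+Noun+Gen. Each token consists of an initial word (terörizm) followed by one or more classification symbols (Noun, Gen), each preceded by a + sign.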

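    So each line should first be split into tokens on whitespace, and each token should then be split on + into the initial word and its classification symbols.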

    A good habit is to strip the trailing whitespace (at least \n) from each line before splitting it.
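
    A minimal sketch of the splitting step, using a few tokens from the question's data. The helper name split_token is hypothetical, and plain str.split is used here instead of compiled regexes:

```python
def split_token(token):
    """Split 'word+Tag1+Tag2' into ('word', ['Tag1', 'Tag2'])."""
    word, *tags = token.split('+')
    return word, tags

line = "Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj \n"

# Strip trailing whitespace (including the newline), then split on blanks.
tokens = line.rstrip().split()
parsed = [split_token(t) for t in tokens]
print(parsed)
# [('Türkiye', ['Noun']), (',', ['Punc']), ('terörizm', ['Noun', 'Gen']), ('ve', ['Conj'])]
```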

    Note also that your code contains StringList, so you apparently anticipate that this function may need to look for several classification symbols at once.

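    I programmed it in a slightly different way: the second argument (lookFor) is a list of classification symbols, which is converted to a set (lookForSet), and for each token the symbols that follow the first word are collected into another set (wordSet).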

    The decision whether to print a word (the first word from a token) is based on whether at least one of its classification symbols can be found in lookForSet. To put it another way - whether lookForSet and wordSet have some common elements (set intersection).
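
    As a small illustration of that test (the variable names here are only for the example): an empty intersection is falsy, so it can be used directly in an if.

```python
look_for = {'Noun', 'Gen'}                      # symbols we are searching for
token_tags = {'Noun', 'A3pl', 'P3sg', 'Gen'}    # symbols of silah+Noun+A3pl+P3sg+Gen

# The & operator is equivalent to look_for.intersection(token_tags).
common = look_for & token_tags
print(sorted(common))    # ['Gen', 'Noun']

# No shared symbols: the result is an empty set, which is falsy.
no_match = look_for & {'Adj'}
print(bool(no_match))    # False
```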

    So the whole script can look like this:

    import re
    
    def index(fileName, lookFor):
        lookForSet = set(lookFor)  # Set of classification symbols to look for
        pat1 = re.compile(r'\s+')  # Regex to split line into tokens
        pat2 = re.compile(r'\+')   # Regex to split a token into words
        with open(fileName) as f:
            for lineNo, line in enumerate(f, start=1):
                line = line.rstrip()
                tokens = pat1.split(line)
                for token in tokens:
                    words = pat2.split(token)
                    word1 = words.pop(0)  # Initial word
                    wordSet = set(words)  # Classification words
                    commonWords = lookForSet.intersection(wordSet)
                    if commonWords:
                        print("{:3}: {:<15} {}".format(lineNo, word1, ', '.join(commonWords)))
    
    index('lines.txt', ['Noun', 'Gen'])
    

    Part of its output for my input data (a slightly shortened version of yours) looks like this:

    1: Türkiye         Noun
    1: terörizm        Noun, Gen
    1: kitle           Noun
    1: imha            Noun
    2: Türkiye         Noun, Gen
    2: potansiyel      Noun
    

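    Each output line contains the line number, the first word of the token, and the classification symbols that matched.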
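
    If only the words are needed for a single tag such as +Adj, as the question asks, a variant along these lines might do. This is a sketch, not part of the script above: words_with_tag is a hypothetical name, it takes any iterable of lines so that it can be tried without a file, and it accepts the tag with or without the leading + sign.

```python
def words_with_tag(lines, tag):
    """Return the initial word of every token that carries the given tag."""
    wanted = tag.lstrip('+')          # accept both '+Adj' and 'Adj'
    found = []
    for line in lines:
        for token in line.split():    # split() also discards the trailing newline
            word, *tags = token.split('+')
            if wanted in tags:
                found.append(word)
    return found

# A few lines taken from the question's data.
sample = [
    "Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg\n",
    "nispi+Adj \n",
    "görece+Adj+With \n",
]
print(words_with_tag(sample, '+Adj'))
# ['ekonomik', 'insani', 'nispi', 'görece']
```

    An open file object can be passed in place of the list, since iterating over it yields lines. Opening it with an explicit encoding such as UTF-8 is probably wise for the Turkish characters, although the actual encoding of deneme.txt is an assumption here.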