I want to print all the tokens which are labelled with a given morphological tag in a file. So far I have written the code shown below.
def index(filepath, string):
    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
    StringList.clear()

index('deneme.txt', '+Noun')
The output looks like this. I can find the Noun tag in the token and the line number, but I can't print the part I want. I only want the word part, which comes before the + sign.
Noun 1
Noun 2
Noun 3
Noun 4
Noun 5
Noun 6
Noun 7
The lines in my file look like this:
Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl
club+Noun toplantı+Noun+A3pl+P3sg
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
club+Noun toplantı+Noun+A3pl+P3sg
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj
I want to get the tokens for a tag that I specify. For example, when I write +Adj, I want to get all the tokens that include +Adj (for example nispi, izafi, ...).
I think your approach to using regexes needs some improvement.
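Note that each input line contains a number of "tokens", e.g. terörizm+Noun+Gen. As you can see, each token contains several words separated by a + char. So: split each line into tokens on whitespace, then split each token into words on the + char; the first word is the word proper, and the remaining words (those after a +) are classification symbols. A good habit is to strip the terminating blank chars (at least \n) from each line first.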
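The splitting of one line into tokens, and of each token into a word and its classification symbols, can be tried in isolation. This is only a minimal sketch: the helper name split_token is made up for illustration, and it uses plain str.split instead of regexes.

```python
def split_token(token):
    """Split 'word+Tag1+Tag2' into the word proper and its list of symbols."""
    word, *symbols = token.split('+')
    return word, symbols


line = "terörizm+Noun+Gen ve+Conj kitle+Noun\n"
# rstrip() drops the trailing newline, split() cuts the line on whitespace
for token in line.rstrip().split():
    print(split_token(token))
```

Each printed pair holds the word before the first + and the list of symbols that follow it.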
Note also that your code contains StringList, so you are aware of the case that this function may look for one or more of multiple classification words.
I programmed it in a slightly different way: the second parameter (lookFor) is a list of words, which is converted into a set (lookForSet). The decision whether to print a word (the first word from a token) is based on whether at least one of its classification symbols can be found in lookForSet. To put it another way: whether lookForSet and wordSet have some common elements (set intersection).
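The set-intersection test can be seen on its own in the short sketch below. The function name common_symbols and the sample symbol lists are assumptions made for illustration only; an empty result set is falsy, which is what makes the "print or skip" decision work.

```python
def common_symbols(look_for, symbols):
    """Return the symbols that appear both in look_for and in the token."""
    return set(look_for) & set(symbols)   # same as .intersection()


look_for = ["Noun", "Gen"]
for symbols in (["Noun", "Gen"], ["Conj"], ["Adj", "P3sg", "Loc"]):
    # A non-empty intersection means the token should be reported
    verdict = "match" if common_symbols(look_for, symbols) else "no match"
    print(symbols, "->", verdict)
```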
So the whole script can look like below:
import re

def index(fileName, lookFor):
    lookForSet = set(lookFor)    # Set of classification symbols to look for
    pat1 = re.compile(r'\s+')    # Regex to split line into tokens
    pat2 = re.compile(r'\+')     # Regex to split a token into words
    with open(fileName) as f:
        for lineNo, line in enumerate(f, start=1):
            line = line.rstrip()
            tokens = pat1.split(line)
            for token in tokens:
                words = pat2.split(token)
                word1 = words.pop(0)    # Initial word
                wordSet = set(words)    # Classification words
                commonWords = lookForSet.intersection(wordSet)
                if commonWords:
                    print("{:3}: {:<15} {}".format(lineNo, word1, ', '.join(commonWords)))

index('lines.txt', ['Noun', 'Gen'])
A piece of the output for my input data (a slightly shortened version of yours) is shown below:
  1: Türkiye         Noun
  1: terörizm        Noun, Gen
  1: kitle           Noun
  1: imha            Noun
  2: Türkiye         Noun, Gen
  2: potansiyel      Noun
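Each output line contains the line number, the word proper (the first word of the token), and the classification symbols from lookFor that have been found in this token.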
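If only the bare words are needed, as in the +Adj example from the question, the same idea can return a list instead of printing. The following is a sketch under a few assumptions: the function name words_with_tag is hypothetical, it takes any iterable of lines (an open file works too), and it accepts the tag with or without the leading + sign.

```python
def words_with_tag(lines, tag):
    """Return the words (text before the first '+') whose token carries tag."""
    tag = tag.lstrip('+')              # accept both 'Adj' and '+Adj'
    found = []
    for line in lines:
        for token in line.split():     # split() also discards the newline
            word, *symbols = token.split('+')
            if tag in symbols:
                found.append(word)
    return found


sample = [
    "club+Noun toplantı+Noun+A3pl+P3sg\n",
    "nispi+Adj\n",
    "görece+Adj+With\n",
    "izafi+Adj\n",
]
print(words_with_tag(sample, '+Adj'))   # ['nispi', 'görece', 'izafi']
```

To run it on a file, pass the open file object in place of sample; opening the file with encoding='utf-8' is likely needed for the Turkish characters, depending on the platform default.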