python n-gram utf8-decode

Program in Python doesn't match equal words


I have a list of 4-grams that I want to find in a text, but I'm having trouble with words that contain accents. For example, let's say our 4-gram list is:

Quad = [(u'Jogos', u'Olímpicos', u'de', u'Verão'), (u'Jogos', u'Olímpicos', u'de', u'Inverno'), (u'Jogos', u'Olímpicos', u'de', u'Sidney')]

I have a small text file for testing that contains this:

'Tasha fez parte da equipe norte americana que disputou os Jogos Olímpicos de Sidney em 2000 na Austrália'

But I'm unable to match 'Jogos Olímpicos de Sidney' from the text with the corresponding entry in my 4-gram list.

I tried a couple of things.

First: I made a second list with the words of each 4-gram joined by underscores:

Quad2 = [u'Jogos_Ol\xedmpicos_de_Ver\xe3o', u'Jogos_Ol\xedmpicos_de_Inverno', u'Jogos_Ol\xedmpicos_de_Sidney']

That's how the list is displayed; if I print Quad2[2], I get Jogos_Olímpicos_de_Sidney

When I try

i = 0
while i < (len(test) - 3):
    if (test[i] + '_' + test[i+1] + '_' + test[i+2] + '_' + test[i+3]) in Quad2:
        print test[i]
    i += 1

It doesn't print anything.

Second:

k = 0
while k < len(test) - 3:
    for i in range(len(Quad)):
        if test[k] == Quad[i][0] and test[k+1] == Quad[i][1] and test[k+2] == Quad[i][2] and test[k+3] == Quad[i][3]:
            print test[k]
    k = k + 1

With words without accents both methods work, but with words like 'Olímpicos' they don't. Any thoughts?


Solution

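  • In Python 2, reading the file normally gives byte strings, and a UTF-8 byte string such as 'Ol\xc3\xadmpicos' never compares equal to the unicode string u'Ol\xedmpicos'. You need to open your test file so that it is read as Unicode: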

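    import codecs
    # Decode the file as UTF-8 so every token is a unicode object, like the 4-grams
    with codecs.open('/home/portugues/teste.txt', encoding='utf-8') as f:
        test = f.read().split()  # split on any whitespace, so no token keeps a trailing newline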
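
To see the effect without touching the file system, here is a minimal sketch of the same idea. It is written for Python 3, where plain string literals are already Unicode. The helper name `find_ngrams` and the variable `raw` (a stand-in for the bytes of the test file) are made up for this illustration and do not come from the question.

```python
# The 4-grams from the question, stored as a set of tuples for fast lookup.
QUAD = {
    ('Jogos', 'Olímpicos', 'de', 'Verão'),
    ('Jogos', 'Olímpicos', 'de', 'Inverno'),
    ('Jogos', 'Olímpicos', 'de', 'Sidney'),
}


def find_ngrams(tokens, ngrams, n=4):
    """Return every window of n consecutive tokens that appears in ngrams."""
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [w for w in windows if w in ngrams]


# Hypothetical stand-in for the raw, UTF-8 encoded contents of the test file.
raw = ('Tasha fez parte da equipe norte americana que disputou os '
       'Jogos Olímpicos de Sidney em 2000 na Austrália').encode('utf-8')

# Without decoding, the tokens are byte strings and match none of the 4-grams.
undecoded_matches = find_ngrams(raw.split(), QUAD)

# After decoding, the tokens are Unicode strings and the 4-gram is found.
decoded_matches = find_ngrams(raw.decode('utf-8').split(), QUAD)

print(undecoded_matches)
print(decoded_matches)
```

Comparing whole tuples against a set also avoids joining the words with underscores or looping over every 4-gram by hand, but that is a design choice and not something the fix depends on.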