Tags: python, token, nltk, tokenize

Tokenize a sentence into words in Python


I want to extract information from different sentences, so I'm using NLTK to split each sentence into words, with this code:

import nltk

words = []
for sentence in sentences:
    words.append(nltk.word_tokenize(sentence))

It works pretty well, but I want something slightly different. For example, I have this sentence:

'[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'

I want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)" to be one word and not divided into several single words .

UPDATE: I want something like this:

[
 'Jan',
 '31',
 '19:28:14',
 'nginx',
 '10.0.0.0',
 '31/Jan/2019:19:28:14',
 '+0100',
 'POST',
 '/test/itf/',
 'HTTP/x.x',
 '404',
 '146',
 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']

Any idea how to make this possible?


Solution

  • You can import re and parse the log line (which is not a natural language sentence) with a regex:

    import re
    import nltk  # needed for the fallback; nltk.word_tokenize also requires the "punkt" models (nltk.download('punkt'))
    
    sentences = ['[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']']
    
    # One capturing group per field: month, day, time, process name, client IP,
    # request timestamp, timezone, method, path, protocol, status code,
    # response size, and the user agent (kept whole, spaces included).
    rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')
    
    words = []
    for sent in sentences:
        m = rx.search(sent)
        if m:
            # The line matches the log format: take the captured fields
            words.append(list(m.groups()))
        else:
            # Otherwise fall back to regular word tokenization
            words.append(nltk.word_tokenize(sent))
    
    print(words)
    

    See the Python demo.

    The output will look like this:

    [['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100', 'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']]
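
    If the single-line pattern becomes hard to maintain, the same regex can be written with re.VERBOSE and named groups. This is only a readability sketch of the pattern above; the group names are illustrative and not part of the original answer:

    import re
    
    # Same pattern as above, spread out with re.VERBOSE; the group names
    # are illustrative assumptions chosen for this sketch.
    rx = re.compile(r'''
        \b(?P<month>\w{3})\s+
        (?P<day>\d{1,2})\s+
        (?P<time>\d{1,2}:\d{1,2}:\d{2})\s+
        (?P<process>\w+)\W+
        (?P<ip>\d{1,3}(?:\.\d{1,3}){3})
        (?:\s+\S+){2}\s+                     # skip the two "-" fields
        \[(?P<req_time>[^][\s]+)\s+(?P<tz>[+\d]+)]\s+
        "(?P<method>[A-Z]+)\s+(?P<path>\S+)\s+(?P<protocol>\S+)"\s+
        (?P<status>\d+)\s+(?P<size>\d+)\s+\S+\s+
        "(?P<agent>[^"]*)"
    ''', re.VERBOSE)
    
    m = rx.search(sentences[0])
    if m:
        print(m.groupdict())  # field name -> captured value

    m.groups() still returns the same 13 fields in the same order, so the loop above works unchanged with this pattern.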