python machine-learning nlp named-entity-recognition python-crfsuite

Numeric conversion of textual features in crfsuite


I was looking at the example code in the python-crfsuite docs, which defines features as follows.

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')

    return features

I understand that features such as isupper() can be either 0 or 1, but how are features such as word[-2:], which are characters, converted to numeric terms?


Solution

  • A CRF is trained on sequences of input data to learn transitions from one state (label) to the next. To make that possible, we need to define features that describe each token and its context. The word2features() function in the question turns each word into a list of feature strings covering the following attributes:

    lower case of word
    suffix containing last 3 characters
    suffix containing last 2 characters
    flags for upper-case, title-case and all-digit words, plus the POS tag and its first two characters
    

    We also attach the same kind of attributes for the previous and next words and their tags. When a word has no previous or next neighbour, we mark it as beginning of sentence (BOS) or end of sentence (EOS) instead.
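
On the numeric part of the question: as I understand the python-crfsuite documentation, each string in the feature list is treated as a binary indicator attribute with value 1.0. So 'word[-2:]=ly' is just a feature that is either present or absent for a token, and the characters themselves are never turned into numbers. The sketch below illustrates that idea in plain Python. The helpers build_vocab and encode are hypothetical names of my own, not part of the CRFsuite API, and the library does its own attribute-to-id mapping internally.

```python
def build_vocab(sequences):
    """Assign an integer id to every distinct feature string, in first-seen order."""
    vocab = {}
    for seq in sequences:
        for token_features in seq:
            for feat in token_features:
                if feat not in vocab:
                    vocab[feat] = len(vocab)
    return vocab


def encode(token_features, vocab):
    """Return a sparse {feature_id: 1.0} mapping for one token (hypothetical helper)."""
    return {vocab[f]: 1.0 for f in token_features if f in vocab}


# Toy feature lists in the same style as the output of word2features()
sent = [
    ['bias', 'word.lower=quickly', 'word[-2:]=ly', 'word.isupper=False'],
    ['bias', 'word.lower=slowly', 'word[-2:]=ly', 'word.isupper=False'],
    ['bias', 'word.lower=nasa', 'word[-2:]=SA', 'word.isupper=True'],
]

vocab = build_vocab([sent])
encoded = [encode(tok, vocab) for tok in sent]

for tok, enc in zip(sent, encoded):
    print(tok[1], '->', sorted(enc))
```

Note that 'word.isupper=True' and 'word.isupper=False' also end up as two separate indicator features here, exactly like the suffix strings. Written this way, even the boolean checks are not passed as a single 0/1 value. The model then learns one weight per (feature, label) pair.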