python, tokenize, string, tokenizer

Python: find offsets of word tokens in a text


I wrote this function findTokenOffset that finds the offsets of given word tokens in a text (the tokens being either space-separated words or the output of some tokenizer).

import re, json

def word_regex_ascii(word):
    return r"\b{}\b".format(re.escape(word))

def findTokenOffset(text, tokens):
    seen = {}   # maps each token to the end offset where it was last accepted
    items = []  # word tokens found so far
    my_regex = word_regex_ascii
    # for each token word
    for index_word, word in enumerate(tokens):

        r = re.compile(my_regex(word), flags=re.I | re.X | re.UNICODE)

        item = {}
        # for each match of this token in the sentence
        for m in r.finditer(text):

            token = m.group()
            characterOffsetBegin = m.start()
            characterOffsetEnd = characterOffsetBegin + len(m.group()) - 1  # offsets start from 0

            found = -1
            if word in seen:
                found = seen[word]

            if characterOffsetBegin > found:
                # remember where this word was last seen
                seen[word] = characterOffsetEnd
                item['index'] = index_word + 1  # word index starts from 1
                item['word'] = token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)

                break
    return items

This code works fine when the tokens are single words:

text = "George Washington came to Washington"
tokens = text.split()
offsets = findTokenOffset(text, tokens)
print(json.dumps(offsets, indent=2)) 
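For reference, each word then gets its own inclusive character offsets; abridged to the first two of the five entries, the output is:

[
  {
    "index": 1,
    "word": "George",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 5
  },
  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 7,
    "characterOffsetEnd": 16
  }
]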

But suppose the tokens are multi-word phrases, as here:

text = "George Washington came to Washington"
tokens = ["George Washington", "Washington"]
offsets = findTokenOffset(text, tokens)
print(json.dumps(offsets, indent=2)) 

the offsets come out wrong, because the same word repeats across different tokens:

[
  {
    "index": 1,
    "word": "George Washington",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 16
  },
  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 7,
    "characterOffsetEnd": 16
  }
]

How can I add support for multi-word tokens and overlapping token regex matching (thanks to the suggestion in the comments for this problem's exact name)?


Solution

  • If you do not need the search phrase/word index information in the resulting output, you can use the following approach:

    import re, json

    def findTokenOffset(text, pattern):
        items = []
        for m in pattern.finditer(text):
            item = {}
            item['word'] = m.group()
            item['characterOffsetBegin'] = m.start()
            item['characterOffsetEnd'] = m.end()
            items.append(item)
        return items

    text = "George Washington came to Washington Washington.com"
    tokens = ["George Washington", "Washington"]
    pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I)
    offsets = findTokenOffset(text, pattern)
    print(json.dumps(offsets, indent=2))
    

    The output of the Python demo:

    [
      {
        "word": "George Washington",
        "characterOffsetBegin": 0,
        "characterOffsetEnd": 17
      },
      {
        "word": "Washington",
        "characterOffsetBegin": 26,
        "characterOffsetEnd": 36
      }
    ]
    

    The main part is pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I) that does the following:

      • (?<!\w) - fails the match if a word character appears immediately on the left (a left-hand word boundary)
      • (?:...) - a non-capturing group of the re.escaped tokens as alternatives, sorted by length in descending order so that longer phrases are tried before their shorter prefixes
      • (?!\w) - fails the match if a word character appears immediately on the right (a right-hand word boundary)
      • (?!\.\b) - fails the match if it is immediately followed by a . that is itself followed by a word character (as in Washington.com).
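    To see why the longest-first sort matters, here is a minimal illustration with hypothetical overlapping tokens, one of which is a prefix of the other:

    toks = ["George", "George Washington"]  # hypothetical overlapping tokens
    # Unsorted: the alternation tries branches left to right, so the
    # shorter prefix wins and the full phrase is never matched:
    p1 = re.compile(fr'(?<!\w)(?:{"|".join(map(re.escape, toks))})(?!\w)', re.I)
    print(p1.findall("George Washington"))  # ['George']
    # Sorted longest-first, the whole phrase is tried (and matched) first:
    p2 = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, toks), key=len, reverse=True))})(?!\w)', re.I)
    print(p2.findall("George Washington"))  # ['George Washington']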

    NOTE ON WORD BOUNDARIES

    You should check your token boundary requirements. I added (?!\.\b) because you mention that Washington should not match inside Washington.com, so I inferred you want to fail any word match that is immediately followed by a . and a word boundary. There are many other possible solutions, the main one being whitespace boundaries, (?<!\S) and (?!\S), as sketched below.
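    A minimal sketch of that whitespace-boundary variant, reusing text and tokens from the demo above:

    # (?<!\S) requires start-of-string or whitespace on the left,
    # (?!\S) requires end-of-string or whitespace on the right.
    ws_pattern = re.compile(
        fr'(?<!\S)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\S)',
        re.I)
    print(json.dumps(findTokenOffset(text, ws_pattern), indent=2))
    # "Washington" inside "Washington.com" is still skipped, because the
    # match would be followed by the non-whitespace character ".".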

    See also Match a whole word in a string using dynamic regex.
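    If you do need the 1-based token index from the question's original output, one option is to map each match back to its position in the tokens list. A minimal sketch, with findTokenOffsetWithIndex as a hypothetical helper (keys are lowercased to mirror the re.I flag):

    import re

    def findTokenOffsetWithIndex(text, tokens):
        pattern = re.compile(
            fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)',
            re.I)
        # case-insensitive lookup from the matched text back to its token position
        index_of = {t.lower(): i + 1 for i, t in enumerate(tokens)}
        return [{'index': index_of[m.group().lower()],
                 'word': m.group(),
                 'characterOffsetBegin': m.start(),
                 'characterOffsetEnd': m.end()}
                for m in pattern.finditer(text)]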