I wrote this function, findTokenOffset,
that finds the character offsets of each given word in a pre-tokenized text (the tokens are either a list of space-separated words or the output of a certain tokenizer).
import re, json

def word_regex_ascii(word):
    return r"\b{}\b".format(re.escape(word))

def findTokenOffset(text, tokens):
    seen = {}   # maps each token to the end offset of its last accepted match
    items = []  # word tokens
    my_regex = word_regex_ascii
    # for each token word
    for index_word, word in enumerate(tokens):
        r = re.compile(my_regex(word), flags=re.I | re.X | re.UNICODE)
        item = {}
        # for each matched token in sentence
        for m in r.finditer(text):
            token = m.group()
            characterOffsetBegin = m.start()
            characterOffsetEnd = characterOffsetBegin + len(m.group()) - 1  # LP: start from 0
            found = -1
            if word in seen:
                found = seen[word]
            if characterOffsetBegin > found:
                # store where this word was last seen
                seen[word] = characterOffsetEnd
                item['index'] = index_word + 1  # word index starts from 1
                item['word'] = token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    return items
This code works fine when the tokens are single words, for example:
text = "George Washington came to Washington"
tokens = text.split()
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2))
But suppose some tokens are multi-word phrases, like here:
text = "George Washington came to Washington"
tokens = ["George Washington", "Washington"]
offsets = findTokenOffset(text,tokens)
print(json.dumps(offsets, indent=2))
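Then the offsets are wrong because the same word occurs in more than one token: the standalone Washington is reported at offset 7, inside George Washington, instead of at offset 26: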
[
  {
    "index": 1,
    "word": "George Washington",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 16
  },
  {
    "index": 2,
    "word": "Washington",
    "characterOffsetBegin": 7,
    "characterOffsetEnd": 16
  }
]
How can I add support for multi-word tokens and overlapping token matches in the regex (thanks to the suggestion in the comments for the exact name of this problem)?
If you do not need the search phrase/word index information in the resulting output, you can use the following approach:
import re, json

def findTokenOffset(text, pattern):
    items = []
    for m in pattern.finditer(text):
        item = {}
        item['word'] = m.group()
        item['characterOffsetBegin'] = m.start()
        item['characterOffsetEnd'] = m.end()
        items.append(item)
    return items

text = "George Washington came to Washington Washington.com"
tokens = ["George Washington", "Washington"]
pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I)
offsets = findTokenOffset(text, pattern)
print(json.dumps(offsets, indent=2))
The output of the Python demo:
[
  {
    "word": "George Washington",
    "characterOffsetBegin": 0,
    "characterOffsetEnd": 17
  },
  {
    "word": "Washington",
    "characterOffsetBegin": 26,
    "characterOffsetEnd": 36
  }
]
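If the index of the search phrase is needed after all, one possible way to recover it is to look the matched text up in a dictionary built from the token list. The sketch below is only an illustration under stated assumptions: the function name find_token_offsets_with_index and the index_of dictionary are hypothetical, the lookup assumes that lowercasing the match gives back the lowercased token (which can fail for some unusual Unicode case mappings, hence the .get call), and when a token is listed twice the first position wins.

```python
import re

def find_token_offsets_with_index(text, tokens):
    """Hypothetical variant that also reports each token's 1-based position in `tokens`."""
    # Case-insensitive lookup from token text to its 1-based index
    # (assumption: if a token is listed twice, the first position wins).
    index_of = {}
    for i, tok in enumerate(tokens, start=1):
        index_of.setdefault(tok.lower(), i)

    # Same pattern as in the answer: escaped tokens, longest first, joined with "|".
    alternation = "|".join(sorted(map(re.escape, tokens), key=len, reverse=True))
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)(?!\.\b)", re.I)

    items = []
    for m in pattern.finditer(text):
        items.append({
            # .get returns None if the lowercased match is not a known token
            "index": index_of.get(m.group().lower()),
            "word": m.group(),
            "characterOffsetBegin": m.start(),
            "characterOffsetEnd": m.end(),  # exclusive end, as returned by m.end()
        })
    return items

text = "George Washington came to Washington Washington.com"
tokens = ["George Washington", "Washington"]
offsets = find_token_offsets_with_index(text, tokens)
```

Here the index says which entry of tokens matched, not the position of the word within the text.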
The main part is pattern = re.compile(fr'(?<!\w)(?:{"|".join(sorted(map(re.escape, tokens), key=len, reverse=True))})(?!\w)(?!\.\b)', re.I), which does the following:
map(re.escape, tokens) - escapes special chars inside the tokens strings
sorted(..., key=len, reverse=True) - sorts the escaped tokens by length in descending order (so that Washington Post could match earlier than Washington)
"|".join(...) - creates an alternation list of the tokens, token1|token2|etc
(?<!\w)(?:...)(?!\w)(?!\.\b) - is the final pattern that matches all the alternatives in tokens as whole words. (?<!\w) and (?!\w) are used to enable word boundary detection even if the tokens start/end with a special character.

NOTE ON WORD BOUNDARIES
You should check your token boundary requirements. I added (?!\.\b) because you mention that Washington should not match in Washington.com, so I inferred that you want to fail any word match that is immediately followed by . and a word boundary. There are a lot of other possible solutions, the main one being whitespace boundaries, (?<!\S) and (?!\S).
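To compare the two boundary styles, the pattern construction described above can be sketched step by step. This is a minimal illustration, not part of the original demo: the helper name build_token_pattern and its whitespace_boundaries flag are hypothetical, and the sample text is changed slightly to include a comma so that the difference becomes visible.

```python
import re

def build_token_pattern(tokens, whitespace_boundaries=False):
    """Hypothetical helper that builds the alternation pattern one step at a time."""
    escaped = map(re.escape, tokens)                  # 1. escape special characters
    ordered = sorted(escaped, key=len, reverse=True)  # 2. longest alternatives first
    alternation = "|".join(ordered)                   # 3. token1|token2|...
    if whitespace_boundaries:
        # Match only when enclosed by whitespace or the start/end of the string.
        return re.compile(rf"(?<!\S)(?:{alternation})(?!\S)", re.I)
    # Word-style boundaries, plus "not followed by a dot and a word character".
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)(?!\.\b)", re.I)

text = "George Washington came to Washington, then Washington.com"
tokens = ["George Washington", "Washington"]

word_spans = [m.span() for m in build_token_pattern(tokens).finditer(text)]
space_spans = [m.span() for m in build_token_pattern(tokens, True).finditer(text)]

print(word_spans)   # [(0, 17), (26, 36)]
print(space_spans)  # [(0, 17)]
```

With whitespace boundaries, the Washington followed by a comma is rejected as well, so which style fits depends on whether punctuation next to a token should still count as a boundary.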
Besides, see Match a whole word in a string using dynamic regex.