arraysstringalgorithm.net-8.0c#-12.0

From a list of groups of words, find if any of them appears in a string and return which word group and which index in the string?


Perhaps I'm just tired, but it seems I can't fathom an easy way to do this.

I have a list of word groups, where each group contains a list of synonyms, e.g. like this:

[
    ["food", "grub", "meal"],
    ["cat", "feline"]
]

Now, from an input string I want to find if it contains any of these words, and in that case I need to know which word, from which group, and where in the string.

We can assume all lowercase and we can also assume that only a single word from the synonyms can occur in the input string.

It would be OK to split the string into words and get the word's index in that array, but a character position within the original input string would also be just fine.

The list of synonym groups is a constant (static readonly) and I can change the data structure if it makes the operation simpler. I could also derive some secondary data structure to support the operation.

My thoughts go along the lines of building a regex with all the words and search for that, which would return the exact word and pos in the input string, and then loop through the synonym groups to find in which group that word appears.

Or, build a dictionary keyed on synonym words, and for each one put the synonym group as the value... That one could be used together with:

string[] synonymGroup;
var inputWords = inputString.Split(' ');
var wordIndex = Array.FindIndex(inputWords, w => dict.TryGetValue(w, out synonymGroup));

Well... Any less complicated suggestions?

C# 12 and .NET 8.


Solution

  • Since the synonyms to search for are fixed, you should naturally make an index data structure. There are several options for which:

    If you can split a string into a sequence of words, it's hashing or trie.

    As an aside, if you wanted to search for substrings, not for whole tokens (words) in a sequence of tokens, then there's something called Aho-Corasick tree that again gives guaranteed construction (done at once, can't insert one by one) and search time linear in lengths of your strings. Search works by stepping over characters of your text and for each prefix, you get a node describing all prefixes of your pattern strings that end at the current position.