arrays string algorithm .net-8.0 c#-12.0

From a list of groups of words, find if any of them appears in a string and return which word group and which index in the string?

Perhaps I'm just tired, but it seems I can't fathom an easy way to do this.

I have a list of word groups, where each group contains a list of synonyms, e.g. like this:

[
    ["food", "grub", "meal"],
    ["cat", "feline"]
]

Now, from an input string I want to find if it contains any of these words, and in that case I need to know which word, from which group, and where in the string.

We can assume all lowercase and we can also assume that only a single word from the synonyms can occur in the input string.

It would be OK to split the string into words and get the word's index in that array, but a character position within the original input string would also be just fine.

The list of synonym groups is a constant (static readonly) and I can change the data structure if it makes the operation simpler. I could also derive some secondary data structure to support the operation.

My thoughts go along the lines of building a regex with all the words and search for that, which would return the exact word and pos in the input string, and then loop through the synonym groups to find in which group that word appears.

Or, build a dictionary keyed on synonym words, and for each one put the synonym group as the value... That one could be used together with:

string[] synonymGroup;
var inputWords = inputString.Split(' ');
var wordIndex = Array.FindIndex(inputWords, w => dict.TryGetValue(w, out synonymGroup));

Well... Any less complicated suggestions?

C# 12 and .NET 8.

Solution

Since the synonyms to search for are fixed, you should naturally make an index data structure. There are several options for which:

If you can split a string into a sequence of words, it's hashing or trie.

Hashing and hash table = the dictionary you mention. The main disadvantages are that this depends on your choice of hashing function (or the quality of the existing dictionary implementation you choose) and worst-case performance isn't guaranteed to be fast, but there's plenty of existing libraries to choose from.
Trie/prefix tree is a tree in which each path from root to a node corresponds to a prefix of one of your words, and the nodes corresponding to full words are marked with a group you specified. In each node, you store a map (broadly speaking - it's just a suitable data structure) of next character -> "pointer" to possible next node used for extending the current prefix. The number of nodes traversed for inserting and searching a word is guaranteed to be O(inserted/searched word length), but depending on how the map of next nodes is implemented, real runtime varies and memory can grow rather large. For a small simple alphabet like you should have here, I prefer using a fixed-size array next_node[index_of_character] (null if such next prefix hasn't been inserted) in each node, which costs memory O(alphabet size * number of distinct prefixes among your words). There should be existing C# implementations for this as well.

As an aside, if you wanted to search for substrings, not for whole tokens (words) in a sequence of tokens, then there's something called Aho-Corasick tree that again gives guaranteed construction (done at once, can't insert one by one) and search time linear in lengths of your strings. Search works by stepping over characters of your text and for each prefix, you get a node describing all prefixes of your pattern strings that end at the current position.