data-structuresbinary-search-treeprefixprefix-tree

What datastructure is the fastest to find the best matching prefix?


Context: I'm working on an analyzer for useragent strings (Yauaa) and as part of this analysis I want to make an educated guess what brand of the device should be reported. I have an implementation that I need to rewrite to be a lot more efficient.

Because I do not want to have a complete list of all devices I want to do the detection based on the prefix of the model.

So I have a dataset with prefixes and the brand that is associated:

And then I want to do a .get("GT-1234124") which should result in "Samsung" because that is the "longest matching prefix".

I had a look at the Trie structure but that seems to be for the opposite situation. What I understand is that you start with a set of values and you can efficiently get all the values that starts with the provided prefix.

If I were to implement this from scratch I would use a tree similar to the Trie but walk around it differently. What I'm looking for is a datastructure that does what I need as fast as possible.

What datastructure do you recommend for this usecase?

Is there an existing (proven) implementation I can use?


Solution

  • I did some digging into datastructures and found that essentially the Trie structure is what I need with a different way of walking around the structure.

    Since this structure is really simple I created my own implementation that works very well.

    See: https://github.com/nielsbasjes/yauaa/blob/master/analyzer/src/main/java/nl/basjes/parse/useragent/utils/PrefixLookup.java


    Updates:

    1. I wrote an article about this https://techlab.bol.com/finding-the-longest-matching-string-prefix-fast/
    2. I put my implementation into a separate library which I opensourced and which is already available via maven central. See https://github.com/nielsbasjes/prefixmap