algorithmdata-structures

Generate simple URL patterns from a list of examples


Let's say I have a large list of URLs that looks something like the following:

foo.com/abc/123
foo.com/abc/456
foo.com/abc/789
bar.com/11111/xyz
bar.com/22222/xyz
bar.com/33333/xyz
etc.

I would like to process this list and produce very simple glob-like templates that generalize a particular path segment if there are 3 or more URLs in the list that only differ by that path segment. The generalized path segment is replaced with a wildcard character (*).

So for the list above, I want my program to produce two templates:

foo.com/abc/*
bar.com/*/xyz

Note that the generalized path segment could appear anywhere in the URL. Is there an efficient algorithm for producing templates such as these? I don't need a full regular expression generator.


Solution

  • It depends on how complicated you want things to get. Segment replacement is pretty simple.

    Let's look first at the simplest case: common prefixes, as in your example for foo.com.

    Let me extend your example a little bit:

    foo.com/abc/123
    foo.com/abc/456
    foo.com/abc/789
    foo.com/123/abc
    foo.com/123/def
    foo.com/123/ghi
    

    What we're going to do is build a hierarchy (a tree) with foo.com at top, and two children: abc and 123. Each of the children will have three child nodes. So what you have is:

    foo.com
      abc
        123
        456
        789
      123
        abc
        def
        ghi
        
    

    It's easy to write a recursive program that traverses this structure to find the parents with leaf nodes, and then decide if you want to make a template. How you make that decision is unclear. Perhaps you say that if a node has three or more children, you create a template for it. So here you'd have foo.com/abc/* and foo.com/123/*.

    But you might also have:

    foo.com/abc/123/barby
    foo.com/abc/123/fooby
    foo.com/abc/123/foobidity
    

    You might then want an additional rule: foo.com/abc/123/*

    You can find these common sequences based on common substrings very easily.

    You can extend that method by transforming your URLs. So, given your bar.com example:

    bar.com/11111/xyz
    bar.com/22222/xyz
    bar.com/33333/xyz
    

    You transform the URLs to:

    bar.com/xyz/11111
    bar.com/xyz/22222
    bar.com/xyz/33333
    

    Then go through the hierarchy building process again. But here you want to make sure that there is exactly the same number of "xyz" as there are items in the second segment. For example if there were a fourth URL, bar.com/4444/xyz, you wouldn't want to generate the template bar.com/*/xyz.

    This technique is simple, and quite effective if you're looking for segment-based replacement patterns. It's reasonably efficient if the number of segments per URL isn't very large. But then if you have a site with more than, say, a half dozen hierarchy levels, you probably have other problems. Or not. Those sites typically already have an overall structure that you can exploit. Things like bigblogsite.com/user-name/section/year/month/day/title.

    It gets more difficult when you're looking for two-segment replacement, especially when the two segments aren't contiguous. For example, trying to find fooby.com/*/xyz/*/barby.