Let's say I have a large list of URLs that looks something like the following:
foo.com/abc/123
foo.com/abc/456
foo.com/abc/789
bar.com/11111/xyz
bar.com/22222/xyz
bar.com/33333/xyz
etc.
I would like to process this list and produce very simple glob-like templates that generalize a particular path segment if there are 3 or more URLs in the list that only differ by that path segment. The generalized path segment is replaced with a wildcard character (*).
So for the list above, I want my program to produce two templates:
foo.com/abc/*
bar.com/*/xyz
Note that the generalized path segment could appear anywhere in the URL. Is there an efficient algorithm for producing templates such as these? I don't need a full regular expression generator.
It depends on how complicated you want things to get. Segment replacement is pretty simple.
Let's look first at the simplest case: common prefixes, as in your example for foo.com.
Let me extend your example a little bit:
foo.com/abc/123
foo.com/abc/456
foo.com/abc/789
foo.com/123/abc
foo.com/123/def
foo.com/123/ghi
What we're going to do is build a hierarchy (a tree) with foo.com at top, and two children: abc and 123. Each of the children will have three child nodes. So what you have is:
foo.com
abc
123
456
789
123
abc
def
ghi
It's easy to write a recursive program that traverses this structure to find the parents with leaf nodes, and then decide if you want to make a template. How you make that decision is unclear. Perhaps you say that if a node has three or more children, you create a template for it. So here you'd have foo.com/abc/*
and foo.com/123/*
.
But you might also have:
foo.com/abc/123/barby
foo.com/abc/123/fooby
foo.com/abc/123/foobidity
You might then want an additional rule: foo.com/abc/123/*
You can find these common sequences based on common substrings very easily.
You can extend that method by transforming your URLs. So, given your bar.com example:
bar.com/11111/xyz
bar.com/22222/xyz
bar.com/33333/xyz
You transform the URLs to:
bar.com/xyz/11111
bar.com/xyz/22222
bar.com/xyz/33333
Then go through the hierarchy building process again. But here you want to make sure that there is exactly the same number of "xyz" as there are items in the second segment. For example if there were a fourth URL, bar.com/4444/xyz
, you wouldn't want to generate the template bar.com/*/xyz
.
This technique is simple, and quite effective if you're looking for segment-based replacement patterns. It's reasonably efficient if the number of segments per URL isn't very large. But then if you have a site with more than, say, a half dozen hierarchy levels, you probably have other problems. Or not. Those sites typically already have an overall structure that you can exploit. Things like bigblogsite.com/user-name/section/year/month/day/title.
It gets more difficult when you're looking for two-segment replacement, especially when the two segments aren't contiguous. For example, trying to find fooby.com/*/xyz/*/barby
.