regex expresso

Having problems with a specific regex statement


I'm trying to write a script for Woofy (tl;dr a program that downloads webcomics), but apparently my regular expression to find the link to the previous page isn't working, according to Expresso. I'm trying to find something along the lines of:

<a href="http://70-seas.com/?p=1253" title="Prologue 01" class="previous-comic-link"><span>&lsaquo; Previous</span></a>

that varies with each page, with the URL and title changing to link to whatever the previous page was, with:

<a\shref="http://70-seas.com/?p=[0-9]{4}"\stitle="[.]*\s[.]*\s([.]*)?"\sclass="previous-comic-link"><span>&lsaquo;\sPrevious</span></a>

(Sometimes the titles have three words, sometimes they have two. They always have numbers as the last word, though.)

Given that I have no prior experience or formal training whatsoever with regex, I have no idea what I'm doing wrong. Any help would be appreciated.


Solution

  • There are a few things to address.

    First, take a look at the http://70-seas.com/?p=[0-9]{4} portion. The /? here means the / character is optional, so the pattern never matches the literal ? in the URL. Because ? is a regex metacharacter (it makes the preceding item optional), you need to escape it as \? to match it literally. The updated portion becomes http://70-seas.com/\?p=[0-9]{4}. Since you tagged this with expresso: walking the pattern tree in Expresso would have let you spot this issue.
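    To make the difference concrete, here is a minimal sketch in Python (chosen only for illustration; Woofy and Expresso use the .NET regex engine, but ? behaves the same way in both). The URL is the one from the question; the variable names are made up for this example.

```python
import re

url = "http://70-seas.com/?p=1253"

# Unescaped: "/?" means "an optional slash", so the literal "?" is never matched.
unescaped = re.compile(r"http://70-seas.com/?p=[0-9]{4}")
# Escaped: "\?" matches a literal question mark.
escaped = re.compile(r"http://70-seas.com/\?p=[0-9]{4}")

unescaped_hit = unescaped.search(url)
escaped_hit = escaped.search(url)

# The unescaped version only matches URLs that have no "?" at all.
no_question_mark_hit = unescaped.search("http://70-seas.com/p=1253")

print(unescaped_hit)          # no match object for the real URL
print(escaped_hit.group(0))   # the full URL
```

    The unescaped pattern fails on the real URL and only succeeds on a URL without the question mark, which is the opposite of what you want.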

    Next, the real problem is with this portion: title="[.]*\s[.]*\s([.]*)?". Square brackets in regex denote a character class, which matches any one of the characters inside it. Inside a class the dot loses its special meaning, so [.] means "match a literal '.' character", which isn't your intention. You probably wanted the . metacharacter, which matches any character (except a newline, by default). In addition, you made only the third word optional, while the \s before it is still required; the \s belongs inside that optional group. With these points in mind, you should've used: title=".*\s.*(\s.*)?".
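    A quick way to see what [.] actually does, again as a Python sketch rather than anything Woofy-specific. The title "Prologue 01" comes from the question; the dotted title is a made-up input that exists only to show what the original pattern would accept.

```python
import re

title_attr = 'title="Prologue 01"'

# Original: [.] is a character class containing only a literal dot.
literal_dots = re.compile(r'title="[.]*\s[.]*\s([.]*)?"')
# Corrected: . is the "any character" metacharacter, and \s moved into the group.
any_chars = re.compile(r'title=".*\s.*(\s.*)?"')

dots_on_real_title = literal_dots.search(title_attr)
dots_on_dot_title = literal_dots.fullmatch('title="... .. ."')
any_on_real_title = any_chars.fullmatch(title_attr)

print(dots_on_real_title)              # no match for the real title
print(dots_on_dot_title is not None)   # the original only accepts runs of dots
print(any_on_real_title is not None)   # the corrected version accepts the real title
```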

    That should work. However, it's not the best regex, and the use of .* is usually a red flag for me. The . matches any character and the * quantifier is greedy, so .* can consume more than intended. It is best to be specific. If you want to match word characters (letters, digits and the underscore), use \w instead. Based on your description, you expect one to three words. This can be expressed as \w+(?:\s\w+){0,2}. Much cleaner and easier to understand. It matches one or more word characters, followed by the group (?:\s\w+), which matches a whitespace character and then one or more word characters again. The {0,2} quantifier after the group means the group may occur zero to two times. The (?:...) syntax makes the group non-capturing, which is appropriate when you don't need the captured text and avoids the small overhead of storing it.
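    Here is a small Python sketch of how \w+(?:\s\w+){0,2} behaves on a few titles. Only "Prologue 01" is from the question; the other titles are hypothetical test inputs. Note that \w does not match punctuation, so a title containing an apostrophe is rejected.

```python
import re

words = re.compile(r"\w+(?:\s\w+){0,2}")

samples = [
    "01",                # one word
    "Prologue 01",       # two words (from the question)
    "Chapter One 12",    # three words (hypothetical)
    "One Two Three 04",  # four words: too many for {0,2}
    "Who's There 02",    # apostrophe is not a word character
]

# fullmatch requires the whole title to fit the pattern.
results = {title: bool(words.fullmatch(title)) for title in samples}

for title, ok in results.items():
    print(f"{title!r}: {ok}")
```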

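    One thing you may also need to do is escape the double quotes. A double quote is not a regex metacharacter, so the regex engine itself doesn't care. Whether escaping is needed depends on where the pattern ends up: if it is embedded in a string that is itself delimited by double quotes, the inner quotes must be escaped. In that case your double quotes would become \".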
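    As an illustration of when the quote escaping matters, here is a Python sketch. It assumes Python's string-literal and regex rules; other languages and tools (including whatever Woofy's script format expects) may differ, so treat it as an analogy rather than a description of Woofy.

```python
import re

tag = '<a href="http://70-seas.com/?p=1253" title="Prologue 01">'

# Single-quoted raw string: double quotes need no escaping at all.
single_quoted = r'title="\w+(?:\s\w+){0,2}"'

# Double-quoted ordinary string: quotes and backslashes must be escaped
# for the string literal, but the resulting pattern text is identical.
double_quoted = "title=\"\\w+(?:\\s\\w+){0,2}\""

# Escaping the quote at the regex level is harmless: \" still matches ".
regex_escaped = r'title=\"\w+(?:\s\w+){0,2}\"'

matches = [re.search(p, tag).group(0) for p in (single_quoted, double_quoted, regex_escaped)]
print(matches)
```

    All three forms match the same text, so the escaping is about the surrounding string syntax, not about the regex.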

    You should now have a pattern like this:

    <a\shref=\"http://70-seas.com/\?p=[0-9]{4}\"\stitle=\"\w+(?:\s\w+){0,2}\"\sclass=\"previous-comic-link\"><span>&lsaquo;\sPrevious</span></a>
    

    That's great, but this could be even simpler. Whenever you have to match some content between double quotes, and you don't need to capture any of the individual items inside to refer to later, you can simplify this by using title=\"[^"]+\". The [^"]+ part uses a negated character class, which is indicated by the ^ character at the start of the class. It matches any character that is not a double quote, so the match stops when it reaches the double quote at the end of the title. (If you are escaping quotes for a string literal, escape the one inside the class as well.) There's no need to worry about one to three words, since you just want to match the entire content of the title.
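    To show why [^"]+ is safer than a greedy .+ between quotes, here is a short Python sketch using the opening tag from the question. The variable names are illustrative only.

```python
import re

html = '<a href="http://70-seas.com/?p=1253" title="Prologue 01" class="previous-comic-link">'

# Negated class: stops at the first closing quote after title=".
negated = re.compile(r'title="[^"]+"')
# Greedy dot: runs on to the LAST double quote in the tag.
greedy = re.compile(r'title=".+"')

negated_match = negated.search(html).group(0)
greedy_match = greedy.search(html).group(0)

print(negated_match)
print(greedy_match)
```

    The negated class yields just the title attribute, while the greedy version swallows the class attribute as well.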

    The new pattern becomes:

    <a\shref=\"http://70-seas.com/\?p=[0-9]{4}\"\stitle=\"[^"]+\"\sclass=\"previous-comic-link\"><span>&lsaquo;\sPrevious</span></a>
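    If you want to sanity-check the final pattern outside Expresso, here is a minimal Python sketch run against the anchor tag from the question. Two things in it are my own additions and not part of the pattern above: the capture group around the URL (handy if you need the link itself) and the escaped dot in 70-seas\.com. The pattern is written in single-quoted raw strings, so the double quotes need no escaping here.

```python
import re

# The anchor tag exactly as given in the question.
page = ('<a href="http://70-seas.com/?p=1253" title="Prologue 01" '
        'class="previous-comic-link"><span>&lsaquo; Previous</span></a>')

PREVIOUS_LINK = re.compile(
    r'<a\shref="(http://70-seas\.com/\?p=[0-9]{4})"'   # group 1: the URL (added for illustration)
    r'\stitle="[^"]+"'                                 # any title without a double quote
    r'\sclass="previous-comic-link"><span>&lsaquo;\sPrevious</span></a>'
)

match = PREVIOUS_LINK.search(page)
previous_url = match.group(1) if match else None
print(previous_url)

# [0-9]{4} requires exactly four digits, so a three-digit page id is rejected.
three_digit = PREVIOUS_LINK.search(page.replace("1253", "987"))
print(three_digit)
```

    If the site has page ids shorter or longer than four digits, [0-9]+ would be the more forgiving choice.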