Tags: ruby, regex, gollum-wiki

Finding and Editing Multiple Regex Matches on the Same Line


I want to add markup to key phrases in a (Gollum) wiki page so that each phrase links to the relevant wiki page. For example:

This is the key phrase.

becomes:

This is the [[key phrase|Glossary#key phrase]].

I have a list of key phrases such as:

keywords = ["golden retriever", "pomeranian", "cat"]

And a document:

Sue has 1 golden retriever. John has two cats.
Jennifer has one pomeranian. Joe has three pomeranians.

I want to iterate over every line and find every match (that isn't already a link) for each keyword. My current attempt looks like this:

File.foreach(target_file) do |line|
    glosses.each do |gloss|
        len = gloss.length
        # Create the regex. Avoid anything that starts with [
        # or (, ends with ] or ), and ignore case.
        re = /(?<![\[\(])#{gloss}(?![\]\)])/i
        # Find every instance of this gloss on this line.
        positions = line.enum_for(:scan, re).map {Regexp.last_match.begin(0) }
        positions.each do |pos|
            line.insert(pos, "[[")
            # +2 because we just inserted 2 ahead.
            line.insert(pos+len+2, "|#{page}\##{gloss}]]")
        end
    end
    puts line
end

However, this breaks when the same key phrase matches twice on one line. Because I insert text into the line, the positions I recorded for the later matches are no longer accurate after the first insertion. I know I could shift each position by the size of my insertions, but because the insertions are a different size for each gloss, that feels like a brute-force, hacky solution.
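
A stripped-down sketch of the same approach makes the drift concrete. The helper name `wrap_with_stale_offsets`, the plain `[[...]]` wrapping, and the sample string are made up for illustration and are not part of the real script.

```ruby
# Hypothetical helper: collect all match offsets first, then insert
# at those offsets, the same way the loop above does.
def wrap_with_stale_offsets(text, word)
  line = text.dup # work on a copy so the caller's string is untouched
  re = /#{Regexp.escape(word)}/
  positions = line.enum_for(:scan, re).map { Regexp.last_match.begin(0) }
  positions.each do |pos|
    line.insert(pos, "[[")
    # +2 accounts for the "[[" inserted just above, but nothing
    # accounts for insertions made for earlier matches.
    line.insert(pos + word.length + 2, "]]")
  end
  line
end

puts wrap_with_stale_offsets("cat and cat", "cat")
# prints: [[cat]] [[and]] cat
```

The second offset (8) was correct for the original string, but after the first match grew by four characters it points at "and" instead of the second "cat".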

Is there a solution that allows me to make multiple insertions on the same line at the same time without several arbitrary adjustments each time?


Solution

  • After looking at @BryceDrew's online Python version, I realized that Ruby probably also has a way to build the replacement from the match itself. I now have a much more concise and faster solution.

    First, I needed to make regexes of my glosses:

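    # One regex per key phrase; Regexp.escape guards against phrases that contain regex metacharacters.
    glosses = keywords.map { |gloss| /(?<![\[\(])#{Regexp.escape(gloss)}(?![\]\)])/i }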
    

    Note: Most of that regex consists of negative look-behind and look-ahead assertions, which skip a phrase that is already part of a link.
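
    As a rough check of what those assertions do and do not catch, the sketch below runs the pattern against a few sample strings. The helper name `gloss_regex` and the sample sentences are assumptions for illustration only.

    ```ruby
    # Hypothetical helper: the same pattern as above for a single key phrase.
    def gloss_regex(gloss)
      /(?<![\[\(])#{Regexp.escape(gloss)}(?![\]\)])/i
    end

    re = gloss_regex("cat")

    p "John has two Cats.".scan(re)                       # => ["Cat"]
    p "See [cat] and (cat).".scan(re)                     # => []
    p "John has two [[cat|Glossary#cat]]s.".scan(re)      # => []
    # The assertions only inspect the characters directly next to the
    # phrase, so a phrase in the middle of existing link text still matches:
    p "A [[big cat|Glossary#big cat]] appears.".scan(re)  # => ["cat"]
    ```

    The last case suggests the guard is a heuristic rather than a full link parser: it works when the key phrase is the whole link text, which is what the generated links look like.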

    Then, I needed to make a union of all of them:

    re = Regexp.union(glosses)
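
    One thing worth knowing about the union: alternatives are tried in the order given, so if one key phrase is a prefix of another, the one listed first wins. The sketch below assumes a hypothetical pair of phrases ("cat" and "cat food") that is not in the original list, and a made-up helper `union_of`; sorting longest-first is one possible way around it.

    ```ruby
    # Hypothetical helper: case-insensitive union of plain-text phrases.
    # Each regex keeps its own /i flag inside the union.
    def union_of(keywords)
      Regexp.union(keywords.map { |k| /#{Regexp.escape(k)}/i })
    end

    phrases = ["cat", "cat food"] # assumed example phrases

    naive  = union_of(phrases)
    sorted = union_of(phrases.sort_by { |k| -k.length }) # longest first

    p "Buy cat food"[naive]   # => "cat"
    p "Buy CAT FOOD"[sorted]  # => "CAT FOOD"
    ```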
    

    After that, it is just a matter of calling gsub on every line and building the replacement from each match in the block:

    File.foreach(target_file) do |line|
      line = line.gsub(re) {|match| "[[#{match}|Glossary##{match.downcase}]]"}
      puts line
    end
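
    Putting the pieces together, here is a self-contained sketch that runs against the sample document from the question. It reads from a StringIO instead of a file so it can be run as-is; the helper names `build_gloss_regex` and `link_glosses` and the `page` parameter are assumptions for this example, not part of the original script.

    ```ruby
    require "stringio"

    # Build one combined regex from plain-text key phrases.
    def build_gloss_regex(keywords)
      Regexp.union(
        keywords.map { |k| /(?<![\[\(])#{Regexp.escape(k)}(?![\]\)])/i }
      )
    end

    # Link every match on every line; gsub handles all matches on a line
    # in one pass, so no offsets need to be tracked.
    def link_glosses(io, re, page = "Glossary")
      io.each_line.map do |line|
        line.gsub(re) { |match| "[[#{match}|#{page}##{match.downcase}]]" }
      end.join
    end

    keywords = ["golden retriever", "pomeranian", "cat"]
    document = "Sue has 1 golden retriever. John has two cats.\n" \
               "Jennifer has one pomeranian. Joe has three pomeranians.\n"

    puts link_glosses(StringIO.new(document), build_gloss_regex(keywords))
    # Sue has 1 [[golden retriever|Glossary#golden retriever]]. John has two [[cat|Glossary#cat]]s.
    # Jennifer has one [[pomeranian|Glossary#pomeranian]]. Joe has three [[pomeranian|Glossary#pomeranian]]s.
    ```

    Because the generated links start with `[` and end with `]`, running the same substitution over the output a second time should leave it unchanged.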