regex, visual-studio-code, find-replace

Advanced VS Code regex find/replace: use the string inside <h1> to add another <td> below each occurring <td>


An example describes it better. Suppose you have a structure like this:

<h1>TITLE OF HEAD 1</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 1, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 2, AFTER HEAD 1</td>
        </tr>
    </tbody>
</table>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 3, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 4, AFTER HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 5, AFTER HEAD 1</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 6, AFTER HEAD 2</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 7, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 8, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 9, AFTER HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 10, AFTER HEAD 3</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 11, AFTER HEAD 4</td>
        </tr>
        <tr>
            <td class="one">ITEM 12, AFTER HEAD 4</td>
        </tr>
    </tbody>
</table>

And with regex, the outcome should be:

<table>
    <tbody>
        <tr>
            <td class="one">ITEM 1, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 2, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
    </tbody>
</table>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 3, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 4, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
        <tr>
            <td class="one">ITEM 5, AFTER HEAD 1</td>
            <td class="two">TITLE OF HEAD 1</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 2</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 6, AFTER HEAD 2</td>
            <td class="two">TITLE OF HEAD 2</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 3</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 7, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 8, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 9, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
        <tr>
            <td class="one">ITEM 10, AFTER HEAD 3</td>
            <td class="two">TITLE OF HEAD 3</td>
        </tr>
    </tbody>
</table>
<h1>TITLE OF HEAD 4</h1>
<table>
    <tbody>
        <tr>
            <td class="one">ITEM 11, AFTER HEAD 4</td>
            <td class="two">TITLE OF HEAD 4</td>
        </tr>
        <tr>
            <td class="one">ITEM 12, AFTER HEAD 4</td>
            <td class="two">TITLE OF HEAD 4</td>
        </tr>
    </tbody>
</table>

What I've tried so far:

Now getting the strings inside the <h1> is easy:

find: (<h1>)(.*?)(</h1>)
replace: $2

Then I tried:

find: (<h1>)(.*?)(</h1>)(\n|.)*?(<td class="one">.*?</td>)
replace: $5<td class="two">$2</td>

which works, but the other tags are removed as well, so I've modified it:

find: (<h1>)(.*?)(</h1>)((\n|.)*?)(<td class="one">.*?</td>)
replace: $4$6<td class="two">$2</td>

The text of each h1 should be used for every td that follows it, until the next h1 occurs, whose text is then used instead. The problem is that this only works for the first td after each h1, not for all tds.
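The behaviour described above can be reproduced outside the editor. Below is a minimal sketch in Python, assuming Python's `re` module as a stand-in for the VS Code engine (with `\N` in place of `$N`); the shortened sample and the variable names are made up for illustration. Every match has to begin at an `<h1>`, and it ends right after the first `<td>`, so a second `<td>` under the same heading has no `<h1>` left to start from.

```python
import re

# Shortened, hypothetical sample: one heading followed by two rows.
html = (
    '<h1>HEAD</h1>\n'
    '<tr>\n'
    '    <td class="one">ITEM 1</td>\n'
    '</tr>\n'
    '<tr>\n'
    '    <td class="one">ITEM 2</td>\n'
    '</tr>\n'
)

# The third attempt from the question, unchanged apart from Python syntax.
pattern = re.compile(r'(<h1>)(.*?)(</h1>)((\n|.)*?)(<td class="one">.*?</td>)')

# subn returns the new text and the number of replacements made.
result, n = pattern.subn(r'\4\6<td class="two">\2</td>', html)

print(n)       # number of matches found
print(result)  # only ITEM 1 gets a second cell
```

The single match consumes the heading together with the first cell, which is why the pattern needs some way (such as a lookbehind) to see the heading again without consuming it.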

Could somebody tell me what needs to be added to the regex for this to work?

Thank you!


Solution

  • Use

    <h1>([^<]*)<\/h1>\s*\n([\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)|(?<=<h1>([^<]*)<\/h1>[\w\W]*?)(([^\n\S]*)<td\s.*?<\/td>(\n))(?=\s*<\/tr>)
    

    See regex proof.

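    Replace with:

    $2$3$4$7$8<td class="two">$1$6</td>$5$9

    Only one of the two alternatives takes part in any given match, so the groups of the other alternative are empty. $1 or $6 supplies the heading text, $2$3 or $7 puts back the text that was matched (up to and including the original <td> line), $4 or $8 repeats the indentation for the new cell, and $5 or $9 is the closing newline. Note that the first alternative also matches the <h1> line itself, and the replacement does not write it back.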

    EXPLANATION

    NODE                     EXPLANATION
    --------------------------------------------------------------------------------
      <h1>                     '<h1>'
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        [^<]*                    any character except: '<' (0 or more
                                 times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
      </h1>                    '</h1>'
    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      \n                       '\n' (newline)
    --------------------------------------------------------------------------------
      (                        group and capture to \2:
    --------------------------------------------------------------------------------
        [\w\W]*?                 any character of: word characters (a-z,
                                 A-Z, 0-9, _), non-word characters (all
                                 but a-z, A-Z, 0-9, _) (0 or more times
                                 (matching the least amount possible))
    --------------------------------------------------------------------------------
      )                        end of \2
    --------------------------------------------------------------------------------
      (                        group and capture to \3:
    --------------------------------------------------------------------------------
        (                        group and capture to \4:
    --------------------------------------------------------------------------------
          [^\n\S]*                 any character except: '\n' (newline),
                                   non-whitespace (all but \n, \r, \t,
                                   \f, and " ") (0 or more times
                                   (matching the most amount possible))
    --------------------------------------------------------------------------------
        )                        end of \4
    --------------------------------------------------------------------------------
        <td                      '<td'
    --------------------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
    --------------------------------------------------------------------------------
        .*?                      any character except \n (0 or more times
                                 (matching the least amount possible))
    --------------------------------------------------------------------------------
        </td>                    '</td>'
    --------------------------------------------------------------------------------
        (                        group and capture to \5:
    --------------------------------------------------------------------------------
          \n                       '\n' (newline)
    --------------------------------------------------------------------------------
        )                        end of \5
    --------------------------------------------------------------------------------
      )                        end of \3
    --------------------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    --------------------------------------------------------------------------------
        \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                                 or more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        </tr>                    '</tr>'
    --------------------------------------------------------------------------------
      )                        end of look-ahead
    --------------------------------------------------------------------------------
     |                        OR
    --------------------------------------------------------------------------------
      (?<=                     look behind to see if there is:
    --------------------------------------------------------------------------------
        <h1>                     '<h1>'
    --------------------------------------------------------------------------------
        (                        group and capture to \6:
    --------------------------------------------------------------------------------
          [^<]*                    any character except: '<' (0 or more
                                   times (matching the most amount
                                   possible))
    --------------------------------------------------------------------------------
        )                        end of \6
    --------------------------------------------------------------------------------
        </h1>                    '</h1>'
    --------------------------------------------------------------------------------
        [\w\W]*?                 any character of: word characters (a-z,
                                 A-Z, 0-9, _), non-word characters (all
                                 but a-z, A-Z, 0-9, _) (0 or more times
                                 (matching the least amount possible))
    --------------------------------------------------------------------------------
      )                        end of look-behind
    --------------------------------------------------------------------------------
      (                        group and capture to \7:
    --------------------------------------------------------------------------------
        (                        group and capture to \8:
    --------------------------------------------------------------------------------
          [^\n\S]*                 any character except: '\n' (newline),
                                   non-whitespace (all but \n, \r, \t,
                                   \f, and " ") (0 or more times
                                   (matching the most amount possible))
    --------------------------------------------------------------------------------
        )                        end of \8
    --------------------------------------------------------------------------------
        <td                      '<td'
    --------------------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
    --------------------------------------------------------------------------------
        .*?                      any character except \n (0 or more times
                                 (matching the least amount possible))
    --------------------------------------------------------------------------------
        </td>                    '</td>'
    --------------------------------------------------------------------------------
        (                        group and capture to \9:
    --------------------------------------------------------------------------------
          \n                       '\n' (newline)
    --------------------------------------------------------------------------------
        )                        end of \9
    --------------------------------------------------------------------------------
      )                        end of \7
    --------------------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    --------------------------------------------------------------------------------
        \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                                 or more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        </tr>                    '</tr>'
    --------------------------------------------------------------------------------
      )                        end of look-ahead
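If the edit has to be scripted rather than done in the editor, the same result can be sketched in Python. This is not the method from the answer above: Python's `re` module only accepts fixed-width lookbehinds, so that pattern will not compile there. Instead, the sketch walks the text once and remembers the most recent heading in a callback. The names `add_heading_cells` and `TOKEN` and the sample are hypothetical, and the sketch assumes one `<td>` per line and LF line endings. Unlike the replacement above, it leaves the `<h1>` lines in place.

```python
import re

# Either a heading (group 1 = its text) or a whole line holding one <td>
# (group 2 = indentation, group 3 = the cell itself).
TOKEN = re.compile(
    r'<h1>([^<]*)</h1>'
    r'|^([^\S\n]*)(<td\s.*?</td>)[^\S\n]*$',
    re.MULTILINE,
)

def add_heading_cells(html: str) -> str:
    """Append <td class="two">HEADING</td> after every <td> line."""
    current = None  # text of the most recent <h1>, if any

    def repl(m: re.Match) -> str:
        nonlocal current
        if m.group(1) is not None:
            # Heading: remember its text and leave the heading unchanged.
            current = m.group(1)
            return m.group(0)
        if current is None:
            # A cell before any heading: nothing to add.
            return m.group(0)
        indent = m.group(2)
        return f'{m.group(0)}\n{indent}<td class="two">{current}</td>'

    return TOKEN.sub(repl, html)

# Made-up sample in the shape of the question's markup.
sample = (
    '<h1>A</h1>\n'
    '<tr>\n'
    '    <td class="one">1</td>\n'
    '</tr>\n'
    '<tr>\n'
    '    <td class="one">2</td>\n'
    '</tr>\n'
    '<h1>B</h1>\n'
    '<tr>\n'
    '    <td class="one">3</td>\n'
    '</tr>\n'
)

out = add_heading_cells(sample)
print(out)
```

Like the editor pattern, this matches any `<td` followed by whitespace, so running it twice on the same text would add a second round of cells.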