regex, html-parsing, regex-lookarounds, non-greedy

How to match only the titles between <h></h> tags, without returning the tags themselves, using regular expressions?


I want to match the titles of h1 to h6 in an HTML file, without returning the h tags themselves, using regular expressions.

Consider the following piece of an HTML file. I want to match "Welcome to my Homepage", "SQL", "RegEx", but not "This is not a valid HTML" (which is surrounded by a pair of unmatched tags).

<body>
  <H1>Welcome to my Homepage</H1>
  Content is divided into two sections:<br/>
  <h2>SQL</h2>
  Information about SQL.
  <h2>RegEx</h2>
  Information about Regular Expressions.
  <h3>This is not a valid HTML</h4>
</body>

I use (?<=<[hH]([1-6])>).*?(?=<\/[hH]\1>) on regex101.com. However, it also matches the numbers 1, 2 in the tags <H1> and <h2>.

How can I fix it?


Solution

  • it also matches the numbers 1, 2 in the tags <H1> and <h2>.

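    Not really. The full match contains only the content. The number is reported separately as group 1, because the ([1-6]) inside your lookbehind is a capturing group, and the \1 backreference in the lookahead needs it to pair each opening tag with a closing tag of the same level. You can just ignore that group.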
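
To see the difference between the full match and the captured group outside regex101, here is a minimal sketch in Python. Python is an assumption here, since the question does not name a language; the variable names (html, pattern, titles, digits) are made up for illustration. It runs the same pattern over the sample HTML and reads the full match (group 0) and the captured digit (group 1) separately.

```python
import re

# Sample HTML from the question; the last heading has mismatched tags.
html = """<body>
  <H1>Welcome to my Homepage</H1>
  Content is divided into two sections:<br/>
  <h2>SQL</h2>
  Information about SQL.
  <h2>RegEx</h2>
  Information about Regular Expressions.
  <h3>This is not a valid HTML</h4>
</body>"""

# Same pattern as in the question. The lookbehind is fixed-width,
# which Python's re module requires.
pattern = re.compile(r"(?<=<[hH]([1-6])>).*?(?=</[hH]\1>)")

# group(0) is the full match: only the text between the tags.
titles = [m.group(0) for m in pattern.finditer(html)]

# group(1) is the heading level captured inside the lookbehind.
digits = [m.group(1) for m in pattern.finditer(html)]

# findall() returns group contents instead of the full match
# when the pattern has exactly one capturing group.
only_groups = pattern.findall(html)

print(titles)       # ['Welcome to my Homepage', 'SQL', 'RegEx']
print(digits)       # ['1', '2', '2']
print(only_groups)  # ['1', '2', '2']
```

The mismatched <h3>...</h4> line is skipped because \1 requires the closing tag to carry the same digit as the opening one. In Python specifically, prefer finditer() with group(0) over findall() for this pattern, since findall() would hand back only the digits.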