I want to match the titles of h1 to h6 in an HTML file, without returning the h tags themselves, using regular expressions.
Consider the following piece of an HTML file. I want to match "Welcome to my Homepage", "SQL", "RegEx", but not "This is not a valid HTML" (which is surrounded by a pair of unmatched tags).
<body>
<H1>Welcome to my Homepage</H1>
Content is divided into two sections:<br/>
<h2>SQL</h2>
Information about SQL.
<h2>RegEx</h2>
Information about Regular Expressions.
<h3>This is not a valid HTML</h4>
</body>
I use (?<=<[hH]([1-6])>).*?(?=<\/[hH]\1>) at regex101.com. However, it also matches the numbers 1, 2 in the tags <H1> and <h2>.
How can I fix it?
it also matches the numbers 1, 2 in the tags <H1> and <h2>.
Not really. The match itself contains only the heading text. The number comes from the capturing group ([1-6]) inside your lookbehind, which regex101 reports separately as Group 1. You can simply ignore it.
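To see the difference between the full match and the group outside regex101, here is a minimal sketch in Python. It assumes Python's re module rather than the flavor selected on regex101, and the variable names (html, pattern, titles, levels) are made up for illustration. The lookbehind is fixed-width, so re accepts it.

```python
import re

# The HTML fragment from the question.
html = """<body>
<H1>Welcome to my Homepage</H1>
Content is divided into two sections:<br/>
<h2>SQL</h2>
Information about SQL.
<h2>RegEx</h2>
Information about Regular Expressions.
<h3>This is not a valid HTML</h4>
</body>"""

# Same pattern as in the question; the escaped slash is not needed in Python.
pattern = re.compile(r"(?<=<[hH]([1-6])>).*?(?=</[hH]\1>)")

# group(0) is the full match: only the heading text, no tags.
titles = [m.group(0) for m in pattern.finditer(html)]

# group(1) is the digit captured inside the lookbehind.
levels = [m.group(1) for m in pattern.finditer(html)]

print(titles)
print(levels)
```

Note that re.findall returns the contents of the group when a pattern has exactly one capturing group, so with this pattern it would yield the digits instead of the titles. Iterating with finditer and reading group(0) avoids that.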