regex html-parsing kantu

RegEx to capture everything between two strings but avoid capturing commas


Hello StackOverflow Community. Kindly review the following screenshot (image not available).

As you can see, I'm capturing everything between the <title> and </title> tags, but I want to avoid capturing any commas that might exist in the text.

Currently I get:

Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4", 3/8" &amp; 1/2" Drive Monster Green

What I want to get:

Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4" 3/8" &amp; 1/2" Drive Monster Green

I need a one-line regex that does that for me. Any ideas?

This is the regex command that I use:

(?<=<title\>)(.*?)(?=\s*\<)

Sample text is:

<title>Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4", 3/8" &amp; 1/2" Drive Monster Green</title>

I'm using Kantu Browser Automation to extract the title of some webpages. Bear in mind that I'm scraping the whole web page HTML.
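
For anyone who wants to reproduce the behaviour outside Kantu, here is a minimal Python sketch. It assumes Python's `re` module behaves like Kantu's regex engine for this pattern, which I have not verified; the variable names are only illustrative, and the unnecessary backslashes from the pattern are dropped.

```python
import re

# The pattern from the question: everything after <title> up to the next "<".
PATTERN = re.compile(r"(?<=<title>)(.*?)(?=\s*<)")

# The sample text from the question.
html = '<title>Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4", 3/8" &amp; 1/2" Drive Monster Green</title>'

match = PATTERN.search(html)
captured = match.group(1) if match else None

# The comma after 1/4" is still part of the captured text.
print(captured)
```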

If this is not possible, what about matching only up to the first comma? For example, returning this:

Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4"

Thank you for your time.


Solution

  • As mentioned in the comments, a regular expression can't alter the text it matches; it only determines whether and where something matches. Removing the commas therefore needs a separate replace step after the match.

    If you're willing to stop the match at the first comma, rather than including all the rest with the commas removed, you can use this:

    (?<=<title\>)(.*?)(?=(,|\s*<\/title>))
    

    https://regex101.com/r/PPb1ba/1
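
To make the two options concrete, here is a rough Python sketch. It is not Kantu-specific: Python's `re` module stands in for the regex engine, and the names `FIRST_PART`, `FULL_TITLE`, `first_part` and `no_commas` are made up for this illustration. The first pattern is the one above (minus the unneeded backslashes) and stops at the first comma. The second captures the whole title and then strips the commas in a separate step, since the match itself cannot do that.

```python
import re

# The sample text from the question.
html = '<title>Kincrome K1500G - Tool Workshop Contour 472 Piece 15 Drawer 1/4", 3/8" &amp; 1/2" Drive Monster Green</title>'

# Option 1: stop at the first comma, or at </title> if there is no comma.
FIRST_PART = re.compile(r"(?<=<title>)(.*?)(?=,|\s*</title>)")
first_part = FIRST_PART.search(html).group(1)

# Option 2: capture the full title, then remove commas in a second step.
FULL_TITLE = re.compile(r"(?<=<title>)(.*?)(?=\s*</title>)")
full_title = FULL_TITLE.search(html).group(1)
no_commas = full_title.replace(",", "")

print(first_part)
print(no_commas)
```

If the tool only accepts a single regex with no post-processing step, option 1 is the closest you can get; option 2 needs some scripting step after the extraction.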