Extract part of a regex match


I want a regular expression to extract the title from an HTML page. Currently I have this:

title = re.search('<title>.*</title>', html, re.IGNORECASE).group()
if title:
    title = title.replace('<title>', '').replace('</title>', '') 

Is there a regular expression that extracts just the contents of <title>, so I don't have to strip the tags afterwards?


Solution

  • Use parentheses ( ) in the regex to define a capturing group, and group(1) in Python to retrieve the captured string. re.search returns None when it finds no match, so check the result before calling group() on it:

    title_search = re.search('<title>(.*)</title>', html, re.IGNORECASE)
    
    if title_search:
        title = title_search.group(1)
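
As a rough, self-contained sketch of the same idea, the snippet below wraps the search in a helper. The function name extract_title and the sample strings are made up for illustration and are not part of the original answer. It also assumes two small variations: a non-greedy group (.*?), so the match stops at the first closing tag, and re.DOTALL, so a title that spans several lines still matches.

```python
import re

# Hypothetical helper name, chosen only for this illustration.
def extract_title(html):
    """Return the text inside the first <title> element, or None if absent."""
    # (.*?) is a non-greedy capturing group; re.DOTALL lets '.' match newlines.
    match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # group(1) is the text captured by the parentheses, without the tags.
    return match.group(1).strip()

# Made-up sample input.
sample = "<html><head><TITLE>Example page</TITLE></head><body></body></html>"
print(extract_title(sample))                  # Example page
print(extract_title("<p>no title here</p>"))  # None
```

This pattern only matches a bare <title> tag with no attributes. For arbitrary real-world pages, an HTML parser such as the standard library's html.parser is likely to be more robust than a regex.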