regex sed non-greedy

Sed Regular Expression affecting content after the Regex


I have an HTML file containing the following text:

<!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>

And I run this sed command against it:

sed -i -e "s:<base href\s*=\s*\".*\"\s*>:<base href=\"/apps/test/\">:g" /tmp/test/index.html

I'd expect that to replace only <base href="/"> with <base href="/apps/test/"> and leave the rest alone, but it ends up affecting content after the match:

 <!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/apps/test/"></head><body>test</body></html>

It ended up removing the entire meta tag found after the regex. Am I just not doing the regex right?

GNU sed version 4.2.1

Solution

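  • Because * is greedy, the .* in =\s*\".*\"\s*> extends as far as it can: it runs to the last " on the line that is followed by optional whitespace and >. Here that is the closing quote of the viewport meta tag's content attribute, so the whole meta tag becomes part of the match.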

    You can use single quotes around your command so you don't have to use \" for double quotes. Then, instead of ".*", you can use "[^"]*", which only matches to the next double quote.

    This turns your command into

    sed 's:<base href\s*=\s*"[^"]*"\s*>:<base href="/apps/test/">:g'
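
To see the difference side by side, here is a small sketch that runs both patterns against the line from the question. The variable names (html, greedy, fixed) are only for illustration, and [[:space:]] is used as the POSIX spelling of GNU sed's \s so the demo does not depend on a GNU extension.

```shell
#!/bin/sh
# The HTML line from the question, kept in a variable for the demo.
html='<!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>'

# Greedy: ".*" runs to the LAST double quote that is followed by optional
# whitespace and ">", so the viewport meta tag is swallowed by the match.
greedy=$(printf '%s\n' "$html" \
    | sed 's:<base href[[:space:]]*=[[:space:]]*".*"[[:space:]]*>:<base href="/apps/test/">:g')

# Negated class: "[^"]*" cannot cross a double quote, so the match stops at
# the closing quote of the href value.
fixed=$(printf '%s\n' "$html" \
    | sed 's:<base href[[:space:]]*=[[:space:]]*"[^"]*"[[:space:]]*>:<base href="/apps/test/">:g')

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"
```

The greedy line should reproduce the damaged output from the question, while the fixed line keeps the viewport meta tag.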
    

    However, manipulating HTML with sed and regexes is eternally brittle and will break at the first possible opportunity. You could use an XML/HTML parser such as xmllint (see Roman's answer); an alternative is the W3C HTML-XML-utils with its hxpipe and hxunpipe commands.

    These commands parse your HTML and turn it into a format easily processed with sed, awk & friends, then turn it back into HTML:

    $ hxpipe infile.html
    !html "" 
    (html
    (head
    Acharset CDATA utf-8
    (meta
    (title
    -Test
    )title
    Ahref CDATA /
    (base
    Aname CDATA viewport
    Acontent CDATA width=device-width,initial-scale=1
    (meta
    )head
    (body
    -test
    )body
    )html
    -\n
    

    So to turn the / in the href of the base tag into /apps/test/, we could do this:

    $ hxpipe infile.html \
        | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|}' \
        | hxunpipe
    <!DOCTYPE html><html><head><meta charset="utf-8"><title>Test</title><base href="/apps/test/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>
    

    where the sed command

    sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|}'
    

    or, more readably,

    /Ahref CDATA/ {                                         # If line matches this
        N                                                   # Append next line
        /\n(base$/ s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|   # If in base tag, replace href
    }
    

    makes your change in a more or less robust fashion. The \(\n\) group matters: after N the pattern space holds both lines, and . also matches the embedded newline, so an unbounded .* would swallow the (base line and the href would end up on the following meta tag.
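
One detail of N-based scripts like this is worth checking by hand: once N has joined two lines, . matches the embedded newline as well, so a trailing .* can run into the second line. The sketch below compares an unbounded and a newline-bounded substitution on three lines copied from the hxpipe listing above, so hxpipe itself is not needed. The variable names (input, unbounded, bounded) are made up for the demo.

```shell
#!/bin/sh
# Three lines in hxpipe's output format, copied from the listing above.
input='Ahref CDATA /
(base
Aname CDATA viewport'

# Unbounded: after N the pattern space is "Ahref CDATA /\n(base", and ".*"
# matches across the newline, so the "(base" line is deleted.
unbounded=$(printf '%s\n' "$input" \
    | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*|\1 /apps/test/|;}')

# Bounded: capture the newline and put it back, so only the attribute
# line changes and "(base" survives.
bounded=$(printf '%s\n' "$input" \
    | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|;}')

printf 'unbounded:\n%s\n\n' "$unbounded"
printf 'bounded:\n%s\n' "$bounded"
```

With the unbounded form the (base line disappears, which is how an href can end up attached to the next element once hxunpipe rebuilds the HTML.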