I have an HTML file containing the following text:
<!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>
And I run this sed command against it:
sed -i -e "s:<base href\s*=\s*\".*\"\s*>:<base href=\"/apps/test/\">:g" /tmp/test/index.html
I'd expect that to replace just <base href="/"> with <base href="/apps/test/"> and leave the rest alone, but it also affects content after the intended match:
<!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/apps/test/"></head><body>test</body></html>
It removed the entire meta tag that follows the base tag. Am I writing the regex wrong?
GNU sed version 4.2.1
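Because * is greedy, the .* in =\s*\".*\"\s*> matches as much as it can: it runs to the last double quote on the line that is followed by > (here, the one closing the content attribute of the meta tag), not to the first one.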
You can use single quotes around your command so you don't have to write \" for double quotes. Then, instead of ".*", you can use "[^"]*", which only matches up to the next double quote.
This turns your command into
sed 's:<base href\s*=\s*"[^"]*"\s*>:<base href="/apps/test/">:g'
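To see the difference side by side, here is a small sketch that runs both patterns on the HTML from the question. It is an illustration under a few assumptions, not the exact command from above: the delimiter is | instead of : and \s is spelled [[:space:]] so that it does not depend on GNU extensions, and the variable names (html, greedy, fixed) are made up for the demo.

```shell
#!/bin/sh
# Sample input copied from the question.
html='<!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>'

# Greedy version: ".*" runs to the LAST double quote that is followed by ">".
greedy=$(printf '%s\n' "$html" \
  | sed 's|<base href[[:space:]]*=[[:space:]]*".*"[[:space:]]*>|<base href="/apps/test/">|g')

# Negated bracket expression: [^"]* cannot cross a double quote,
# so the match ends at the quote that closes the href value.
fixed=$(printf '%s\n' "$html" \
  | sed 's|<base href[[:space:]]*=[[:space:]]*"[^"]*"[[:space:]]*>|<base href="/apps/test/">|g')

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"
```

The greedy line loses the viewport meta tag, as in the question; the fixed line keeps it.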
However, manipulating HTML with sed and regexes is brittle and tends to break as soon as the markup changes shape. You could use an XML/HTML parser such as xmllint (see Roman's answer); an alternative is the W3C HTML-XML-utils with their hxpipe and hxunpipe commands.
These commands parse your HTML and turn it into a format easily processed with sed, awk & friends, then turn it back into HTML:
$ hxpipe infile.html
!html ""
(html
(head
Acharset CDATA utf-8
(meta
(title
-Test
)title
Ahref CDATA /
(base
Aname CDATA viewport
Acontent CDATA width=device-width,initial-scale=1
(meta
)head
(body
-test
)body
)html
-\n
So, to turn the / in the href for the base tag into /apps/test/, we could do this:
$ hxpipe infile.html \
| sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|}' \
| hxunpipe
<!DOCTYPE html><html><head><meta charset="utf-8"><title>Test</title><base href="/apps/test/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>
where the sed command
sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|}'
or, more readably,
/Ahref CDATA/ {                                         # If line matches this
    N                                                   # Append next line
    /\n(base$/ s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|   # If in base tag, replace href
}
makes your change in a more or less robust fashion. The substitution matches up to and including the embedded newline and puts it back; in GNU sed a bare .* would also match the newline and swallow the (base line, which moves the href onto the following meta tag.
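If you want to try the sed step without hxpipe installed, here is a minimal sketch that feeds it a hand-written fragment in the same line format. The first two lines are taken from the listing above; the second pair (an href on an a element) is a hypothetical addition to show that the /\n(base$/ guard leaves other href attributes alone. The substitution captures the newline and writes it back so that the element line after the attribute survives, and the ; before } is only there for seds that require it.

```shell
#!/bin/sh
# hxpipe-style fragment: attribute lines come before the element they belong to.
# The "(a" pair is hypothetical, added to exercise the guard.
piped='Ahref CDATA /
(base
Ahref CDATA /other
(a'

result=$(printf '%s\n' "$piped" \
  | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*\(\n\)|\1 /apps/test/\2|;}')

printf '%s\n' "$result"
```

Only the href that is directly followed by (base is rewritten; the one belonging to (a passes through unchanged.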