regex bash shell sed

Append charset=utf-8 to a line that contains "text/<any text>" where it doesn't already exist


I'm trying to use a regex in a bash script to add charset=utf-8 to every line that contains "text/whatever" and doesn't already have it, so that each such line ends up looking exactly like "text/whatever; charset=utf-8".

I've tried the following and many iterations of it: '/mimetype.assign/,/)/ s/\(text\/.*\)\(.*\)\(charset=utf-8\)\{0\}/\1\2 charset=utf-8\"\,/'

The output I keep getting from this and its variations is not correct. This is some of the output I get when testing on sed.js.org:

 ".png"          =>      "image/png",
  ".xbm"          =>      "image/x-xbitmap",
  ".xpm"          =>      "image/x-xpixmap",
  ".xwd"          =>      "image/x-xwindowdump",
  ".css"          =>      "text/css; charset=utf-8", charset=utf-8",
  ".html"         =>      "text/html", charset=utf-8",
  ".htm"          =>      "text/html", charset=utf-8",
  ".js"           =>      "text/javascript", charset=utf-8",
  ".asc"          =>      "text/plain;charset=utf-8", charset=utf-8",
  ".c"            =>      "text/plain;charset=utf-8", charset=utf-8",
  ".cpp"          =>      "text/plain;charset=utf-8", charset=utf-8",
  ".log"          =>      "text/plain;charset=utf-8", charset=utf-8",
  ".conf"         =>      "text/plain;charset=utf-8", charset=utf-8",
  ".text"         =>      "text/plain;charset=utf-8", charset=utf-8",
  ".txt"          =>      "text/plain;charset=utf-8", charset=utf-8",
  ".spec"         =>      "text/plain;charset=utf-8", charset=utf-8",
  ".dtd"          =>      "text/xml", charset=utf-8",
  ".xml"          =>      "text/xml", charset=utf-8",
  ".mpeg"         =>      "video/mpeg",
  ".mpg"          =>      "video/mpeg",

You'll notice that it adds the charset to the lines that don't have it, but the result is malformed. It also adds it to the lines that already have it, and those come out malformed as well. Maybe I'm overcomplicating it and there's a simpler regex command I can use.

The example below is a chunk of the file I'm trying to change. You'll notice that some of the lines that contain "text/whatever" are already configured properly:

##  MimeType handling
## -------------------
##
## Use the "Content-Type" extended attribute to obtain mime type if
## possible
##
mimetype.use-xattr        = "disable" 

##
## mimetype mapping
##
mimetype.assign             = (
  ".ogg"          =>      "application/ogg",
  ".wav"          =>      "audio/x-wav",
  ".gif"          =>      "image/gif",
  ".jpg"          =>      "image/jpeg",
  ".jpeg"         =>      "image/jpeg",
  ".png"          =>      "image/png",
  ".xbm"          =>      "image/x-xbitmap",
  ".xpm"          =>      "image/x-xpixmap",
  ".xwd"          =>      "image/x-xwindowdump",
  ".css"          =>      "text/css; charset=utf-8",
  ".html"         =>      "text/html",
  ".htm"          =>      "text/html",
  ".js"           =>      "text/javascript",
  ".asc"          =>      "text/plain;charset=utf-8",
  ".c"            =>      "text/plain;charset=utf-8",
  ".cpp"          =>      "text/plain;charset=utf-8",
  ".log"          =>      "text/plain;charset=utf-8",
  ".conf"         =>      "text/plain;charset=utf-8",
  ".text"         =>      "text/plain;charset=utf-8",
  ".txt"          =>      "text/plain;charset=utf-8",
  ".spec"         =>      "text/plain;charset=utf-8",
  ".dtd"          =>      "text/xml",
  ".xml"          =>      "text/xml",
  ".mpeg"         =>      "video/mpeg",
 
# make the default mime type application/octet-stream.
  ""              =>      "application/octet-stream",

Thanks in advance for the input.


Solution

  • Use ! after the address specification to exclude lines that already contain charset=utf-8.

    Then change your pattern so it stops matching at the " character.

    sed '/charset=utf-8/!s#\("text/[^;"]*\)[^"]*#\1; charset=utf-8#'
    

    Explanation:

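    In the replacement string, \1 is replaced with the part of the line captured by the \(...\) group: the opening " and text/ followed by everything up to the next ; or ". (& would instead stand for the entire match.) The trailing [^"]* consumes anything left before the closing ", so the closing quote and the comma stay in place.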
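
To check the behaviour, here is a minimal sketch that runs the command above against a few lines copied from the question. The helper name add_charset and the variable names sample and result are only illustrative, not part of the original answer.

```shell
#!/bin/sh
# A few lines from the question's config chunk, used as sample input.
sample='  ".png"          =>      "image/png",
  ".css"          =>      "text/css; charset=utf-8",
  ".html"         =>      "text/html",
  ".asc"          =>      "text/plain;charset=utf-8",
  ".xml"          =>      "text/xml",'

# add_charset is a hypothetical wrapper around the sed command from the answer.
# /charset=utf-8/!  -> run the substitution only on lines WITHOUT charset=utf-8
# \("text/[^;"]*\)  -> capture "text/... up to the next ; or "
# [^"]*             -> swallow anything else before the closing quote
add_charset() {
  sed '/charset=utf-8/!s#\("text/[^;"]*\)[^"]*#\1; charset=utf-8#'
}

result=$(printf '%s\n' "$sample" | add_charset)
printf '%s\n' "$result"
```

With this input, the .html and .xml lines gain "; charset=utf-8" before the closing quote, while the .png line and the lines that already carry a charset pass through unchanged. Note that entries written as "text/plain;charset=utf-8" (no space) are skipped rather than normalized.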