regex linux awk sed gnu-sed

Regex for whitespace delimiter except for [ and ] characters


I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky.

I want to replace every run of whitespace with a comma, except the whitespace inside "..." and [...].

I used the regex ("[^"]*"|\S+)\s+ but it split the [06/Jan/2021:17:50:09 +0300] part of my log line into two fields.

Here is the entire log line:

[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""

This is the result I get from my regex with sed (replacing whitespace with a comma):

[06/Jan/2021:17:50:09,+0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

And this is the result I want:

[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""

Solution

  • Since the sample input looks like log lines, and assuming they always have the same format, you could try the following awk code. It was written and tested on the shown sample with GNU awk (FPAT is a GNU awk extension, so other awk implementations will not honor it).

    awk -v FPAT='[^]]*\\]|"[^"]*"|([0-9]+\\.){3}[0-9]+|[0-9]{2,4}' -v OFS="," '{$1=$1} 1'  Input_file
    

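    Explanation: FPAT describes what a field looks like instead of what separates fields. Setting OFS="," and assigning $1=$1 makes awk rebuild the line with commas between the matched fields, and the trailing 1 prints it.

    Explanation of the regex: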

    [^]]*\\]               ##Match everything up to and including the first ] (the bracketed timestamp).
    |                      ##OR
    "[^"]*"                ##Match a double-quoted string: from a " to the next ", quotes included.
    |                      ##OR
    ([0-9]+\\.){3}[0-9]+   ##Match an IPv4-style address: digits followed by a dot, 3 times, then digits.
    |                      ##OR
    [0-9]{2,4}             ##Match a run of 2 to 4 digits (407, 4245 and 75 in the sample).
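
    Since the question used sed, here is a minimal sketch of the same idea there. The original regex split the timestamp because it had no alternative for a bracketed group, so \S+ stopped at the space inside [...]. Adding \[[^]]*\] as one more alternative should keep that group together. The function name to_csv is only a hypothetical helper for illustration, and the sketch assumes that brackets and quotes are never nested or escaped and that the line has no leading or trailing whitespace (trailing whitespace would become a trailing comma).

```shell
#!/bin/sh
# to_csv is a hypothetical helper name, not part of the original answer.
# It replaces each whitespace run that follows a complete token with a comma.
# A token is one of: a [...] group, a "..." string, or a run of non-space
# characters. POSIX leftmost-longest matching prefers the whole group or
# quoted string over the shorter non-space run.
to_csv() {
    sed -E 's/(\[[^]]*\]|"[^"]*"|[^[:space:]]+)[[:space:]]+/\1,/g'
}

# The sample log line from the question.
line='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'

printf '%s\n' "$line" | to_csv
```

    Unlike the FPAT version, this does not enumerate the numeric field shapes, so it should also cope with fields of other lengths, as long as the assumptions above hold.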