I consider myself pretty good with regular expressions, but this one is proving surprisingly tricky.
I want to trim all whitespace, except the whitespace between "" and [] characters.
I used this regex: ("[^"]*"|\S+)\s+
but it split the [06/Jan/2021:17:50:09 +0300] part of my log line into two blocks.
Here is my entire log line:
[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""
This is the result I get from my regex with the sed command (replacing whitespace with a comma):
[06/Jan/2021:17:50:09,+0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
Finally, this is the result that I want:
[06/Jan/2021:17:50:09 +0300],"",10.139.3.194,407,"CONNECT clients5.google.com:443 HTTP/1.1","","-","",4245,75,"","","81",""
Since the sample input looks like log lines, and assuming they will always be in the same format, you could try the following awk code, written and tested on the shown samples in GNU awk.
awk -v FPAT='[^]]*\\]|"[^"]*"|([0-9]+\\.){3}[0-9]+|[0-9]{2,4}' -v OFS="," '{$1=$1} 1' Input_file
Explanation:
Using GNU awk here, which has the FPAT option available.
Setting OFS (output field separator) to , for all lines.
In the main awk program, resetting the line (by reassigning the 1st field) so that the OFS value is applied to it, as per the OP's requirement. This makes sure that commas appear in the output only where they are needed.
Explanation of regex:
[^]]*\\] ##Matching everything till ] followed by ] here.
| ##OR
"[^"]*" ##Matching from " till first occurrence of " everything between them including "
| ##OR
([0-9]+\\.){3}[0-9]+ ##Matching digits followed by a dot, 3 times, followed by digits (an IPv4 address such as 10.139.3.194).
| ##OR
[0-9]{2,4} ##Matching 2 to 4 digits here.
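The same alternation idea can also be carried back to the sed approach from the question, by adding a [...] alternative in front of the quoted one. This is only a minimal sketch, assuming bracketed and quoted fields are well formed (no nested brackets, no escaped quotes) and that sed supports -E (GNU and BSD sed do). The function name to_csv is a hypothetical helper, not part of the original answer.

```shell
# Hypothetical helper: replace the whitespace after each field with a comma.
# A field is tried as [...] first, then "...", then any run of non-whitespace.
to_csv() {
  sed -E 's/(\[[^]]*]|"[^"]*"|[^[:space:]]+)[[:space:]]+/\1,/g'
}

# Sample log line copied from the question.
line='[06/Jan/2021:17:50:09 +0300] "" 10.139.3.194 407 "CONNECT clients5.google.com:443 HTTP/1.1" "" "-" "" 4245 75 "" "" "81" ""'

printf '%s\n' "$line" | to_csv
```

On the sample line this should keep the timestamp and the quoted request as single fields. Unlike the FPAT pattern, it does not depend on the digit counts of the numeric fields, but it does assume the separators are plain whitespace.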