I have a file which contains logs from the web; a simplified version of it is as follows:
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Aix
SCO
I have tried a few regexes with awk/sed to match the Accept-Language value, which appears at the beginning of each record:
/^[a-z]{2}(-[A-Z]{2})?/
/\*|[A-Z]{1,8}(-[A-Z0-9]{1,8})*/i
/([^-;]*)(?:-([^;]*))?(?:;q=([0-9]\.[0-9]))?/
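So far I haven't managed to get either awk or sed to produce the following result, where every line that does not start with an Accept-Language value is appended to the preceding line: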
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Unix Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; START Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Aix SCO
Any help is appreciated. The file contains over a million records, so I'm happy to use something other than sed/awk if it performs better.
$ awk '/[a-z]{2}-[A-Z]{2}/ { print b; b=$0; next } # line contains xx-XX: print the buffer, then refill it with this line
{ b=b OFS $0 } # any other line: append it to the buffer
END { print b }' file # print whatever is left in the buffer at the end
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Unix Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; START Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Aix SCO
The output starts with an empty line, because the buffer is still empty when the first matching line triggers a print. If you want the joined fields separated by tabs instead of spaces, set OFS: awk -v OFS="\t" ...
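The leading empty line can be avoided by printing the buffer only once it actually holds a record. The following is a minimal sketch of that variant, not a definitive solution: it assumes every record starts with a two-letter language code, a hyphen and a two-letter region code (as in the sample), and the names `join_records`, `buf` and `have` are chosen here for illustration. It also anchors the pattern with `^` and spells out `[a-z][a-z]` instead of `{2}`, since some awk implementations do not support interval expressions.

```shell
# join_records [FILE...]: append continuation lines to the preceding record.
# Reads standard input when no file is given. (Illustrative helper name.)
join_records() {
  awk '
    /^[a-z][a-z]-[A-Z][A-Z]/ {       # this line starts a new record
      if (have) print buf            # flush the previous record, if there is one
      buf = $0; have = 1
      next
    }
    {                                # continuation line: append it to the buffer
      buf = (have ? buf OFS : "") $0
      have = 1
    }
    END { if (have) print buf }      # flush the last record
  ' "$@"
}

# Example run on a shortened version of the sample data:
printf '%s\n' \
  'en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;' \
  'en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;' \
  'Unix' \
  'Linux' \
  'en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;' \
  'START' \
  'Solaris' | join_records
```

On the real file this would be called as `join_records file`. The script still makes a single pass and keeps only one record in memory, so it should behave much like the original on a file with a million or more lines.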