regex file awk sed http-accept-language

Merge lines which don't match a regex


I have a file which contains logs from the web; a simplified version of it is as follows:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Aix
SCO

I have tried a few regexes with awk and sed to identify the Accept-Language value, which begins each record:

/^[a-z]{2}(-[A-Z]{2})?/
/\*|[A-Z]{1,8}(-[A-Z0-9]{1,8})*/i  
/([^-;]*)(?:-([^;]*))?(?:;q=([0-9]\.[0-9]))?/

So far I haven't managed to get either awk or sed to produce the following result:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Unix    Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    START    Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Aix    SCO

Any help is appreciated. The file contains more than 1 million records, so I'm happy to use something other than sed/awk if it improves performance.


Solution

  • $ awk '/[a-z]{2}-[A-Z]{2}/ { print b; b=$0; next }  # on xx-XX: print buffer, refill
                               { b=b OFS $0 }           # otherwise append to buffer
                           END { print b }' file        # print the last buffer at the end
    
    en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
    en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
    en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Unix Linux
    en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; START Solaris
    en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Aix SCO
    

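    The output starts with an empty line, because the buffer b is still empty when the first matching line prints it. To separate the merged fields with tabs instead of spaces, set the output field separator: awk -v OFS="\t" ....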
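
If the leading empty line is a problem, a variant along the following lines may help. It is only a sketch under a few assumptions that are not in the original answer: the function name merge_lines and the variables buf and have are made up for illustration, the pattern is anchored with ^ on the assumption that every record starts with an xx-XX language tag, and [a-z][a-z] is spelled out instead of {2} because some older awk implementations do not support interval expressions.

```shell
# merge_lines: hypothetical helper; reads the named files, or stdin if none are given.
merge_lines() {
  awk -v OFS='\t' '
    /^[a-z][a-z]-[A-Z][A-Z]/ {        # assumed record start: xx-XX at line start
      if (have) print buf             # flush the previous record, if there is one
      buf = $0; have = 1
      next
    }
    {                                 # continuation line: append to the buffer
      buf = (have ? buf OFS $0 : $0)
      have = 1
    }
    END { if (have) print buf }       # print the last record, skip if input was empty
  ' "$@"
}

# Example call on a few sample lines (the sample data is invented):
printf '%s\n' 'en-GB,en;q=0.8    a;' 'Unix' 'Linux' 'en-GB,en    b;' | merge_lines
```

The have flag is what suppresses the empty first line: nothing is printed until a buffer has actually been filled. It is a single pass that holds only one record in memory, so it should cope with a file of a million or more lines, although I have not timed it.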