regex string sed

Remove commas within double quoted strings with sed


I have a string:

1,2,3,"test1,test2,test3",4,5,6

From it, I need to get an output that has no commas within double quotations like this:

1,2,3,"test1test2test3",4,5,6

I tried the following:

sed -i -e "s/,"([^,"]*)"//g" test.txt

but it returned the original text.


Solution

  • The immediate problem with your attempt is that you are mixing syntactic and literal quotes. But the shell doesn't know that; it regards them all as syntactic, and basically removes them. So you end up executing

    s/,   # originally quoted
    ([^,  # unquoted
    ]*)   # quoted
    //g" test.txt
    

    (all jammed together into one string) ... except the last part isn't even valid syntax because there's an opening double quote without a closing quote.
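    As a rough illustration of that quote removal (the string below is a simplified stand-in, not your actual sed script, and the variable name is made up), assigning a mixed-quote word to a variable and printing it shows what the shell actually keeps:

```shell
# The inner double quotes close and reopen the quoting; the shell strips
# them and concatenates the pieces into a single word.
mixed="s/,"X"//g"
printf '%s\n' "$mixed"    # prints: s/,X//g
```

    The double quotes you intended as part of the regex never reach the command.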

    Anyway, the simple fix is to switch the syntactic outer quotes to single quotes; then you can have literal double quotes inside that string.

    sed -e 's/,"([^,"]*)"//g' test.txt
    

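    However, that's still broken. In sed's default (basic) regular expression syntax, unescaped parentheses match literal parentheses, so this matches nothing in your input; and even with proper grouping, you would be replacing everything you matched (the comma, the quotes, and the non-commas between them) with nothing. You apparently meant something like

    sed -e 's/\(,"[^,"]*\),\([^"]*"\)/\1\2/g' test.txt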
    

    which says to match the opening quote after a comma, up through just before the first comma (call this the first capturing group); then a comma, ungrouped; then any non-quotes, and finally the closing double quote (second capture group). Replace all this with just the captured groups, i.e. remove the comma but keep the other text which you matched just to find the correct context for the comma you wanted to target.

    (Also, you haven't required the first quote to be the opening quote. It will work for your examples, but could break for more complex cases.)

    But because of this anchoring to find the comma in the correct context, you are only replacing one comma. Perhaps you were hoping that the /g flag would help with that, but it doesn't here. The /g flag replaces all occurrences on a line, but that means non-overlapping matches; it does not rescan text it has already matched. So your attempt (or rather, this attempt of mine based on yours) would remove the first quoted comma (and in your case, everything else that you matched), then skip ahead to the next instance of a comma followed by a quote and replace there, as long as more are found; it would not find more commas inside the quoted string it already matched.
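    To check this, here is a small sketch that runs a one-shot substitution with /g on the sample line. The regex is the two-group pattern described above, with grouping written as \(...\) as basic regular expressions require; the variable names are just for illustration.

```shell
input='1,2,3,"test1,test2,test3",4,5,6'

# Group 1 is the comma, the opening quote, and the text up to the first
# quoted comma; group 2 is the rest of the quoted string. The comma
# between the two groups is dropped.
once=$(printf '%s\n' "$input" |
  sed -e 's/\(,"[^,"]*\),\([^"]*"\)/\1\2/g')

printf '%s\n' "$once"    # prints: 1,2,3,"test1test2,test3",4,5,6
```

    Only the first quoted comma is gone. The second one sits inside text the match already consumed, so /g never revisits it.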

    sed doesn't easily allow you to say "find a quoted string, and then only replace commas within that" (it can be done, just like you could write the game of Tetris in any Turing-complete language; but that doesn't mean it will be fun or easy).

    Ultimately, if you are stuck with sed, perhaps replace one comma inside double quotes at a time until there are none left.

    sed '
    # anchor for loop
    :0
    # is there a quoted comma?
    /^\([^"]*,"[^,"]*\),/{
      # yes, replace it, but keep everything before it
      s//\1/
      # loop back to 0
      b0
    }
    # nope, we are done
    ' test.txt
    

    Demo: https://ideone.com/VFnkNs

    As per your requirements, this targets the first quoted string, and doesn't attempt to find commas in any subsequent quoted strings. It would not be hard to extend it to do that; basically, allow zero or more repetitions of unquoted values or quoted strings without commas in them before the quoted string in which you actually perform the substitution (and again, remember to capture this text and replace it with itself, rather than remove it!). Making sure you anchor to the beginning of the string is how you ensure that you are able to distinguish between opening and closing double quotes.
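    One possible way to write that extension, as a sketch rather than a thoroughly tested general solution: it assumes the quotes on each line are balanced and that fields contain no escaped or doubled quotes. The sample input and variable names are made up for the demonstration.

```shell
input='1,"a,b",2,"c,d,e",3'

result=$(printf '%s\n' "$input" | sed '
# anchor for loop
:0
# from the start of the line, skip any number of complete quoted strings
# which contain no commas, then look for a comma in the next quoted string
/^\(\([^"]*"[^,"]*"\)*[^"]*"[^,"]*\),/{
  # remove that comma, but keep everything before it
  s//\1/
  # loop back to 0
  b0
}
')

printf '%s\n' "$result"    # prints: 1,"ab",2,"cde",3
```

    Because the match always starts at the beginning of the line and consumes quotes in pairs, the quote just before the removed comma is always an opening one.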

    Finally, probably avoid the -i option until you have a solution which you have verified produces the correct output.
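    A cautious workflow along those lines might look like the following sketch. The scratch directory and the test.txt.new name are assumptions for the demonstration, and mktemp -d is widely available but not guaranteed everywhere.

```shell
# Work in a scratch directory so that no real file is touched.
workdir=$(mktemp -d)
printf '%s\n' '1,2,3,"test1,test2,test3",4,5,6' > "$workdir/test.txt"

# Write the result to a new file instead of editing in place.
sed '
:0
/^\([^"]*,"[^,"]*\),/{
  s//\1/
  b0
}
' "$workdir/test.txt" > "$workdir/test.txt.new"

# Inspect the changes; diff exits non-zero when the files differ.
diff "$workdir/test.txt" "$workdir/test.txt.new" || true

# Only once the output looks right, replace the original.
mv "$workdir/test.txt.new" "$workdir/test.txt"
final=$(cat "$workdir/test.txt")
rm -r "$workdir"
```

    If the diff shows anything unexpected, the original file is still untouched at that point.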

    If you aren't stuck with sed, here's a simple Awk script which works by setting the field separator to double quote, then removing commas in all even-numbered fields.

    awk -F '"' 'BEGIN { OFS=FS }
      { for(i=2; i<=NF; i+=2)
        gsub(/,/, "", $i) }1' file.txt
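    To see why the even-numbered fields are the quoted ones, this sketch prints each field after splitting on double quotes. The sample line and the variable name are invented for illustration.

```shell
fields=$(printf '%s\n' '1,"a,b",2,"c,d",3' |
  awk -F '"' '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }')

printf '%s\n' "$fields"
# prints:
# 1: 1,
# 2: a,b
# 3: ,2,
# 4: c,d
# 5: ,3
```

    Fields 2 and 4 hold the text between each pair of quotes, so removing commas from only those fields leaves the separators between values alone.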
    

    If you have GNU Awk, you can also use -i inplace to write results back to the input file, similarly to how the (also nonstandard) -i option of sed works.

    Demo: https://ideone.com/Yvoe00