Find and replace curly quotes inside a character class

I'm getting strange results when I use a character class to find curly quotes and replace them with another character:

sed -E "s/[‘’]/'/g" in.txt > out.txt

in.txt:  ‘foo’
out.txt: '''foo'''

If you use the letter a as the replacement, you get aaafooaaa. This only happens when the curly quotes are inside a character class; alternation works:

sed -E "s/(‘|’)/'/g" in.txt > out.txt

in.txt:  ‘foo’
out.txt: 'foo'

Can anyone explain what's going on here? Can I still use a character class for curly quotes?

Solution

Your string is using a multibyte encoding, specifically UTF-8; the curly quotes are three bytes each. But your sed implementation is treating each byte as a separate character. This is probably due to your locale settings. I can reproduce your problem by setting my locale to "C" (the old default POSIX locale, which assumes ASCII):

$ LC_ALL=C sed -E "s/[‘’]/'/g" <<<'‘foo’' # C locale, single-byte chars
'''foo'''

But in my normal locale of en_US.UTF-8 ("US English encoded with UTF-8"), I get the desired result:

$ LC_ALL=en_US.UTF-8 sed -E "s/[‘’]/'/g" <<<'‘foo’' # UTF-8 locale, multibyte chars
'foo'

The way you're running it, sed doesn't see [‘’] as a sequence of four characters but as eight bytes, each treated as a character. In UTF-8, ‘ is the bytes E2 80 98 and ’ is E2 80 99. So each of the six bytes between the brackets (or rather, each of the four unique values found in those bytes: E2, 80, 98 and 99) is considered a member of the character class, and each matching byte is separately replaced by the apostrophe. That is why your three-byte curly quotes are getting replaced by three apostrophes each.
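You can check the byte-level view yourself. This is a minimal sketch, assuming the script is saved as UTF-8 and that od, xargs and sed are available; the variable names are only for illustration.

```shell
# Dump the bytes of the two curly quotes as hex.
# xargs is used here only to normalize od's whitespace.
bytes=$(printf '%s' '‘’' | od -An -tx1 | xargs)
echo "$bytes"    # e2 80 98 e2 80 99

# Reproduce the byte-wise match: in the C locale each of the three
# bytes of a curly quote is a member of the class and is replaced.
c_out=$(printf '%s\n' '‘foo’' | LC_ALL=C sed -E "s/[‘’]/'/g")
echo "$c_out"    # '''foo'''
```

Six bytes, four distinct values, and three apostrophes per quote, which matches the output in the question.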

The version that uses alternation works because each alternative can be more than one character long. sed is still treating ‘ and ’ as three-character sequences instead of individual characters, but each sequence is matched and replaced as a whole, so the result is the same.
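To confirm that alternation does not depend on the locale, you can force the C locale and run it again. A small sketch under the same assumptions as the question (UTF-8 input, a sed that supports -E); the variable name is hypothetical.

```shell
# Even with single-byte characters, each alternative is a fixed
# three-byte sequence, so one quote yields one apostrophe.
alt_out=$(printf '%s\n' '‘foo’' | LC_ALL=C sed -E "s/(‘|’)/'/g")
echo "$alt_out"    # 'foo'
```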

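So make sure your locale matches your text encoding (for UTF-8 text, a UTF-8 locale such as en_US.UTF-8) and see if that resolves your issue. With the right locale, the character class works as intended.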
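One way to inspect your settings is the POSIX locale utility. This is a rough sketch, not a definitive recipe: the locale name en_US.UTF-8 is an assumption, and the names reported on your system may be spelled differently.

```shell
# Print the character encoding implied by the current locale settings.
# A UTF-8 locale reports "UTF-8"; the C/POSIX locale reports an ASCII
# name whose exact spelling differs between systems.
current_charmap=$(locale charmap)
echo "current:  $current_charmap"

c_charmap=$(LC_ALL=C locale charmap)
echo "C locale: $c_charmap"

# List the installed UTF-8 locales, if any.
locale -a 2>/dev/null | grep -i 'utf' || echo "no UTF-8 locales listed"

# Apply a UTF-8 locale to a single command only. "en_US.UTF-8" is an
# assumption; substitute a name from the list printed above.
printf '%s\n' '‘foo’' | LC_ALL=en_US.UTF-8 sed -E "s/[‘’]/'/g"
```

Setting LC_ALL on the command line affects only that one invocation, so it is a low-risk way to test before changing anything in your shell profile.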