unicode utf-8 ascii smart-quotes

Is there a category or name for characters like smart quotes and that dash that always breaks?


Many people have probably copied some text from Word into a web form, only to have all the single quotes ('), double quotes ("), and dashes (-) come out garbled. I believe the quotes are called "Smart Quotes" or "Typographer's Quotes", but I don't know the name of the dash. Is there a category that includes these characters? Are there more?

Distinguishing features of this category: the characters are accessible with a normal QWERTY keyboard, and each is easily mistaken visually for its ASCII equivalent.

This question seems to be dealing with the same issue: How do I convert Word smart quotes and em dashes in a string? Also, perhaps they are called "em dashes"?


Solution

  • The Unicode code space contains 1,114,112 code points (U+0000 through U+10FFFF), although far fewer are actually assigned to characters. My US-standard keyboard makes those that fall between 0 and 127 (base 10) reasonably easy to access.

    Beyond that range you get into either legacy locale-specific encodings or, more commonly today, UTF-8 (or another Unicode encoding). Many of these code points are easily accessible from a keyboard somewhere in the world, but from the comfort of your own home or office only a fairly small subset of those 1.1 million is easy to reach from your keyboard.

    There is a Unicode property called QMark (the short name), or Quotation_Mark (the long name), that includes 29 quotation-style code points (listed here as hexadecimal code point values, not UTF-8 bytes): 0x0022, 0x0027, 0x00ab, 0x00bb, 0x2018, 0x2019, 0x201a, 0x201b, 0x201c, 0x201d, 0x201e, 0x201f, 0x2039, 0x203a, 0x300c, 0x300d, 0x300e, 0x300f, 0x301d, 0x301e, 0x301f, 0xfe41, 0xfe42, 0xfe43, 0xfe44, 0xff02, 0xff07, 0xff62, and 0xff63.

    Here's how they look (assuming your fonts support them all):

    "'«»‘’‚‛“”„‟‹›「」『』〝〞〟﹁﹂﹃﹄＂＇｢｣

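As a rough illustration (not part of the original answer), the sketch below looks up the official Unicode name of each quotation-mark code point listed above, using Python's standard `unicodedata` module. The list is copied by hand from the answer; `QMARK_CODE_POINTS` and `describe` are names made up for this example, and the set of Quotation_Mark members may differ between Unicode versions.

```python
import unicodedata

# Code points copied by hand from the Quotation_Mark list in the answer above.
# (Assumption: this matches the Unicode version the answer was written against.)
QMARK_CODE_POINTS = [
    0x0022, 0x0027, 0x00AB, 0x00BB, 0x2018, 0x2019, 0x201A, 0x201B,
    0x201C, 0x201D, 0x201E, 0x201F, 0x2039, 0x203A, 0x300C, 0x300D,
    0x300E, 0x300F, 0x301D, 0x301E, 0x301F, 0xFE41, 0xFE42, 0xFE43,
    0xFE44, 0xFF02, 0xFF07, 0xFF62, 0xFF63,
]


def describe(code_points):
    """Return a (U+XXXX label, character, official name) tuple per code point."""
    return [
        (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp), "<unnamed>"))
        for cp in code_points
    ]


if __name__ == "__main__":
    # Print only the ASCII-safe label and name so this works in any terminal.
    for label, _char, name in describe(QMARK_CODE_POINTS):
        print(label, name)
```

Printing the names rather than the glyphs sidesteps the font problem: two entries that look identical on screen still have different labels and names.
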
    There happens to be a Unicode property ASCII, which not surprisingly contains 128 code points between 0 and 127.

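    I can't seem to find a Unicode property that specifies "Everything that is not ASCII", but you will know a code point is non-ASCII by virtue of the fact that it falls outside of the 0 .. 127 range. (In a Perl regex, the negated form \P{ASCII}, with a capital P, matches exactly those characters.)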
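The range test described above can be sketched in a few lines of Python. This is only an illustration; `non_ascii_chars` is a hypothetical helper name, and the sample string is invented to resemble text pasted from a word processor.

```python
def non_ascii_chars(text):
    """Return (index, character, U+XXXX label) for every character above 127."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 127  # anything outside 0..127 is not ASCII
    ]


if __name__ == "__main__":
    # Invented sample: a curly apostrophe and a pair of curly double quotes.
    sample = "It\u2019s a \u201csmart\u201d quote"
    for index, _ch, label in non_ascii_chars(sample):
        print(index, label)
```

If you only need a yes/no answer, Python 3.7 and later also provide `str.isascii()`.
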

    There is also a Hyphen Unicode property that contains eleven code points: 0x002d, 0x00ad, 0x058a, 0x1806, 0x2010, 0x2011, 0x2e17, 0x30fb, 0xfe63, 0xff0d, and 0xff65. I'm reluctant to paste them all here, as at least two of them don't render in my terminal. But here goes:

    -­֊᠆‐‑⸗・﹣－･

    As you can see, some are indistinguishable from others. When I use the Hyphen property in Perl 5.16 I get a warning that this particular Unicode property is deprecated. As far as I can tell the deprecation comes from Unicode itself (the Hyphen property has been marked deprecated since Unicode 6.0) rather than being specific to Perl.

    There is also a Dash property containing 27 code points; I think you get the idea, so I won't enumerate them here. Another, Dash_Punctuation (the general category Pd), has 23 code points. Note that a code point can belong to more than one Unicode property, so there may be overlap between Hyphen and Dash, and probably even more overlap between Dash and Dash_Punctuation. I don't know and haven't checked.
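One way to check part of that overlap is sketched below. It takes the Hyphen code points listed earlier and splits them by whether they carry the Dash_Punctuation general category (`Pd`). Python's standard `unicodedata` module exposes general categories but not the binary Dash or Hyphen properties, so this only covers the Dash_Punctuation side. The helper names are invented for the example, and results depend on the Unicode version bundled with your Python.

```python
import unicodedata

# Hyphen code points copied by hand from the list in the answer above.
HYPHEN_CODE_POINTS = [
    0x002D, 0x00AD, 0x058A, 0x1806, 0x2010, 0x2011,
    0x2E17, 0x30FB, 0xFE63, 0xFF0D, 0xFF65,
]


def is_dash_punctuation(ch):
    """True if the character's general category is Pd (Dash_Punctuation)."""
    return unicodedata.category(ch) == "Pd"


def split_by_dash_punctuation(code_points):
    """Partition code points into (in Pd, not in Pd), preserving order."""
    in_pd = [cp for cp in code_points if is_dash_punctuation(chr(cp))]
    not_pd = [cp for cp in code_points if not is_dash_punctuation(chr(cp))]
    return in_pd, not_pd


if __name__ == "__main__":
    in_pd, not_pd = split_by_dash_punctuation(HYPHEN_CODE_POINTS)
    print("Hyphen and Pd:    ", [f"U+{cp:04X}" for cp in in_pd])
    print("Hyphen but not Pd:", [f"U+{cp:04X}" for cp in not_pd])
```

The soft hyphen (U+00AD), for instance, is a format character rather than dash punctuation, so the two sets do not coincide.
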

    I know this isn't a Perl-centric question by any means, but I've found that Perl has pretty good documentation of the Unicode properties here: perldoc perluniprops.

    So I guess the short answer to the question, "Are there more?" is yes, there are about 1.1 million more.

    Update: Regarding what these pesky characters are called: you have to differentiate between code points and glyphs. A code point is the unambiguous identifier of a Unicode character, whereas a glyph is what that character looks like when rendered. Different fonts may draw the same code point differently, so two characters that look the same in one font may look a little different in another. Think of Unicode code points and their associated full names as carrying the semantic meaning, and of glyphs as simple (and unreliable) graphical representations.
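To make the code point versus glyph distinction concrete, here is a small sketch that prints the official names of a few dash-like characters that are easy to confuse on screen. The selection in `LOOKALIKES` is my own choice for illustration, not a list taken from the Unicode standard, and `code_point_name` is a hypothetical helper.

```python
import unicodedata

# A hand-picked set of dash-like characters that often look alike.
LOOKALIKES = ["\u002D", "\u2010", "\u2013", "\u2014", "\u2212"]


def code_point_name(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(code_point_name(ch))
```

The names (HYPHEN-MINUS, HYPHEN, EN DASH, EM DASH, MINUS SIGN) stay the same no matter which font draws the glyph, which is why the name is the safer way to identify a character pasted in from a word processor.
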

    Update 2: In some programming languages (specifically Perl, but possibly others) you may create custom character classes using set logic. In Perl, these are referred to as Extended Bracketed Character Classes, and are discussed in perldoc perlrecharclass. If you wanted to match all quotes that are not within the ASCII range, you could use this subexpression:

    (?[\p{QMark}-\p{ASCII}])
    

    The subexpression above creates a character class that matches all quote-like marks excluding those that come from the ASCII range. This feature was introduced in Perl version 5.18. Given that this "Update 2" was added in 2019, and Perl 5.18 was released in 2013, the feature has been available for roughly six years. Unfortunately I find no indication that it has found its way into the PCRE libraries outside of Perl.

    Though it has been around for six years already, this feature (as of Perl 5.28) is still marked 'experimental'. Therefore, to use it you should add the following pragma in the scope where it is used:

    no warnings qw(experimental::regex_sets);
    

    That will squelch the experimental warning. I would not be surprised to see that warning lifted in a near-future release of Perl.
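For languages whose regex engine lacks set operations on Unicode properties, the same idea can be approximated by doing the subtraction yourself and building an ordinary character class from the result. The Python sketch below assumes the hand-copied Quotation_Mark list from earlier in this answer, since the standard `re` module has no `\p{...}` support. All names in it are invented for the example.

```python
import re

# Quotation_Mark code points copied by hand from the list in the answer.
QMARK_CODE_POINTS = [
    0x0022, 0x0027, 0x00AB, 0x00BB, 0x2018, 0x2019, 0x201A, 0x201B,
    0x201C, 0x201D, 0x201E, 0x201F, 0x2039, 0x203A, 0x300C, 0x300D,
    0x300E, 0x300F, 0x301D, 0x301E, 0x301F, 0xFE41, 0xFE42, 0xFE43,
    0xFE44, 0xFF02, 0xFF07, 0xFF62, 0xFF63,
]

# Set difference: quotation marks minus the ASCII range (0..127).
NON_ASCII_QUOTES = sorted(set(QMARK_CODE_POINTS) - set(range(128)))

# Build a plain character class such as [<char><char>...] from the remainder.
NON_ASCII_QUOTE_RE = re.compile(
    "[" + "".join(re.escape(chr(cp)) for cp in NON_ASCII_QUOTES) + "]"
)


def find_non_ascii_quotes(text):
    """Return every non-ASCII quotation mark found in text, in order."""
    return NON_ASCII_QUOTE_RE.findall(text)


if __name__ == "__main__":
    sample = '"plain" \u201ccurly\u201d it\u2019s'
    print([f"U+{ord(ch):04X}" for ch in find_non_ascii_quotes(sample)])
```

This trades the convenience of the property lookup for portability: the list has to be maintained by hand if a later Unicode version adds quotation marks.
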