
Is there a character encoding or a markup language with a lettercase modifier?


Not sure if this fits here. If there is something like "Computer history and future" please direct me there.

Question

Since the rise of computers, were there any character encodings (or markup languages on top of them) that differentiate between uppercase and lowercase letters, not by defining the entire alphabet twice (once in capitals and once in lowercase), but by adding a modifier or keyword that specifies that a character is in a specific case?

Why Would Someone Do This?

Maybe to encode text in less space, or simply because the authors considered the choice between ABC and abc more cosmetic than meaningful, which brings me to a lengthy and philosophical background explanation (see the next section).


Skip everything from here if you are not interested in how I came up with this question.

Representation and Meaning

"Modern" encodings like ASCII and UTF-8 differentiate between uppercase and lowercase by assigning individual code points to each. This fundamental decision is so ubiquitous today, that concepts like case sensitivity appear rather natural to us. But when comparing Morse code, ASCII and Unicode, there are are a lot of distinctions that were traditionally stored in markup languages on top of the plain text encoding (e.g. rtf, tex, html, doc) but could be stored in plain text today:

Very old encodings like Braille and Morse code do not encode letter casing, but ASCII does. In fact, it forces you to pick either capitals or lowercase letters. There is no definitive default style if you don't care.

Unicode and its UTF encodings often continued on that route by forcing you to differentiate not only between letter cases, but also between regular, italic, and bold; sans-serif and serif; script, Fraktur, and more. But Unicode also supports modifiers. Instead of defining the entire alphabet again, only underlined/colored/..., there are combining characters that behave similarly to keywords in markup languages: a special (sequence of) code points indicates that the symbol it attaches to should be underlined, have a different color, and so on.
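
For instance (a small Python sketch of my own, using the real combining character U+0332 COMBINING LOW LINE, which follows the base letter it underlines), the underline modifier is just another code point that rides along with the letter:

    # Combining characters act like per-character modifiers: they follow the
    # base code point instead of requiring a second, pre-underlined alphabet.
    base = "Hello"
    underlined = "".join(ch + "\u0332" for ch in base)  # U+0332 COMBINING LOW LINE

    print(underlined)                  # renders as underlined where the font supports it
    print(len(base), len(underlined))  # 5 vs. 10 code points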

Unicode aims at encoding meaning, not representation. We have all these seemingly cosmetic variants in Unicode because they convey a different meaning to someone. However, the more "meaningful" distinctions are made, the more I get the feeling that standardizing meaning without representation is impossible. Some examples:

  • Purely cosmetic representation that became standardized meaning
  • Standardized meaning that changed based on the representation
  • An obscure mix of both

In an alternate universe ...

I wondered if history could have taken another turn, where people looked at these problems and thought: You know what? We cannot tell cosmetics and meaning apart. So let's try to create an encoding for the plainest of plain texts where you cannot even distinguish between uppercase and lowercase. Then add another encoding or markup language on top that offers tons of modifiers or keywords to express whatever cosmetics you like.

In such a world, "plain text" could mean something like "a sequence of 'regular' keystrokes", where computer keyboards send standardized and internationally unique scan codes.
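
Just to make the thought experiment concrete, here is a toy sketch (entirely hypothetical, with an invented "^" modifier) of how a caseless plain text plus a capitalization modifier could decode:

    # Entirely hypothetical scheme: the base text is caseless, and a modifier
    # symbol (here "^") marks the following letter as a capital.
    CAP = "^"

    def decode(caseless: str) -> str:
        out = []
        capitalize_next = False
        for ch in caseless:
            if ch == CAP:
                capitalize_next = True
            else:
                out.append(ch.upper() if capitalize_next else ch)
                capitalize_next = False
        return "".join(out)

    print(decode("^hello ^world"))  # -> "Hello World"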


Solution

  • were there any character encodings (or markup languages on top of them) that differentiate between uppercase and lowercase letters, not by defining the entire alphabet twice (once in capitals and once in lowercase), but by adding a modifier or keyword that specifies that a character is in a specific case?

    That's pretty much exactly how ASCII works. The letter "A" is the bit sequence 1x0 0001; the x selects which letter case you want. Similarly, 000 0001 is "Control-A". It's also no accident that 011 0001 is the digit 1 (the digit equivalent of "A"). The leading two bits of an ASCII sequence establish the kind of character the next five bits identify. The kind of modifier you're describing is sent in every byte. This is entirely on purpose: it allows extremely efficient hardware implementations for printing characters on a teletype.

    This can be very good for normalizing letters in search. You can just set bit 6 to 0 (or ignore bit 6), and then upper and lowercase letters are the same.
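
    As a small sketch (mine, not part of the standard, but relying only on the bit layout above):

        # ASCII: 'A' = 100 0001, 'a' = 110 0001, Ctrl-A = 000 0001, '1' = 011 0001.
        # Only one bit (0x20, bit 6 counting from 1) separates the two letter cases.
        for ch in ["A", "a", "\x01", "1"]:
            print(f"{ch!r}: {ord(ch):07b}")

        def fold(ch: str) -> str:
            # Clear bit 6 for letters; a real search routine would check ranges first.
            return chr(ord(ch) & ~0x20) if ch.isalpha() else ch

        print(fold("a") == fold("A"))  # True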

    In a different way, TTS (Teletypesetting) systems also had what you're describing. It was a modified Baudot code with an extra rail that allowed encoding of both upper-vs-lower case letters and standard-vs-italic.

    The key feature of Baudot code is that it shifts modes using LTRS and FIGS codes. (Be very careful when researching Baudot codes. "Lower case" generally means "uppercase letters" and "upper case" generally means figures. These harken back to the more literal meaning of "case.")

    6-unit TTS extended this by adding an additional rail allowing a "double-shift" to set letter-case and formatting (italic). This is very close to what you're describing.

    The big disadvantage of the "shift" approach is that it's not self-synchronizing. If you jump into the middle of a stream, you don't know how to display the characters because you don't know what mode you're in. So it's very nice to send modifiers directly in every code, but that makes the codes larger.
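
    A toy model of that failure mode in Python (a sketch with illustrative code values, not the real Baudot assignments):

        # Toy shifted code: the same code value means a letter in LTRS mode and a
        # figure in FIGS mode, so the decoder has to carry mode state.
        LTRS, FIGS = 0x1F, 0x1B
        letters = {0x01: "e", 0x02: "a"}
        figures = {0x01: "3", 0x02: "-"}

        def decode(stream, mode="LTRS"):
            out = []
            for code in stream:
                if code == LTRS:
                    mode = "LTRS"
                elif code == FIGS:
                    mode = "FIGS"
                else:
                    out.append((letters if mode == "LTRS" else figures)[code])
            return "".join(out)

        msg = [LTRS, 0x02, 0x01, FIGS, 0x01, 0x02]
        print(decode(msg))       # "ae3-"
        print(decode(msg[4:]))   # "ea" -- should be "3-", but the FIGS shift was missed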

    Many of the things you describe in your question about Unicode aren't quite there for the reasons you're suggesting. For example, π’œ is MATHEMATICAL SCRIPT CAPITAL A, which is expressly not for style but to convey specific semantic meaning. ("The characters in this block are intended for use only in mathematical or technical notation, and not in nontechnical text.") πŸ…° exists for backward compatibility with previous Japanese standards. It's not an indication that Unicode intends to encode cosmetics. ("Nearly all of the enclosed and square symbols in the Unicode Standard are considered compatibility characters, encoded for interoperability with other character sets.")
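
    You can check how Unicode itself classifies these characters with Python's unicodedata module (a quick sketch; the names and the compatibility fold come from the Unicode character database):

        import unicodedata

        for ch in ["A", "\U0001D49C", "\U0001F170"]:
            print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
        # U+0041 LATIN CAPITAL LETTER A
        # U+1D49C MATHEMATICAL SCRIPT CAPITAL A
        # U+1F170 NEGATIVE SQUARED LATIN CAPITAL LETTER A

        # Compatibility normalization folds the mathematical variant back to a plain "A".
        print(unicodedata.normalize("NFKC", "\U0001D49C"))  # A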

  • So let's try to create an encoding for the plainest of plain texts where you cannot even distinguish between uppercase and lowercase.

    This could be an entertaining hobby project, and possibly very educational. Best of luck with it. I would recommend studying the history and controversy around Han unification to get a feel for how complex these topics become in practice. Some starter questions just to think about:

    I would also study the history of UTF-16, and why UTF-8 has been so much more successful. Creating a new encoding that is not backward compatible with Latin-1 and requires substantially more space to store English is going to need some major advantages to actually be deployed. (See also IPv6.) But that impracticality should not dissuade you from exploring it.
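
    To put a rough number on that (my own quick check with Python's built-in codecs):

        text = "plain text"                    # pure ASCII sample
        print(text.encode("utf-8"))            # b'plain text' -- byte-for-byte identical to ASCII
        print(len(text.encode("utf-8")))       # 10 bytes
        print(len(text.encode("utf-16-le")))   # 20 bytes: twice the space for English text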