smtp 8-bit 7-bit

Does SMTP transfer 7bit or 8bit characters (clear MSB or not?)


My understanding is that the original SMTP protocol was defined to limit transmission to characters that use only 7 bits, in order to save on transmission costs.

This protocol is almost 40 years old, and since then multiple RFCs have extended the standards.

For compatibility reasons, many, if not most, modern servers that are 8bit clean still convert the message into a "7bit compatible" format such as quoted-printable or base64.

So technically, all the characters are 7bit ASCII.
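To illustrate what "7bit compatible" means here, the following is a minimal sketch using Python's standard `quopri` and `base64` modules. The sample body and the `is_7bit` helper are made up for illustration; the point is only that both encodings produce octets whose most significant bit is 0.

```python
import base64
import quopri

# Hypothetical message body containing non-ASCII characters (UTF-8 encoded).
body = "Café münü".encode("utf-8")

def is_7bit(data: bytes) -> bool:
    """Return True if every octet has its most significant bit clear."""
    return all(octet < 0x80 for octet in data)

qp = quopri.encodestring(body)   # quoted-printable transfer encoding
b64 = base64.b64encode(body)     # base64 transfer encoding

print(is_7bit(body))  # the raw UTF-8 body uses the high bit
print(is_7bit(qp))    # the quoted-printable form does not
print(is_7bit(b64))   # neither does the base64 form
```

Both encoded forms decode back to the original bytes, so nothing is lost; the high-bit information is simply re-expressed using ASCII characters.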

However, the crux of my question is: even if data is encoded in a 7bit friendly way, does the physical transmission of bits between SMTP servers occur in 7bit units, or in 8bit units?

My assumption is that it happens in 8bits, even if the data is encoded in ASCII. Is this correct?

Here are some relevant links I found:

<< Users send billions of 8-bit messages every year. As far as I know, all servers can handle 8-bit messages. A few years ago I was able to find a few hosts running ancient 7-bit versions of sendmail, but I don't see any now.>>

http://cr.yp.to/smtp/8bitmime.html

<< In practice, however, the body is typically encoded using all eight bits. >>

https://www.ibm.com/support/knowledgecenter/en/SSB27U_6.4.0/com.ibm.zvm.v640.kiml0/smtmlfr.htm

<< This does not cause problems in practice, since virtually all modern mail relays are 8-bit clean >>

https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol#8BITMIME

Update

The refinement of my question should be stated as: Do SMTP servers today still clear the high bit and encode the 7bit ASCII using only the lower seven bits, or do they actually use the full octet, giving significance to the MSB?


Solution

  • I think what you are asking is: "Do SMTP clients shift bits when sending messages to an SMTP server such that each character only uses 7 bits and the 8th bit is the start of the next character?"

    If so, no. That has never been the case.

    Since the very beginning, SMTP clients/servers have always used all 8 bits per character.

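    What was limited to 7 bits was the character set, not the transmission unit: SMTP clients and servers used the ASCII character encoding, which does not include the accented characters found in 8bit character encodings such as ISO-8859-1. Each ASCII character was sent in its own octet with the high bit set to 0, and octet values above 127, which ASCII does not define, were treated as undefined.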

    There are likely a number of reasons for this:

    1. ASCII is simple to support
    2. Every locale had its own preferred extended character encoding that was not compatible with other locales - some of which required more than a single byte to represent a character.
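    3. UTF-8 did not exist yet: it was designed in 1992, about a decade after the original SMTP specification (RFC 821, 1982). Unicode itself (and with it UCS2 / UTF-16) was also not published until the early 1990s.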
    4. It was difficult and unrealistic to expect so much software to implement character set conversion between all of the widely used character sets (unicode and charset conversion libraries were not as widely available at the time)
    5. The "MESSAGE" specification that preceded MIME, SMTP, etc. was written for the US "internet" and likely didn't need anything outside of ASCII (hence why the original message specifications e.g. rfc0822 and earlier did not define encoding mechanisms).
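To make the distinction at the top of this answer concrete, here is a minimal sketch contrasting what goes on the wire (one octet per ASCII character, high bit 0) with the bit-shifting scheme the question seems to imagine. The `pack_7bit` function is purely hypothetical and is not something SMTP does; the command string is just a sample.

```python
# Sample SMTP command line (hypothetical host name).
text = "HELO example.org\r\n"

# What SMTP actually sends: one full octet per ASCII character, MSB = 0.
wire = text.encode("ascii")

def pack_7bit(data: bytes) -> bytes:
    """Hypothetical scheme (NOT used by SMTP): pack 7-bit codes back to back."""
    assert all(octet < 0x80 for octet in data), "input must be 7-bit ASCII"
    bits = "".join(format(octet, "07b") for octet in data)
    bits += "0" * (-len(bits) % 8)  # pad the final octet with zero bits
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

packed = pack_7bit(wire)

print(len(wire), len(packed))              # 18 16
print(all(octet < 0x80 for octet in wire)) # True: the high bit is never set
```

The packed form would save two octets for this line, but the first packed octet already mixes bits from `H` and `E`, which is exactly the shifting that SMTP never performed.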