Tags: html, forms, unicode, newline, enctype

Do all kinds of newlines get converted to \r\n when submitted through an HTML form?


The W3C specification states the following for forms with enctype=application/x-www-form-urlencoded:

This is the default content type. Forms submitted with this content type must be encoded as follows:

1) Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').

2) The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
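
As an illustration of the two rules quoted above, here is a minimal Python sketch using the standard library's urllib.parse.urlencode. The field names and values are made up for the example. Note that urlencode only performs the escaping; it does not itself turn other line terminators into CR LF, so the value below already contains one.

```python
from urllib.parse import urlencode

# Hypothetical form fields, listed in document order.
# The comment value already uses CR LF for its line break.
fields = [("name", "Jane Doe"), ("comment", "line one\r\nline two")]

# urlencode escapes with quote_plus by default: spaces become '+',
# other non-alphanumeric characters become %HH, pairs are joined with '&'.
body = urlencode(fields)
print(body)  # name=Jane+Doe&comment=line+one%0D%0Aline+two
```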

Unicode has several kinds of line terminators, namely:

 LF:    Line Feed, U+000A
 VT:    Vertical Tab, U+000B
 FF:    Form Feed, U+000C
 CR:    Carriage Return, U+000D
 CR+LF: CR (U+000D) followed by LF (U+000A)
 NEL:   Next Line, U+0085
 LS:    Line Separator, U+2028
 PS:    Paragraph Separator, U+2029

Are all of these converted to CR LF (\r\n)?


Solution

  • Are all of these converted to CR LF (\r\n)?

    Nope. The HTML4 spec is unclear here on what counts as a line break, but what browsers do, and what HTML5 has gone on to standardise, is that only CR and LF are involved:

    replace every occurrence of a "CR" (U+000D) character not followed by a "LF" (U+000A) character, and every occurrence of a "LF" (U+000A) character not preceded by a "CR" (U+000D) character, by a two-character string consisting of a "CRLF" (U+000D U+000A) character pair

    (IE doesn't conform to this exactly, as it treats LFCR as a single newline. But it's close enough.)
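
The normalisation rule quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as worded, not what any particular browser runs; the function name normalize_newlines is made up for the example.

```python
import re

# Matches a CR not followed by LF, or an LF not preceded by CR.
# An existing CR LF pair matches neither alternative and is left alone.
_LONE_CR_OR_LF = re.compile(r"\r(?!\n)|(?<!\r)\n")


def normalize_newlines(value: str) -> str:
    """Replace every lone CR or lone LF with a CR LF pair."""
    return _LONE_CR_OR_LF.sub("\r\n", value)


# Lone CR, lone LF and CR LF all end up as CR LF.
assert normalize_newlines("a\rb\nc\r\nd") == "a\r\nb\r\nc\r\nd"

# VT, FF, NEL, LS and PS are not touched by this rule.
others = "\x0b\x0c\x85\u2028\u2029"
assert normalize_newlines(others) == others

# Under this rule LF CR counts as two line breaks.
assert normalize_newlines("a\n\rb") == "a\r\n\r\nb"
```

The last assertion shows the LFCR case: applied literally, the rule produces two line breaks there rather than one.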