unicoderustnewlinecarriage-returnlinefeed

How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?


Every programming language has their own interpretation of \n and \r. Unicode supports multiple characters that can represent a new line.

From the Rust reference:

A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.

Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.

What about the following?

In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.

Do I have to come up with my own definition of what a Unicode new-line character is?


Solution

  • There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline-character and how that translates to "new lines". The disagreement is demonstrated by how the batteries-included regex engines treat patterns like $ against a string like \r\r\n\n in multi-line-mode: Are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, like Unicode says) or four (\r, \r, \n, \n, like JS sees it)? Go and Python do not treat \r\n as a single $ and neither does Rust's regex crate; Java's does however. I don't know of any language whose batteries extend newline-handling to any more Unicode characters.

    So the takeaway here is

    If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input that expects that. After all, we had the ASCII Record separator for a gazillion years and everybody uses \t instead as well.

    Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules section LB5 for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)