python newline

What is the difference between a newline character and actually writing something on a new line?


See the following:

Case 1:

Hi\nHello

Case 2:

Hi
Hello

Both of them are found in text files. So, I did some experiments on them to figure out if there are any differences between them:

In [1]: print("""Hi\nHello""")
Hi
Hello

In [2]: print("""Hi
   ...: Hello""")
Hi
Hello

In [3]: s1 = """Hi\nHello"""

In [4]: s2 = """Hi
   ...: Hello"""

In [5]: repr(s1)
Out[5]: "'Hi\\nHello'"

In [6]: repr(s2)
Out[6]: "'Hi\\nHello'"

In [7]: s1.split()
Out[7]: ['Hi', 'Hello']

In [8]: s2.split()
Out[8]: ['Hi', 'Hello']

In [9]: list(s1.encode("utf-8"))
Out[9]: [72, 105, 10, 72, 101, 108, 108, 111]

In [10]: list(s2.encode("utf-8"))
Out[10]: [72, 105, 10, 72, 101, 108, 108, 111]

I cannot find any difference at all between them. Now, if they are the same, how are they represented internally so that they can be shown differently in text?

Update: If you create a text file with the following content:

Hi\nHello
Hi
Hello

And read it:

In [1]: with open("newline.txt", "r") as f:
   ...:     data = f.read()
   ...: 

In [2]: data
Out[2]: 'Hi\\nHello\nHi\nHello'

Then you will see the difference (\\n vs \n).

But when strings from that file are copied and pasted into code, they behave the same:

In [4]: """Hi\nHello
   ...: Hi
   ...: Hello"""
Out[4]: 'Hi\nHello\nHi\nHello'

Here, both of them are \n.


Solution

  • There is no difference between an escaped n, written as "\n" inside a string, and a literal newline character (the Unicode character with code point 10 decimal) inside a string, except that the latter has to be typed inside triple quotes, which allow such a control character to appear literally.

    That is easily verifiable by doing this:

    >>> s1 = "a\nb"
    >>> s2 = """a
    ... b"""
    >>> s1
    'a\nb'
    >>> s2
    'a\nb'
    >>> s1 == s2
    True
    

    Note that other control characters behave the same way. IPython or the Python REPL won't let you type a literal <tab> (Unicode code point 9) character, but if you paste one from another program into a quoted string, Python renders it as the \t sequence when using repr and as whitespace when printing the string.

    In this case, I am typing tab as the \t sequence:

    >>> s3 = "a\tb"
    >>> s3
    'a\tb'
    >>> print(s3)
    a       b
    >>> s3.encode()
    b'a\tb'
    >>> s3.encode()[1]
    9
    

    The multi-line string syntax allowed by triple quotes was introduced into the language long ago to make it easier to create multi-line text. Some code styles or teams may avoid it, because in a larger source file it is hard to combine multi-line text inside triple quotes with the indentation one expects in a Python block (*). If that is the convention in your project, the team might prefer to rely on automatic concatenation of adjacent string literals: ordinary quoted strings grouped by parentheses, one per line, with explicit "\n" escapes to convey the same information. Say some Python source has an SQL snippet embedded; it could be written like this:

    query = (
        "SELECT id, name \n"
        "FROM my_table\n"
        "WHERE name=%s"
    )
     
    

    In contrast to writing:

    query = """\
        SELECT id, name
        FROM my_table
        WHERE name=%s
    """
    
    

    (Note that these two are not equal strings, although they are equivalent SQL: the second one also contains the four spaces of indentation before each line of SQL, and it ends with a newline.)
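    If the extra indentation matters, one common way to strip it is the standard library's textwrap.dedent. The sketch below is only an illustration: the names query_a and query_b are made up for it, and the trailing space after "name" from the snippet above is left out so the two strings can be compared directly.

```python
import textwrap

# Parenthesized, auto-concatenated version with explicit "\n" escapes.
query_a = (
    "SELECT id, name\n"
    "FROM my_table\n"
    "WHERE name=%s"
)

# Triple-quoted version, indented as it would be inside a code block.
query_b = """\
    SELECT id, name
    FROM my_table
    WHERE name=%s
"""

# False: query_b carries the indentation and a trailing newline.
print(query_a == query_b)

# dedent() removes the leading whitespace common to all lines;
# strip() then drops the trailing newline.
print(textwrap.dedent(query_b).strip() == query_a)
```

    The first comparison prints False and the second prints True, so the only differences between the two spellings are the indentation and the final newline.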

    Also, note the lone \ character right after the opening triple quotes: it escapes the literal newline (Unicode code point 10) that follows it, so that newline is not included in the string:

    >>> s3 = """\
    ... a
    ... b
    ... """
    >>> s3
    'a\nb\n'
    >>> s4 = """
    ... a
    ... b
    ... """
    >>> s4
    '\na\nb\n'
    

    Getting back to your examples in the question:

    The interactive environment shows the repr of an expression's result. So when you call repr explicitly, as in your example, what you see is the repr of the repr: Python renders two backslash characters, so that the string returned by repr, when printed, shows how the original string should be typed to be valid Python code.

    Let that sink in.

    >>> s1 = "a\nb"
    >>> s1  # the interactive environment will output the repr of s1
    'a\nb'
    >>> repr(s1)  # the interactive environment will output "the repr of the repr of s1"
    "'a\\nb'"
    >>> print(repr(s1))  # this will print "the repr of s1"
    'a\nb'
    >>> s2 = 'a\nb'  # here, I pasted the "printed repr of s1" as the content of s2,
    >>> s1 == s2  # and, as can be seen, we are back at the beginning
    True
    

    To say it again: repr returns a string that, when printed, shows a valid string literal that can be pasted back as Python source.
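    That round trip can also be checked programmatically rather than by pasting. A minimal sketch, assuming the standard library's ast.literal_eval as a stand-in for "pasting the text back as source" (the variable names are only for illustration):

```python
import ast

s1 = "a\nb\tc"

# repr() gives the source-code spelling of the string, quotes included.
source_text = repr(s1)
print(source_text)

# literal_eval() parses that spelling the way the Python parser would
# parse a string literal, giving back the original string.
round_tripped = ast.literal_eval(source_text)
print(round_tripped == s1)
```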

    (If one were to type two backslashes before the n in a string, they'd be escaping the backslash itself: the "\\n" string contains two characters, a literal backslash (Unicode code point 92 / 0x5c) and the lower-case letter n.)
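    This is also what happens with the text file in the update to the question: escape sequences are interpreted by the Python parser when it reads source code, not when data is read from a file. A backslash followed by n in a file therefore stays two characters. The sketch below recreates that file in a temporary directory (the directory handling is just scaffolding for the example):

```python
import os
import tempfile

# r"..." keeps the backslash and the "n" as two separate characters,
# which is what a text editor stores when you type \n into a file.
file_text = r"Hi\nHello" + "\n" + "Hi" + "\n" + "Hello"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "newline.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(file_text)
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()

first_line = data.splitlines()[0]

print(repr(data))                            # same repr as in the question
print(len(first_line))                       # 9: H, i, \, n, H, e, l, l, o
print(list(first_line.encode("utf-8")[2:4])) # [92, 110]: backslash and "n"
```

    Pasting that same text into Python source is different: there the parser sees the backslash followed by n inside a string literal and turns it into the single character with code point 10.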

    As for the last part of your question:

    Now, if they are the same, how are they represented internally so that they can be shown differently in text?

    They are the same internally, and they are not shown differently in text. There are two (or even more) different ways of typing the same string in Python source code, as you found out, but once parsed into internal objects they are the same string. Then there are two different ways of outputting the same string: one is str() (which the print function uses implicitly), and the other is repr(), which renders otherwise invisible control codes, like code point 10 (newline), in the canonical way they can be typed into Python source: the \n representation.
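    To make both halves of that concrete, here is a small sketch: several source-code spellings that parse to the same three-character string, followed by its two output forms. The list name spellings is only for this example.

```python
# Different ways of typing the same string in Python source.
spellings = [
    "a\nb",               # escape sequence
    "a\x0ab",             # hex escape (two hex digits: 0a)
    "a\u000ab",           # Unicode escape (four hex digits: 000a)
    "a\012b",             # octal escape
    "a" + chr(10) + "b",  # built from the code point
    """a
b""",                     # literal newline inside triple quotes
]

all_equal = all(s == spellings[0] for s in spellings)
print(all_equal)

s = spellings[0]
print(str(s))    # what print() uses: the newline is emitted as a line break
print(repr(s))   # source-code form: the newline is shown as \n
print(len(s), len(repr(s)))
```

    The string itself has 3 characters; its repr has 6, because the quotes and the two-character \n spelling are part of the repr text.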

    (*) There is an ongoing discussion about this on Python Ideas, and if it gains more traction we could get a prefix for triple-quoted strings that removes common leading whitespace from the lines inside them. At the moment the discussion is somewhat stalled.