
Python 3 fails at pdb "b main" with UnicodeDecodeError?


The only similar question I've found is "Django UnicodeDecodeError when using pdb"; unfortunately, the solution there does not apply to this case.

Consider the following code, test.py:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# encoding: utf-8

def subtract(ina, inb):
  myresult = ina - inb
  return myresult

def main():
  y2 = 10
  y1 = 7
  # calculate (y₂-y₁)
  print("Calculating difference between y2: {} and y1: {}".format(y2, y1))
  result = subtract(y2, y1)
  print("The result is: {}".format(result))

if __name__ == '__main__':
  main()

Using Python3 from Anaconda3 on Windows 10:

(base) C:\tmp>conda --version
conda 4.7.12

(base) C:\tmp>python --version
Python 3.7.3

... I can run this program without a problem:

(base) C:\tmp>python test.py
Calculating difference between y2: 10 and y1: 7
The result is: 3

However, if I want to debug/step through this program using pdb, it fails as soon as I type b main to set a breakpoint on the main function:

(base) C:\tmp>python -m pdb test.py
> c:\tmp\test.py(6)<module>()
-> def subtract(ina, inb):
(Pdb) b main
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 648, in do_break
    lineno = int(arg)
ValueError: invalid literal for int() with base 10: 'main'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 659, in do_break
    code = func.__code__
AttributeError: 'str' object has no attribute '__code__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 1701, in main
    pdb._runscript(mainpyfile)
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 1570, in _runscript
    self.run(statement)
  File "C:\ProgramData\Anaconda3\lib\bdb.py", line 585, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "c:\tmp\test.py", line 6, in <module>
    def subtract(ina, inb):
  File "c:\tmp\test.py", line 6, in <module>
    def subtract(ina, inb):
  File "C:\ProgramData\Anaconda3\lib\bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "C:\ProgramData\Anaconda3\lib\bdb.py", line 112, in dispatch_line
    self.user_line(frame)
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 261, in user_line
    self.interaction(frame, None)
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 352, in interaction
    self._cmdloop()
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 321, in _cmdloop
    self.cmdloop()
  File "C:\ProgramData\Anaconda3\lib\cmd.py", line 138, in cmdloop
    stop = self.onecmd(line)
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 418, in onecmd
    return cmd.Cmd.onecmd(self, line)
  File "C:\ProgramData\Anaconda3\lib\cmd.py", line 217, in onecmd
    return func(arg)
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 667, in do_break
    (ok, filename, ln) = self.lineinfo(arg)
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 740, in lineinfo
    answer = find_function(item, fname)
  File "C:\ProgramData\Anaconda3\lib\pdb.py", line 100, in find_function
    for lineno, line in enumerate(fp, start=1):
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 199: character maps to <undefined>
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> c:\programdata\anaconda3\lib\encodings\cp1252.py(23)decode()
-> return codecs.charmap_decode(input,self.errors,decoding_table)[0]
(Pdb) q
Post mortem debugger finished. The test.py will be restarted
> c:\tmp\test.py(6)<module>()
-> def subtract(ina, inb):
(Pdb) q

(base) C:\tmp>

The problem is the comment line: # calculate (y₂-y₁); if it is deleted, then the breakpoint is set without error:

(base) C:\tmp>python -m pdb test.py
> c:\tmp\test.py(6)<module>()
-> def subtract(ina, inb):
(Pdb) b main
Breakpoint 1 at c:\tmp\test.py:10
(Pdb) q

(base) C:\tmp>

I'm slightly surprised by this - wasn't Python3 supposed to be "utf-8 by default"?

Obviously, this is a trivial case where I can easily erase the single comment line that causes the trouble. However, I have a large script with non-ASCII (UTF-8 encoded) characters all over the place, both in comments and in print statements I'd actually want to step through, and it is not really viable to go in and manually change all those instances to ASCII characters.

So, is there a way to cheat Python3's pdb so that it works even if there are UTF-8 encoded characters present in the source code (regardless of whether they are in comments or in actual statements)?


Solution

  • Python 3 treats source files as UTF-8 by default, but the environment in which it is operating does not: it has a default encoding of cp1252.

    You can set the PYTHONIOENCODING environment variable to UTF-8 to override the default encoding, or change the environment to use UTF-8.

    Edit

    I analysed this too hastily. The above solutions apply to fixing unicode errors raised when reading or writing from stdin/stdout, but the problem here is that pdb opens a file for reading without specifying an encoding:

    def find_function(funcname, filename):
        cre = re.compile(r'def\s+%s\s*[(]' % re.escape(funcname))
        try:
            fp = open(filename)
        except OSError:
            return None
    

    If no encoding is specified, according to the io docs Python will default to using the result of locale.getpreferredencoding, which is cp1252 in this case, as the encodings\cp1252.py frame at the bottom of the traceback shows.
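
The failure can be reproduced without pdb. The following is a minimal sketch, assuming the comment line from test.py is stored as UTF-8 on disk; the helper name try_decode is made up for this illustration. It decodes the same bytes once as UTF-8 and once as cp1252:

```python
import locale

# The comment line from test.py; \u2082 and \u2081 are the subscript 2 and 1.
text = "# calculate (y\u2082-y\u2081)"
raw = text.encode("utf-8")  # the bytes as stored in a UTF-8 source file


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or the decoding error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return err


# What open() falls back to when no encoding is passed.
print(locale.getpreferredencoding(False))

# Decoding with the encoding the file was written in round-trips cleanly.
print(try_decode(raw, "utf-8") == text)

# Decoding with cp1252 fails on a byte that cp1252 leaves undefined.
err = try_decode(raw, "cp1252")
print(type(err).__name__, hex(raw[err.start]))
```

The subscript 1 is encoded as the bytes E2 82 81, and 0x81 has no mapping in Python's cp1252 codec, which matches the byte reported in the traceback.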

    One solution might be to set the console locale before running the debugger.

    It may also be possible to set the PYTHONUTF8 environment variable to 1 (available since Python 3.7). Amongst other things, the documentation states that in this mode:

    open(), io.open(), and codecs.open() use the UTF-8 encoding by default.
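
Whether the variable takes effect can be checked from Python itself. The following is a minimal sketch, assuming Python 3.7 or later; the helper name utf8_mode_in_child is made up for this illustration. It starts a child interpreter with PYTHONUTF8 set and reports that interpreter's sys.flags.utf8_mode:

```python
import os
import subprocess
import sys


def utf8_mode_in_child(extra_env):
    """Hypothetical helper: run a child interpreter with extra environment
    variables and return its sys.flags.utf8_mode as a string."""
    env = dict(os.environ, **extra_env)
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# With PYTHONUTF8=1 the child reports that UTF-8 mode is enabled.
print(utf8_mode_in_child({"PYTHONUTF8": "1"}))
```

On the Windows command prompt from the question, the equivalent should be running set PYTHONUTF8=1 before python -m pdb test.py, although this is untested against the exact Anaconda setup shown there.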

    Since I originally answered this question, pdb's behaviour has been changed: when it reads the source file, it now uses the encoding specified in the file's encoding cookie, if present, falling back to UTF-8.
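
A minimal sketch of that approach, assuming the file is opened with tokenize.open(), which detects the encoding cookie and otherwise falls back to UTF-8. It is modelled on the find_function excerpt above and is an illustration, not a verbatim copy of the current pdb source:

```python
import re
import tokenize


def find_function(funcname, filename):
    """Return (funcname, filename, lineno) of the first matching def, or None."""
    cre = re.compile(r'def\s+%s\s*[(]' % re.escape(funcname))
    try:
        # tokenize.open() honours the coding cookie instead of the locale default.
        fp = tokenize.open(filename)
    except OSError:
        return None
    with fp:
        for lineno, line in enumerate(fp, start=1):
            # match() anchors at the start of the line, so only top-level defs hit.
            if cre.match(line):
                return funcname, filename, lineno
    return None
```

Calling find_function("main", "test.py") with this version reads the file as UTF-8 regardless of the console or locale encoding, so the subscript characters in the comment no longer matter.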