Tags: python, casting, urllib, unicode-string, python-3.10

Strange character added when decoding with urllib


I'm trying to parse a query string like this: filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt

Since it mixes escaped bytes with plain text, I tried to convert it into a percent-encoded URL query string like so:

    import urllib.parse

    extended = 'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'
    # Turn the literal \xNN escapes into the characters they name, then percent-encode.
    fixbytes = bytes(extended, 'utf-8')
    fixbytes = fixbytes.decode("unicode_escape")
    algoext = '?' + urllib.parse.quote(fixbytes, safe='?&=')

Printing the value after each of the three steps outputs:

    b'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'

    filename=logo.txtx&filename=.hidden.txt

    ?filename=logo.txt%C2%80%00%00%00%00%00%00%00%00%00%00%00%00%00%00%01x&filename=.hidden.txt

Where does the %C2 byte come from? It's not in the source string and it's not in any of the intermediate steps. What could I do other than manually remove it from the final output string?

P.S. I'm relying on a library to generate the string so changing the way it's represented initially is not an option.


Solution

  • Replacing each literal \x prefix with % also achieves my goal:

    querystring = '?' + extended.replace('\\x', '%')
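
  • A likely explanation for the %C2, with a sketch of one way to avoid it. Decoding with unicode_escape yields a str in which \x80 has become the code point U+0080. urllib.parse.quote encodes str input as UTF-8 by default, and U+0080 is the two bytes C2 80 in UTF-8, hence %C2%80. Assuming every escaped value fits in a single byte (0x00 to 0xFF), passing encoding='latin-1' to quote maps each code point back to exactly one byte. The variable names below are illustrative only.

```python
import urllib.parse

extended = 'filename=logo.txt\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x01x&filename=.hidden.txt'

# unicode_escape turns the literal "\x80" into the code point U+0080 (a str, not a byte).
text = extended.encode('ascii').decode('unicode_escape')

# Default behaviour: str input is encoded as UTF-8, so U+0080 becomes %C2%80.
default_quoted = urllib.parse.quote(text, safe='?&=')

# Assumption: all escapes are single bytes, so Latin-1 maps code point N to byte N.
latin1_quoted = urllib.parse.quote(text, safe='?&=', encoding='latin-1')

querystring = '?' + latin1_quoted
print(querystring)
```

    For this input the result should match the plain replace('\\x', '%') approach. The difference is that quote also percent-encodes any other unsafe characters that happen to appear in the string.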