python buffer-overflow penetration-testing

While scanning for badchars to avoid in a buffer overflow attack, hex number "C2" keeps appearing every second character in the hexdump

I'm learning about buffer overflows because I have an exam on them tomorrow. I've been following this guide, and I'm currently on the step where I use Immunity Debugger to look for badchars. However, I've hit a weird problem: after the hex value "7F", every second byte in the dump appears to be "C2".

I'm using a script that is slightly modified from the guide, since I'm using Python 3 instead of Python 2. The script looks like this:

import socket

ip = "192.168.10.136"
port = 31337
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((ip, port))

offset = 146
eip = "B" * 4
allchars = "\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2a\x2b\x2c\x2d\x2e\x2f\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3a\x3b\x3c\x3d\x3e\x3f\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4a\x4b\x4c\x4d\x4e\x4f\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5a\x5b\x5c\x5d\x5e\x5f\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6a\x6b\x6c\x6d\x6e\x6f\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7a\x7b\x7c\x7d\x7e\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
buffer = "A" * offset + eip + allchars
buffer += "\n"
s.sendall(buffer.encode('utf-8'))

I couldn't find anything online, and this has been hard to troubleshoot, especially since I'm not very experienced with Python 3 or Immunity Debugger. Any help is much appreciated, and I'll do my best to answer questions if I've left out any important information.

Solution

Comment promoted to answer at OP's request.

As @Sören points out, you have encoded your original data as UTF-8. Your original data goes from byte value 01 to FF which, because it is in a string, represents the Unicode codepoints U+0001 to U+00FF.

But the values in the second half of that block, Unicode codepoints U+0080 to U+00FF, are represented in UTF-8 as two bytes. So when you encoded the original data 0x80 0x81 etc as UTF-8 you got the UTF-8 2-byte representations C2 80 C2 81 etc.
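The effect can be reproduced without a debugger. The following is a minimal sketch, not part of the original script: the variable names are made up for illustration, and `bytes.hex()` with a separator assumes Python 3.8 or later. It encodes a few codepoints around the U+007F boundary and prints the resulting bytes.

```python
# A few codepoints on either side of the one-byte / two-byte boundary.
text = "\x7f\x80\x81\xbf\xc0\xff"

# UTF-8: U+0080 and above need two bytes, so a lead byte is inserted.
as_utf8 = text.encode("utf-8")
print(as_utf8.hex(" "))    # 7f c2 80 c2 81 c2 bf c3 80 c3 bf

# Latin-1 maps U+0000..U+00FF one-to-one onto byte values 00..FF.
as_latin1 = text.encode("latin-1")
print(as_latin1.hex(" "))  # 7f 80 81 bf c0 ff
```

Note that the lead byte is C2 only for U+0080 to U+00BF; from U+00C0 onward it becomes C3, so a full dump would show C3 in the last quarter of the range.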

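To fix: make allchars a bytes literal (b"...") and send it as is, without encoding. The other parts of the buffer ("A" * offset, eip and the trailing newline) then need to be bytes as well, because Python 3 will not concatenate str and bytes.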
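As a rough sketch of that fix, the payload from the question could be built entirely from bytes. The helper name `build_payload` is hypothetical, and generating allchars with `range` instead of a long literal is a convenience swapped in here, not something the guide prescribes. The offset of 146 and the omission of 0x0a are taken from the question's script.

```python
def build_payload(offset: int = 146) -> bytes:
    """Build the badchar test buffer as raw bytes (hypothetical helper)."""
    eip = b"B" * 4
    # Every byte value from 0x01 to 0xff except 0x0a, matching the
    # character list in the question's script.
    allchars = bytes(b for b in range(1, 256) if b != 0x0A)
    return b"A" * offset + eip + allchars + b"\n"


payload = build_payload()
print(len(payload))  # 405

# With the socket from the question, send the bytes directly:
# s.sendall(payload)   # no .encode() call
```

Because nothing is encoded on the way out, each value in allchars reaches the target as exactly one byte, and C2 appears only once, in its own position.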

Unicode is a 21-bit system that can accommodate a theoretical 1,114,112 different codepoints (though only 144,697 of them were assigned as of Unicode 14.0, in 2021). UTF-8 does represent some codepoints as a single byte: essentially, the 128 characters of the ASCII character set from 1963 (U+0000 to U+007F). But it follows from the number of codepoints that any representation in 8-bit units will be variable width, with some codepoints occupying two, three or even four bytes.
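The variable width is easy to check interactively. This is a small illustration with arbitrarily chosen sample characters, one from each width class:

```python
# One sample codepoint per UTF-8 width class (chosen for illustration).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)  # [1, 2, 3, 4]

# The codepoint space runs from U+0000 to U+10FFFF inclusive.
total_codepoints = 0x10FFFF + 1
print(total_codepoints)  # 1114112
```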