Is it possible to specify the replacement character used by str(xxx,encoding='utf-8', errors='replace')
to be something other than the diamond-question-mark character (�)?
I am attempting to fix up a GPS data filtering routine in Python for my GoPiGo robot.
The GPS module I am using returns NMEA sentences as plain 8-bit ASCII, each beginning with "$GPxxx", where "xxx" is a three-character code for the type of data in the sentence.
For example, the NMEA sentences arriving over the serial port might look something like this:
$GPRMC,100905.00,A,5533.07171,N,03734.91789,E,2.657,,150325,,,A*76
$GPVTG,,T,,M,2.657,N,4.920,K,A*2A
$GPGGA,100905.00,5533.07171,N,03734.91789,E,1,04,2.81,183.7,M,13.4,M,,*58
$GPGSA,A,3,12,29,20,06,,,,,,,,,8.80,2.81,8.34*0A
$GPGSV,3,1,10,04,04,007,,05,08,138,,06,24,046,14,11,49,070,*77
$GPGSV,3,2,10,12,48,134,18,20,28,107,08,25,78,210,,28,38,290,11*7E
$GPGSV,3,3,10,29,49,251,14,31,24,315,*7E
$GPGLL,5533.07171,N,03734.91789,E,100905.00,A,A*69
Right now I am using str() to decode the raw bytes read from the serial port as UTF-8 and then print the result. Sometimes the initial read contains garbage bytes that raise an exception, so I use errors='ignore' to prevent this:
self.raw_line = str(self.ser.readline().strip(),encoding='utf-8', errors='ignore')
The result is that when I start a serial communication session, some of the first characters read from the input stream are "garbage" characters. This appears to be a characteristic of the way Python reads the stream, as PuTTY on Windows and miniterm on Linux don't show these characters.
Reading GPS sensor for location . . .
If you are not seeing coordinates appear, your sensor needs to be
outside to detect GPS satellites.
Reading GPS sensor for location . . .
JSSH
ubbbbrb95)$GPRMC,131435.00,V,,,,,,,150325,,,N*7C
$GPVTG,,,,,,,,,N*30
$GPGGA,131435.00,,,,,0,00,99.99,,,,,,*67
$GPGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99*30
$GPGSV,3,1,12,05,36,054,,07,03,359,,13,10,076,,15,13,113,*75
...where JSSHubbbbrb95) represents a string of nonsensical characters that appear before the start of valid data.
What I want to do is replace invalid characters with "" instead of the diamond-question-mark.
I understand that I can filter them out. However, it would be much more convenient if I could replace them with "nothing" at the time they're read.
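For reference, the filtering approach mentioned above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `nmea_checksum_ok` and `extract_sentence` are hypothetical, each sentence is assumed to end with `*` followed by a two-digit hex checksum (the XOR of every character between `$` and `*`, as in the samples above), and the garbage is assumed not to contain a `$` of its own.

```python
from functools import reduce


def nmea_checksum_ok(sentence: str) -> bool:
    """Return True if the text between '$' and '*' XORs to the stated checksum."""
    if not sentence.startswith("$") or "*" not in sentence:
        return False
    payload, _, stated = sentence[1:].partition("*")
    # XOR the character codes of the payload together
    computed = reduce(lambda acc, ch: acc ^ ord(ch), payload, 0)
    try:
        return computed == int(stated[:2], 16)
    except ValueError:
        # checksum field is missing or not hexadecimal
        return False


def extract_sentence(raw: bytes):
    """Drop anything before the first '$'; return a validated sentence or None."""
    text = raw.decode("ascii", errors="ignore").strip()
    start = text.find("$")
    if start == -1:
        return None
    candidate = text[start:]
    return candidate if nmea_checksum_ok(candidate) else None


# Leading garbage (both undecodable bytes and plain ASCII junk) is discarded
print(extract_sentence(b"\xff\xfeubbbbrb95)$GPVTG,,,,,,,,,N*30\r\n"))
```

Unlike a decode error handler, this also discards junk that happens to be valid ASCII, because it keys on the `$` start marker rather than on decoding failures.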
Is it possible to specify a different replacement character when using errors='replace'?
I don't think errors='replace' lets you choose the replacement character. However, you can register a custom error handler that replaces invalid bytes with a string of your choice.
You're going to need the codecs module from the standard library to do that:
import codecs
def remove_invalid_bytes(error):
    # error is a UnicodeDecodeError
    # return a tuple: (replacement string, position to continue)
    return ("", error.end)
# register the custom error handler
codecs.register_error("remove", remove_invalid_bytes)
# now use the custom error handler when decoding
raw_bytes = b"some invalid data: \xff\xfe$GPRMC,..."
decoded_string = str(raw_bytes, encoding="utf-8", errors="remove")
print(decoded_string)
This should output: some invalid data: $GPRMC,...
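If you want a visible replacement other than the empty string, the same idea can be generalized. The sketch below is a minimal illustration, not the only way to do it: `make_replacer` and the handler name "replace_with_qmark" are hypothetical names chosen for this example, while `codecs.register_error` is the standard-library call used above.

```python
import codecs


def make_replacer(replacement: str):
    """Build a decode error handler that substitutes `replacement` for each invalid byte run."""
    def handler(error):
        if not isinstance(error, UnicodeDecodeError):
            raise error  # this sketch only handles decoding errors
        # (replacement text, position at which decoding should resume)
        return (replacement, error.end)
    return handler


# register a handler under a hypothetical name of our choosing
codecs.register_error("replace_with_qmark", make_replacer("?"))

raw_bytes = b"\xff\xfe$GPRMC,..."
print(raw_bytes.decode("utf-8", errors="replace_with_qmark"))
```

Two caveats. A handler that returns "" behaves the same as the built-in errors='ignore'. Also, error handlers are only called for bytes that fail to decode, so garbage that happens to be valid ASCII or UTF-8 will pass through untouched and would still need to be filtered separately.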