Given a stream of bytes (generator, file, etc.), how can I read a single utf-8 encoded character?
I could approach this by rolling my own utf-8 decoding function, but I would prefer not to reinvent the wheel, since I'm sure this functionality must already be used elsewhere to parse utf-8 strings.
Wrap the stream in a TextIOWrapper with encoding='utf8', then call .read(1) on it.
This assumes you started with a BufferedIOBase or something duck-type compatible with it (i.e. it has a read() method). If you have a generator or iterator, you may need to adapt the interface.
Example:
from io import TextIOWrapper

with open('/path/to/file', 'rb') as f:
    wf = TextIOWrapper(f, 'utf-8')
    wf._CHUNK_SIZE = 1  # Implementation detail, may not work everywhere
    wf.read(1)  # gives the next character, decoded from utf-8
    f.read(1)   # gives the next byte
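For the generator or iterator case, one possible way to adapt the interface is a small io.RawIOBase subclass that feeds chunks into readinto(), wrapped in a BufferedReader so that TextIOWrapper accepts it. This is only a sketch: IterStream and byte_chunks are hypothetical names, not part of the standard library, and it assumes the iterator yields bytes objects (not individual ints).

```python
import io


class IterStream(io.RawIOBase):
    """Hypothetical adapter: exposes an iterator of bytes chunks as a raw stream."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill from the iterator, skipping empty chunks.
        # Returning 0 signals end of stream.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def byte_chunks():
    # Stand-in generator (assumption); the second character is
    # deliberately split across two chunks.
    yield b"h"
    yield b"\xc3"
    yield b"\xa9!"


text = io.TextIOWrapper(io.BufferedReader(IterStream(byte_chunks())), encoding="utf-8")
first = text.read(1)   # one character, here a single byte
second = text.read(1)  # one character assembled from two bytes
rest = text.read()     # whatever remains
```

Note that the wrapper may pull more data from the iterator than the one character it returns, so this suits cases where all reads go through the wrapper.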