python-3.x, utf-8, utf8-decode

Read utf-8 character from byte stream


Given a stream of bytes (generator, file, etc.) how can I read a single utf-8 encoded character?

I could approach this by rolling my own utf-8 decoding function, but I would prefer not to reinvent the wheel, since this functionality must already exist somewhere to parse utf-8 strings.


Solution

  • Wrap the stream in an io.TextIOWrapper with encoding='utf-8', then call .read(1) on it. Each call returns one decoded character, however many bytes it occupies.

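    This assumes you started with a BufferedIOBase or something duck-type compatible with it, i.e. an object with a read() method. In practice TextIOWrapper also calls readable(), writable() and seekable() and checks closed on the underlying object. If you have a generator or iterator, you need to adapt it to that interface first.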

    Example:

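    from io import TextIOWrapper

    with open('/path/to/file', 'rb') as f:
        wf = TextIOWrapper(f, encoding='utf-8')
        # Private CPython attribute: makes the wrapper pull one byte at a time,
        # so it should not read ahead of the character it returns.
        # Implementation detail, may not work everywhere.
        wf._CHUNK_SIZE = 1

        char = wf.read(1)  # next decoded character (a str of length 1, '' at EOF)
        byte = f.read(1)   # next raw byte following that character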
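
For the generator/iterator case mentioned above, here is a minimal sketch of one way to adapt the interface. It assumes the iterator yields bytes objects of any length. IterStream and byte_chunks are hypothetical names made up for this illustration, not part of the standard library. The adapter subclasses io.RawIOBase and implements only readable() and readinto(); io.BufferedReader and io.TextIOWrapper supply the rest.

```python
import io


class IterStream(io.RawIOBase):
    """Hypothetical adapter: exposes an iterable of bytes chunks as a raw stream."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill from the iterator, skipping any empty chunks.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # 0 bytes read signals end of stream
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def byte_chunks():
    # Sample data (assumption): "é" is deliberately split across two chunks.
    yield b"h"
    yield b"\xc3"
    yield b"\xa9"
    yield "€!".encode("utf-8")


text = io.TextIOWrapper(io.BufferedReader(IterStream(byte_chunks())), encoding="utf-8")
chars = [text.read(1) for _ in range(4)]
print(chars)
```

Because the wrapper decodes incrementally, a character whose bytes arrive in separate chunks is still returned whole. The wrapper may buffer ahead, so this sketch is not suitable if you also need to read raw bytes from the same iterator afterwards.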