python unicode fixed-length-record

Unpack fixed width unicode file line with special characters. Python UnicodeDecodeError


I am trying to parse each line of a database file to get it ready for import. The lines are fixed width, but the widths are measured in characters, not in bytes. I have coded something based on Martineau's answer, but I am having trouble with the special characters.

Sometimes they break the expected widths; other times they raise a UnicodeDecodeError. I believe the decode error could be fixed, but can I keep using struct.unpack and still decode the special characters correctly? I think the problem is that they are encoded as multiple bytes, which throws off the expected field widths, since I understand those to be in bytes and not in characters.

import os, csv

def ParseLine( arquivo):
    import struct, string   
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)

with open("Sample data.txt", 'r') as arq: 
    with open("Out" + ".csv", "w", newline ='') as sai: 
        Write = csv.writer(sai, delimiter= ";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq): 
            Write([line]) 

Sample data:

|     field 1|      field 2     |     field 3    |
| sreaodrsa  | raesodaso t.thl o| .tdosadot. osa |
| resaodra   | rôn. 2x  17/220V | sreao.tttra v  |
| esarod sê  | raesodaso t.thl o| .tdosadot. osa |
| esarod sa í| raesodaso t.thl o| .tdosadot. osa |

Actual output:

field 1;field 2;field 3
sreaodrsa;raesodaso t.thl o;.tdosadot. osa
resaodra;rôn. 2x  17/22;V | sreao.tttra

In the output we see lines 1 and 2 are as expected. Line 3 got wrong widths, probably due to the multibyte ô. Line 4 throws the following exception:

Traceback (most recent call last):
  File "C:\Sample\FindSample.py", line 18, in <module>
    for line in ParseLine(arq):
  File "C:\Sample\FindSample.py", line 9, in ParseLine
    fields = unpack(line)
  File "C:\Sample\FindSample.py", line 7, in <lambda>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
  File "C:\Sample\FindSample.py", line 7, in <genexpr>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data

I will need to perform specific operations on each field, so I can't use a re.sub on the whole file as I was doing before. I would like to keep this code, as it seems efficient and is on the brink of working. If there is a much more efficient way to parse, I could give it a try, though. I need to keep the special characters.


Solution

  • Indeed, the struct approach falls down here because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of codepoints.
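
To make the mismatch concrete, here is a small sketch (the variable names are illustrative, not from the question) that compares the character count and the UTF-8 byte count of the first field in the last sample row, and then cuts it at 12 bytes the way a "12s" struct code would:

```python
# First field of the last sample row: 12 characters wide.
field = " esarod sa í"

encoded = field.encode("utf-8")
print(len(field))      # number of characters
print(len(encoded))    # number of bytes; 'í' needs two bytes in UTF-8

# Cutting at a byte offset can split a multi-byte character in half.
truncated = encoded[:12]
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

The field is 12 characters but 13 bytes, so the 12-byte cut ends on the lead byte of 'í' and decoding fails, which is the same kind of error as in the traceback.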

    I wouldn't use struct here at all. Your lines are already decoded to Unicode strings, so just use slicing to extract your data:

    def ParseLine(arquivo):
        slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
        for line in arquivo:
            yield [line[s].strip() for s in slices]
    

    This deals entirely in characters in an already decoded line, rather than bytes. If you have field widths instead of indices, the slice() objects could also be generated:

    def widths_to_slices(widths):
        pos = 0
        for width in widths:
            pos += 1  # delimiter
            yield slice(pos, pos + width)
            pos += width
    
    def ParseLine(arquivo):
        widths = (12, 18, 16)
        # Build the slices once instead of regenerating them for every line.
        slices = list(widths_to_slices(widths))
        for line in arquivo:
            yield [line[s].strip() for s in slices]
    
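
As a quick sanity check, the generator can be compared against the hand-written indices. This sketch repeats the function so it runs on its own, and assumes the one-character delimiter before each field that the sample data uses:

```python
def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # skip the one-character delimiter before each field
        yield slice(pos, pos + width)
        pos += width

slices = list(widths_to_slices((12, 18, 16)))
print(slices)

# The generated slices should match the hard-coded ones used earlier.
print(slices == [slice(1, 13), slice(14, 32), slice(33, 49)])
```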

    Demo:

    >>> sample = '''\
    ... |     field 1|      field 2     |     field 3    |
    ... | sreaodrsa  | raesodaso t.thl o| .tdosadot. osa |
    ... | resaodra   | rôn. 2x  17/220V | sreao.tttra v  |
    ... | esarod sê  | raesodaso t.thl o| .tdosadot. osa |
    ... | esarod sa í| raesodaso t.thl o| .tdosadot. osa |
    ... '''.splitlines()
    >>> def ParseLine(arquivo):
    ...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    ...     for line in arquivo:
    ...         yield [line[s].strip() for s in slices]
    ... 
    >>> for line in ParseLine(sample):
    ...     print(line)
    ... 
    ['field 1', 'field 2', 'field 3']
    ['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa']
    ['resaodra', 'rôn. 2x  17/220V', 'sreao.tttra v']
    ['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa']
    ['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa']
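
For completeness, here is a rough sketch of how the slicing parser could feed the csv writer from the question. It uses io.StringIO objects as stand-ins for the input and output files so that it runs anywhere, and the name parse_line is only illustrative. With real files you would pass an explicit encoding to open(); which encoding is correct depends on your data and is not stated in the question.

```python
import csv
import io

SLICES = (slice(1, 13), slice(14, 32), slice(33, 49))

def parse_line(lines, slices=SLICES):
    # Slice each already-decoded line by character position.
    for line in lines:
        yield [line[s].strip() for s in slices]

# Stand-in for something like open("Sample data.txt", encoding=...).
source = io.StringIO(
    "|     field 1|      field 2     |     field 3    |\n"
    "| resaodra   | rôn. 2x  17/220V | sreao.tttra v  |\n"
    "| esarod sa í| raesodaso t.thl o| .tdosadot. osa |\n"
)

# Stand-in for open("Out.csv", "w", newline="", encoding=...).
out = io.StringIO(newline="")
writer = csv.writer(out, delimiter=";", quoting=csv.QUOTE_MINIMAL)
writer.writerows(parse_line(source))

print(out.getvalue())
```

Because the generator yields one list per row, it can be passed straight to writerows() without wrapping each row in another list.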