Tags: python, performance, encoding, utf-8, fixed-width

Dealing with non-ASCII characters when parsing fixed-width txt file


So I have a series of huge files (several GB) of tabular data. They are plain-text files, and each column has a fixed width, indicated by a number of dashes right below the headers. So far so good: I have a script that reads those files line by line and outputs them to XML.

One challenge is that most, but NOT all, of the content is encoded in UTF-8. Trying to decode the content while processing will throw an error somewhere down the line. Hence, my script only reads and processes byte strings. This causes readability issues in the output, but that's tolerable and not my concern.

My problem: the widths were calculated with the decoded content in mind, so non-ASCII characters that are represented by several bytes in UTF-8 are not accounted for.

Example: the string "Zürich, Albisgütli" has a length of 18 characters and sits in a column with a fixed width of 19. Encoded as UTF-8, however, the string becomes "Z\xc3\xbcrich, Albisg\xc3\xbctli", which is 20 bytes long and thus throws off the parsing of the rest of the data row.
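
For illustration, a quick check in a Python shell shows the mismatch:

>>> s = "Zürich, Albisgütli"
>>> len(s)                    # characters
18
>>> len(s.encode("utf-8"))    # bytes: each "ü" takes two
20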

Solution attempts so far:

One issue is that I'm using copied code for the parsing, so I don't know how I could change its behavior to count non-ASCII characters differently.

Thankful for any pointers as to what a possible approach could be!

Code as it is now:

import re
import struct
import time

# get_widths, xml_escape, table_w_num, infostep and start_time are defined
# elsewhere in the script.

def convert(infile, outfile):
    secondline = infile.readline() # the dashes are in this line
    maxlen = len(secondline)
    fieldwidths = get_widths(secondline) # counts the dashes to get the widths

    # code taken from: https://stackoverflow.com/a/4915359/9021715
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from

    c = 0

    outfile.write(b"<?xml version='1.0' encoding='UTF-8'?>\n")

    namespace = f'xmlns="http://www.bar.admin.ch/xmlns/siard/2/table.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.bar.admin.ch/xmlns/siard/2/table.xsd {table_w_num}.xsd" version="2.1"'.encode()

    outfile.write(b'<table ' + namespace + b'>\n')
    for line in infile:
        # pad short lines with spaces so unpack_from always has enough bytes
        padded_line = line.ljust(maxlen)

        data = [elem.strip() for elem in parse(padded_line)]

        if b"Albis" in line:  # debug output for the problematic row
            print(line)
            print(data)

        row = b''
        for n, elem in enumerate(data, 1):
            # Timestamp fix: "YYYY-MM-DD hh:mm:ss[.fff]" -> "YYYY-MM-DDThh:mm:ss[.fff]Z"
            elem = re.sub(rb"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}(\.\d+)?)\S*?",
                          rb"\g<1>T\g<2>Z", elem)
            if elem not in (b'', b'NULL'):
                row += b'<c%d>%s</c%d>' % (n, xml_escape(elem), n)
        row = b'<row>%s</row>' % row
        outfile.write(row + b'\n')

        c += 1
        if c % infostep == 0:
            timestamp = int(time.time() - start_time)
            print(f"{c} rows done, time elapsed: {timestamp} seconds")

    outfile.write(b'</table>')

EDIT:

Now I'm trying to get away from my handwritten and likely-to-fail code, but the problem from the second paragraph is kicking in. After a few thousand rows I suddenly find this:

b'Z\xfcrich'

This decodes fine as ANSI/Windows-1252. Running the ftfy library on the whole file first somehow didn't catch this. I'm hesitant to write chaotic code with a bunch of try/except blocks that attempt to decode the lines; I don't even know whether whole lines are suddenly in ANSI or just single fields.
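
For reference, the kind of fallback I mean would be something like this, applied per field since I don't know the granularity of the problem:

def decode_field(raw: bytes) -> str:
    # try UTF-8 first, fall back to Windows-1252 for stray legacy bytes;
    # b'Z\xfcrich' fails as UTF-8 but decodes fine as cp1252
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252")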


Solution

  • The first snippet in the related answer only works for single-byte codepages because it counts bytes, not characters. It doesn't even work for UTF-16, which usually uses 2 bytes per character, and certainly not for UTF-8, which uses a variable number of bytes. That was pointed out in the comments.

    The same answer shows how to handle UTF-8 in its third snippet:

    from itertools import accumulate, zip_longest
    def make_parser(fieldwidths):
        cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
        pads = tuple(fw < 0 for fw in fieldwidths) # bool flags for padding fields
        flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
        slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
        parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
        # Optional informational function attributes.
        parse.size = sum(abs(fw) for fw in fieldwidths)
        parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                   for fw in fieldwidths)
        return parse
    

    Using it on a string containing the question's example text produces the expected results. Note that the original string has 18, not 19, characters:

    >>> parser = make_parser([18,3,4])
    >>> line="Zürich, Albisgütli3456789"
    >>> parser(line)
    ('Zürich, Albisgütli', '345', '6789')
    

    And to prove this really does work with UTF-8 text files:

    with open('testutf8.txt',encoding="utf-8",mode='w') as f:
        for i in range(3):
            f.write(line)
            f.write('\n')
    
    with open('testutf8.txt',encoding="utf-8",mode='r') as f1:
        for line in f1.readlines():
            print(parser(line))
    -------
    ('Zürich, Albisgütli', '345', '6789')
    ('Zürich, Albisgütli', '345', '6789')
    ('Zürich, Albisgütli', '345', '6789')
    

    The question's code should be split into separate functions, one to read the data and another to generate the XML output. Both operations are already available through various modules, though: there are several modules that can read fixed-width files, several XML parsing and serialization libraries, and some, like Pandas, can read multiple data formats, process the data and export them as XML. A rough sketch of that split follows below.
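
    A sketch of the split, building on make_parser from above (the function names are illustrative):

    def read_rows(infile, fieldwidths):
        # parse each line into a list of stripped field values
        parse = make_parser(fieldwidths)
        for line in infile:
            yield [field.strip() for field in parse(line)]

    def write_xml(rows, outfile):
        # serialize the parsed rows to XML, e.g. with ElementTree (see below)
        ...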

    Using Pandas

    Pandas is one of the most popular, if not the most popular, Python modules for analytics and data science. It's a great tool for data processing too, and it's included in Python distributions created for analysis, like Anaconda.

    For example, with Pandas this code could be replaced with just two function calls:

    import pandas as pd

    namespaces = {
        "": "http://www.bar.admin.ch/xmlns/siard/2/table.xsd",  # empty key = default namespace
        "xsi": "http://www.w3.org/2001/XMLSchema-instance",
        ...
    }

    df = pd.read_fwf('data.csv')
    df.to_xml('data.xml', root_name='table', namespaces=namespaces)


    This will be faster and use less memory than the explicit string manipulation in the question's code: those manipulations create a new temporary string each time, costing both CPU and RAM.
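
    If you do keep the manual approach, collecting the fragments in a list and joining once avoids rebuilding the row bytestring on every concatenation; a sketch, reusing xml_escape and data from the question's code:

    parts = []
    for n, elem in enumerate(data, 1):
        if elem not in (b'', b'NULL'):            # skip empty and NULL fields
            parts.append(b'<c%d>%s</c%d>' % (n, xml_escape(elem), n))
    row = b'<row>%s</row>' % b''.join(parts)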

    By default read_fwf will try to infer the column widths from the first 100 rows. You can increase that number with the infer_nrows parameter, or specify a list of (from, to) tuples with the colspecs parameter, e.g. colspecs=[(1,3),(3,5),(10,14)]. For the question's files the colspecs can even be derived from the dash ruler itself, as sketched below.
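
    A sketch of that derivation (file name hypothetical, assuming fields are runs of dashes separated by spaces, as in the question):

    import re
    import pandas as pd

    with open('data.txt', encoding='utf-8') as f:
        f.readline()              # skip the header line
        ruler = f.readline()      # the line of dashes

    # each run of dashes marks one column's (from, to) span
    colspecs = [(m.start(), m.end()) for m in re.finditer(r'-+', ruler)]
    df = pd.read_fwf('data.txt', colspecs=colspecs, skiprows=[1])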

    to_xml offers several parameters for controlling the XML output: the namespaces, the element names to use for the root and the rows, which columns to output as attributes and which as child elements, and so on. It can even write to a compressed file, as sketched below.
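
    For instance, writing a gzip-compressed file with custom root and row element names (output path hypothetical; the compression is inferred from the .gz suffix):

    df.to_xml('data.xml.gz', root_name='table', row_name='row', namespaces=namespaces)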

    Attribute names can be specified through the attr_cols parameter, e.g.:

    df.to_xml(attr_cols=[
              'index', 'shape', 'degrees', 'sides'
              ]) 
    

    You can also rename DataFrame columns, or change their type, e.g. to parse string fields into dates or numbers:

    df['Timestamp'] = pd.to_datetime(df['Col3'], infer_datetime_format=True)
    df = df.rename(columns={'Col1': 'Bananas', ...})
    

    Using Standard Library xml modules

    Even if you can't use Pandas to solve the entire problem in a couple of lines, you can use one of Python's XML processing modules. These are part of the Python Standard Library, which means they're available in all distributions.

    The Building XML Documents example of the ElementTree module shows how to create XML docs programmatically:

    >>> import xml.etree.ElementTree as ET
    >>> a = ET.Element('a')
    >>> b = ET.SubElement(a, 'b')
    >>> c = ET.SubElement(a, 'c')
    >>> d = ET.SubElement(c, 'd')
    >>> ET.dump(a)
    <a><b /><c><d /></c></a>
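
    Applied to this task, a minimal sketch might combine the make_parser function from above with ElementTree (file names are hypothetical, and parser is the make_parser(...) result):

    import xml.etree.ElementTree as ET

    table = ET.Element('table')
    with open('data.txt', encoding='utf-8') as f:
        f.readline()                          # header line
        f.readline()                          # dash ruler
        for line in f:
            row = ET.SubElement(table, 'row')
            for n, field in enumerate(parser(line), 1):
                value = field.strip()
                if value and value != 'NULL':  # skip empty and NULL fields
                    ET.SubElement(row, f'c{n}').text = value
    ET.ElementTree(table).write('data.xml', encoding='utf-8', xml_declaration=True)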
    

    Corporate Policies

    That's not a logical argument. It may sound like one to non-technical managers with limited security experience, and certainly limited Python (or C#, Go, JavaScript) experience. Security is one of the main reasons even "standard" libraries/modules/assemblies are distributed and updated through package managers. Nobody wants (or can afford) to wait two years for security and bug fixes any more.

    The Python Standard Library contains modules that started as third-party packages. That's certainly the case with ElementTree. In many cases the docs advise using external packages to handle more complex cases.

    Besides, Pandas is almost certainly used in the company already; it's one of the most common, if not the most common, libraries in analytics and data science, and distributions like Anaconda already include it.

    Does whoever wrote that policy understand they're telling you to keep using already-disclosed security vulnerabilities and consciously deploy insecure code? Because that's what you get if you don't upgrade packages, and every Python distribution comes with a lot of packages that need updating after a while.

    Finally, with hand-written code you have to reinvent and support functionality that has already been written, tested and vetted for years in various packages. That's time not spent producing valuable, billable work for the company.

    A rule of thumb is that a company needs to make 2-3x your gross salary to justify the work. How much will hand-written text reading and XML generation cost over the product's lifetime, once you include bug fixes and the downtime incurred by those bugs?