pythonpandasdataframefix-protocol

Normalize FIX logs


I have a log file (FIX) that I'm trying to convert to csv with headers for example:

8=FIX.4.2|9=435|35=8|34=8766|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8767|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8768|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|

Looking to normalize into a csv as follows:

8,9,35,34,49,50,52,56
FIX.4.2,435,8,8766,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8,8767,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
FIX.4.2,435,8,8768,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID

Reading the file log:

with open(some.txt) as file:
        data = file.read()

fix = pd.read_csv(data,sep='|')
print(fix)

I have some regex for the headers and parsing the data:

# regexforheader = re.compile("(?<=\|)(.*?)(?==)")
# regexRowData = re.compile="(?<=\=)(.*?)(?=\|)"

Stumped how to put it all together.

edit1: throws an error when new field is introduced:

8=FIX.4.2|9=435|35=8|34=8766|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8767|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8768|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|
8=FIX.4.2|9=435|35=8|34=8768|49=SENDERCOMPID|50=ET1|52=20230228-14:31:17.796|56=TARGETCOMPID|109=test|

Solution

  • With , you can use str.split and a pivot:

    df = pd.read_csv('in.csv', sep='|', header=None)
    
    (df.stack().str.split('=', n=1, expand=True)
       .droplevel(1).pivot(columns=0, values=1)
       .sort_index(axis=1, key=lambda x: x.astype(int))
       .to_csv('out.csv', index=False)
    )
    

    Using the csv module (and assuming the first line defines all the columns):

    import re
    import csv
    
    with open('in.csv') as f_in, open('out.csv', 'w') as f_out:
        first = True
        for l in csv.reader(f_in, delimiter='|'):
            d = dict(x.split('=') for x in l if x)
            if first:
                writer = csv.DictWriter(f_out, d.keys(), delimiter=',')
                writer.writeheader()
                first = False
            writer.writerow(d)
    

    Output:

    8,9,34,35,49,50,52,56
    FIX.4.2,435,8766,8,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
    FIX.4.2,435,8767,8,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
    FIX.4.2,435,8768,8,SENDERCOMPID,ET1,20230228-14:31:17.796,TARGETCOMPID
    

    timings

    on 300k rows of input:

    # pandas version
    3.36 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # csv version
    1.61 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)