pandasdataframeparsingtextcolumn-width

Parsing Text file with irregular column width into Pandas


I need help in parsing below text into Pandas dataframe columns

(11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748  Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs

Below is the expected output.

Column 1 Column 2 Column 3 Column 4
(11/15/22 2:48:46 PM EST) 14:48:33.671 42300146 GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 31009225 GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 28624922 Timed out waiting for incoming connections, retry after couple of secs

Note: The first and fourth column have different width.

I have tried to use read_fwf with specific width but didn't produce the output that I want. I couldn't use read_csv since there is no delimiter (other than whitespace) between columns.

Any help would be appreciated!

Thank you in advance.


Solution

  • You can use regular expression to parse the text (regex demo.)

    text = """\
    (11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
    (11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748  Message Type Received: Slots
    (11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs"""
    
    import re
    import pandas as pd
    
    pat = re.compile(r"(\(.*?\))\s+(\S+)\s+-\s+(\d+):\s+(.*)")
    
    data = []
    for line in text.splitlines():
        m = pat.search(line)
        if m:
            data.append(m.groups())
    
    df = pd.DataFrame(
        data, columns=["Column 1", "Column 2", "Column 3", "Column 4"]
    )
    print(df)
    

    Prints:

                         Column 1      Column 2  Column 3                                                                Column 4
    0   (11/15/22 2:48:46 PM EST)  14:48:33.671  42300146            GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
    1  (11/15/22 11:55:59 AM EST)  11:40:22.750  31009225              GWS: 2022-11-15 15:10:22.748  Message Type Received: Slots
    2    (11/8/22 8:03:39 AM EST)  07:32:38.188  28624922  Timed out waiting for incoming connections, retry after couple of secs