I need help in parsing below text into Pandas dataframe columns
(11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs
Below is the expected output.
Column 1 | Column 2 | Column 3 | Column 4 |
---|---|---|---|
(11/15/22 2:48:46 PM EST) | 14:48:33.671 | 42300146 | GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal. |
(11/15/22 11:55:59 AM EST) | 11:40:22.750 | 31009225 | GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots |
(11/8/22 8:03:39 AM EST) | 07:32:38.188 | 28624922 | Timed out waiting for incoming connections, retry after couple of secs |
Note: The first and fourth column have different width.
I have tried to use read_fwf with specific width but didn't produce the output that I want. I couldn't use read_csv since there is no delimiter (other than whitespace) between columns.
Any help would be appreciated!
Thank you in advance.
You can use regular expression to parse the text (regex demo.)
text = """\
(11/15/22 2:48:46 PM EST) 14:48:33.671 - 42300146: GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
(11/15/22 11:55:59 AM EST) 11:40:22.750 - 31009225: GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
(11/8/22 8:03:39 AM EST) 07:32:38.188 - 28624922: Timed out waiting for incoming connections, retry after couple of secs"""
import re
import pandas as pd
pat = re.compile(r"(\(.*?\))\s+(\S+)\s+-\s+(\d+):\s+(.*)")
data = []
for line in text.splitlines():
m = pat.search(line)
if m:
data.append(m.groups())
df = pd.DataFrame(
data, columns=["Column 1", "Column 2", "Column 3", "Column 4"]
)
print(df)
Prints:
Column 1 Column 2 Column 3 Column 4
0 (11/15/22 2:48:46 PM EST) 14:48:33.671 42300146 GWS: 2022-11-15 18:18:33.638 RESTQueue No request to signal.
1 (11/15/22 11:55:59 AM EST) 11:40:22.750 31009225 GWS: 2022-11-15 15:10:22.748 Message Type Received: Slots
2 (11/8/22 8:03:39 AM EST) 07:32:38.188 28624922 Timed out waiting for incoming connections, retry after couple of secs