I am pulling some JSON formatted log data out of my SEIM and into a pandas dataframe. I am able to easily convert the JSON into multiple columns within the dataframe, but there is a "message" field in the JSON that contains a quoted CSV, like this.
# dummy data
dfMyData = pd.DataFrame({"_raw": [\
"""{"timestamp":1691096387000,"message":"20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,","logstream":"Blah1","loggroup":"group 1"}""",
"""{"timestamp":1691096386000,"message":"20230803 21:00:47,ip-456-456-456-456,mouse,654.654.654.654,222222,10673010,type,,'I am another quoted string',0,,","logstream":"Blah2","loggroup":"group 2"}"""
]})
# Column names for the _raw.message field that is generated.
MessageColumnNames = ["Timestamp","dest_host","username","src_ip","port","number","type","who_knows","message_string","another_number","who_knows2","who_knows3"]
# Convert column to json object/dict
dfMyData['_raw'] = dfMyData['_raw'].map(json.loads)
# convert JSON into columns within the dataframe
dfMyData = pd.json_normalize(dfMyData.to_dict(orient='records'))
I've seen this done before with str.split()
to split on columns and then concat it back to the original dataframe, however the str.split
method doesn't handle quoted values within the CSV. pd.read_csv
can handle the quoted CSV correctly, but I can't figure out how to apply it across the dataframe and expand the output of that into new dataframe columns.
Additionally, when I split dfMyData['_raw.message']
out into new columns, I'd also like to supply a list of column names for the data and have the new columns be created with those names.
Anyone know of an easy way to split a quoted CSV string in a dataframe column into new named columns within the dataframe?
Update
Actually all you need is to pass your data into a csv-reader, which in turn is an appropriate data type for pandas.DataFrame
:
pd.DataFrame(csv.reader(dfMyData['_raw.message'], quotechar="'"), columns=columns)
Previous answer
We can try to convert the data into a csv and read them back with appropriate parameters:
import csv
from tempfile import TemporaryFile
seq = dfMyData.iloc[:,1] # column of interest in the original data
columns = [*'ABCDEFGHIJKL'] # custom names of future data columns
with TemporaryFile() as file:
seq.to_csv(
file,
sep='\N{unit separator}',
header=False,
index=False,
quoting=csv.QUOTE_NONE
)
file.seek(0) # read data from the start
df = pd.read_csv(
file,
header=None,
names=columns,
quotechar="\'"
)
print(df)
Notes:
quoting=csv.QUOTE_NONE
to avoid \"
at the ends of each linesep='\N{unit separator}'
to avoid confusion with commasquotechar="\'"
when reading back because of a specific quoting inside the lines'\N{unit separator}'
delimiter will never make it into the final dataTransformed data: