pythonpandasdataframe

Dataframe column with quoted CSV to named dataframe columns


I am pulling some JSON formatted log data out of my SEIM and into a pandas dataframe. I am able to easily convert the JSON into multiple columns within the dataframe, but there is a "message" field in the JSON that contains a quoted CSV, like this.

# dummy data
dfMyData = pd.DataFrame({"_raw": [\
            """{"timestamp":1691096387000,"message":"20230803 20:59:47,ip-123-123-123-123,mickey,321.321.321.321,111111,10673010,type,,'I am a, quoted, string, with commas,',0,,","logstream":"Blah1","loggroup":"group 1"}""",
            """{"timestamp":1691096386000,"message":"20230803 21:00:47,ip-456-456-456-456,mouse,654.654.654.654,222222,10673010,type,,'I am another quoted string',0,,","logstream":"Blah2","loggroup":"group 2"}"""
            ]})
# Column names for the _raw.message field that is generated.
MessageColumnNames =  ["Timestamp","dest_host","username","src_ip","port","number","type","who_knows","message_string","another_number","who_knows2","who_knows3"]
# Convert column to json object/dict
dfMyData['_raw'] = dfMyData['_raw'].map(json.loads)
# convert JSON into columns within the dataframe
dfMyData = pd.json_normalize(dfMyData.to_dict(orient='records'))

I've seen this done before with str.split() to split on columns and then concat it back to the original dataframe, however the str.split method doesn't handle quoted values within the CSV. pd.read_csv can handle the quoted CSV correctly, but I can't figure out how to apply it across the dataframe and expand the output of that into new dataframe columns.

Additionally, when I split dfMyData['_raw.message'] out into new columns, I'd also like to supply a list of column names for the data and have the new columns be created with those names.

Anyone know of an easy way to split a quoted CSV string in a dataframe column into new named columns within the dataframe?


Solution

  • Update

    Actually all you need is to pass your data into a csv-reader, which in turn is an appropriate data type for pandas.DataFrame:

    pd.DataFrame(csv.reader(dfMyData['_raw.message'], quotechar="'"), columns=columns)
    

    Previous answer

    We can try to convert the data into a csv and read them back with appropriate parameters:

    import csv
    from tempfile import TemporaryFile
    
    seq = dfMyData.iloc[:,1]      # column of interest in the original data
    columns = [*'ABCDEFGHIJKL']   # custom names of future data columns
    
    with TemporaryFile() as file:
    
        seq.to_csv(
            file, 
            sep='\N{unit separator}',
            header=False,
            index=False,
            quoting=csv.QUOTE_NONE
        )
    
        file.seek(0)    # read data from the start
    
        df = pd.read_csv(
            file, 
            header=None,
            names=columns,
            quotechar="\'"
        )
    
    print(df)
    

    Notes:

    Transformed data:

    output