pythonpandaspython-2.7csvheader-row

Extract custom header column names from DataFrame, impute missing columns with NA


I've imported a tab-delimited file with Pandas read_csv in Jupyter notebook (Python 2), and I've extracted the single column of interest:

rawData = pd.read_csv(filename, delim_whitespace = True, header = 20)
columnOfInterest = rawData.ix[:, 9] 

The format of my column of interest is like so:

header1=123;header2=123;header3=123

Not every row in this DataFrame has every header, and I don't know the full set of possible headers. The "123"s, my data values, are all numbers.

After splitting the elements in the column using ; as my delimiter, all of my rows have a number of columns equal to the number of values in the row, which is not uniform across the dataset (ragged). I want convert this to a matrix with missing values.

What I would like to do is to take each row from my DataFrame, extract the header information, and if the header label is new (i.e. it isn't present in any of the rows already processed) then I'd like to add it to my list of column names. Of course, I'd like the header names and equal signs removed from the rows, and I would like my data to all be in its proper place (so, use the header info attached to each data value to place values in the proper columns). So, I'd like something that looks like this:

# Original data frame, first 2 rows
['header1=123', 'header2=123', 'header3=123'] # <--- no header4
['header1=123', 'header3=123', 'header4=123'] # <--- no header2

# New data frame, first 2 rows plus column names
header1    header2    header3    header4 
123        123        123        null    # <--- header4 == null
123        null       123        123     # <--- header2 == null

Obviously, this seems like a job for a regular expression! However, I'm at a loss as to how to go about this in Pandas. The missing data should be null.


Solution

  • If you have dataframe like

    df = pd.DataFrame([['header1=123', 'header2=123', 'header3=123'],['header1=123', 'header3=123', 'header4=123']])
    

    Then, you can split the data by = and then create a dictionary and pd.DataFrame constructor will take care of the rest i.e

    new = [[j.split('=') for j in i] for i in df.values ]
    
    di=[{k:j for k,j in i} for i in new]
    
    new_df = pd.DataFrame(di)
    

    Output :

    Dict :

    [ {'header1': '123', 'header2': '123', 'header3': '123'},
     {'header1': '123', 'header3': '123', 'header4': '123'}]
    
    

    DataFrame:

      header1 header2 header3 header4
    0     123     123     123     NaN
    1     123     NaN     123     123
    

    Hope it helps