I've imported a tab-delimited file with Pandas read_csv in a Jupyter notebook (Python 2), and I've extracted the single column of interest:
rawData = pd.read_csv(filename, delim_whitespace = True, header = 20)
columnOfInterest = rawData.ix[:, 9]
The format of my column of interest is like so:
header1=123;header2=123;header3=123
Not every row in this DataFrame has every header, and I don't know the full set of possible headers. The "123"s, my data values, are all numbers.
After splitting the elements in the column using ; as my delimiter, each row has as many columns as it has values, which is not uniform across the dataset (ragged). I want to convert this to a matrix with missing values.
What I would like to do is to take each row from my DataFrame, extract the header information, and if the header label is new (i.e. it isn't present in any of the rows already processed) then I'd like to add it to my list of column names. Of course, I'd like the header names and equal signs removed from the rows, and I would like my data to all be in its proper place (so, use the header info attached to each data value to place values in the proper columns). So, I'd like something that looks like this:
# Original data frame, first 2 rows
['header1=123', 'header2=123', 'header3=123'] # <--- no header4
['header1=123', 'header3=123', 'header4=123'] # <--- no header2
# New data frame, first 2 rows plus column names
header1 header2 header3 header4
123 123 123 null # <--- header4 == null
123 null 123 123 # <--- header2 == null
Obviously, this seems like a job for a regular expression! However, I'm at a loss as to how to go about this in Pandas. The missing data should be null.
If you have a DataFrame like
import pandas as pd
df = pd.DataFrame([['header1=123', 'header2=123', 'header3=123'], ['header1=123', 'header3=123', 'header4=123']])
then you can split each element on =, build one dictionary per row, and the pd.DataFrame constructor will take care of the rest, i.e.
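# Split every 'header=value' cell into a [header, value] pair
new = [[j.split('=') for j in i] for i in df.values]
# Build one {header: value} dict per row
di = [{k: v for k, v in i} for i in new]
# Headers missing from a row become NaN in the resulting DataFrame
new_df = pd.DataFrame(di)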
Output :
Dict :
[ {'header1': '123', 'header2': '123', 'header3': '123'}, {'header1': '123', 'header3': '123', 'header4': '123'}]
DataFrame:
  header1 header2 header3 header4
0     123     123     123     NaN
1     123     NaN     123     123
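If you are starting from the raw column of strings rather than from pre-split lists, the same idea can be sketched roughly as below. The sample Series and the parse_row helper are made up for illustration, and this assumes every piece between semicolons contains an = sign. Since the question says the values are all numbers, pd.to_numeric is applied at the end so that the columns are numeric rather than strings.

```python
import pandas as pd

# Hypothetical stand-in for columnOfInterest:
# one "header=value;header=value" string per row.
column = pd.Series([
    "header1=123;header2=123;header3=123",
    "header1=123;header3=123;header4=123",
])

def parse_row(text):
    """Turn 'a=1;b=2' into {'a': '1', 'b': '2'} (illustrative helper)."""
    # Skip empty pieces such as those produced by a trailing ';'.
    pairs = (item.split("=", 1) for item in text.split(";") if item)
    return {key: value for key, value in pairs}

# One dict per row; the DataFrame constructor takes the union of the keys
# as columns and fills the gaps with NaN.
records = [parse_row(text) for text in column]
wide = pd.DataFrame(records, index=column.index)

# The parsed values are still strings; convert each column to numbers.
wide = wide.apply(pd.to_numeric)
print(wide)
```

Because the columns contain NaN, the numeric columns will come out as floats rather than integers.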
Hope it helps