I have a CSV file and it's pretty messy. The first column is fine, but all the rest of the data is in the second column. All data as VariableName1=Variable1, VariableName2=Variable2, VariableName3=Variable3
, ... are in the second column.
<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre> var1 var2 \
1 SfgvbdvbUJ05-1 var3=10,var4=/a/n/anghelo_rujo_edited-...
2 OLBCANGR15 var3=10,var4=/c/a/cangrande_test.jpg,a...
3 ZAMdvFIA19 var3=10,var4=/p/i/pierluigi_zampaglion...
4 VINMUL18 var3=10,var4=/r/u/rudi_vindimian_mulle...
5 PRACLA16 var3=10,var4=/p/r/pracla16_podere_prad...
.. ... ...
175 WALLIM var3=25,var4=/w/a/walcher_limoncello_w...
239 SMROS20 var3=10,var4=/s/e/sella_e_mosca_rosato...
288 SAELAMB19 var3=10,var6=Modena,bottleml=750,box_size=1...
343 DILABB var3=40,var4=/d/i/dilabb_distillerie_l...
357 VANER19 var3=10,var4=/v/a/valdibella_kerasos_v...
var4 ... var9 var10 var11
1 NaN ... NaN NaN NaN
2 NaN ... NaN NaN NaN
3 NaN ... NaN NaN NaN
4 NaN ... NaN NaN NaN
5 NaN ... NaN NaN NaN
.. ... ... ... ... ...
175 NaN ... NaN NaN NaN
239 NaN ... NaN NaN NaN
288 NaN ... NaN NaN NaN
343 NaN ... NaN NaN NaN
357 NaN ... NaN NaN NaN
</pre>
</div>
I took the second column as separate new data and split it with ,
. But I can't separate the VariableName1=Variable1
data into VariableName
columns.
When I do it with String Contains, I get stuck on the =...
part.
Please, help me. I'm in trouble with this CSV. What I want is to have that value under each column name.
var1 var2 var3 var4
ZAMffFIA19 10 2 /a/n/anghelo_rujo_edited...
VINMUfgvL18 25 1 /r/u/rudi_vindimian_mulle...
Assuming you have a file like so:
123 A=2,B=asdjhf,C=jhdkfhskdf,D=1254
54878754 A=45786,D=asgfd,C=1234
and your file is not huge, you can append to the dataframe iteratively:
df = pd.DataFrame(columns=["sku", "A", "B", "C", "D"])
with open("data_mangled.csv") as f:
for line in f:
d = {}
col1, col2 = line.split()
d["sku"] = col1
cols = col2.split(",")
for item in cols:
k,v = item.split("=")
d[k] = v
for col in df.columns: # add potentially missing columns as None
if col not in d:
d[col] = None
df = df.append(d, ignore_index=True)
print(df)
This would also deal with a situation where some of the column names are missing in the second place or are switched.
Output:
sku A B C D
0 123 2 asdjhf jhdkfhskdf 1254
1 54878754 45786 None 1234 asgfd
EDIT: For your specific data:
with open("data_real.txt") as f:
# use the first line as column names in the dataframe
col_names = f.readline()
df = pd.DataFrame(columns=col_names.split(","))
print(col_names)
for line in f:
d = {}
# lines have more than 2 columns, but the trailing values are empty
# so the format is col1,large_col2,,,,,,,
col1, *col2 = line.split(",")
d["sku"] = col1
for item in col2:
try:
if item.strip(): # disregard the empty trailing columns
k,v = item.split("=")
# we split on comma, so have occasional "
k = k.strip('"')
v = v.strip('"')
d[k] = v
except ValueError as e:
# there is a column value with missing key
print("Could not assign to column:", d["sku"], item)
for col in df.columns:
if col not in d:
d[col] = None
df = df.append(d, ignore_index=True)
print(df)
df.to_csv("data_parsed.csv") # save
One of the columns was not in the key=value format: Could not assign to column: PRACLA16 16 months on less
Note: newer Python versions will complain that append
is deprecated, I chose to ignore this here, can be solved by converting the dict to a dataframe and joining the two dataframes.