pythonpandasnumpycsvsplit

Split columns in Csv file


I have a CSV file and it's pretty messy. The first column is fine, but all the rest of the data is in the second column. All data as VariableName1=Variable1, VariableName2=Variable2, VariableName3=Variable3, ... are in the second column.

<div class="jp-RenderedText jp-OutputArea-output" data-mime-type="text/plain">
<pre>              var1                                            var2  \
1    SfgvbdvbUJ05-1  var3=10,var4=/a/n/anghelo_rujo_edited-...   
2      OLBCANGR15  var3=10,var4=/c/a/cangrande_test.jpg,a...   
3        ZAMdvFIA19  var3=10,var4=/p/i/pierluigi_zampaglion...   
4        VINMUL18  var3=10,var4=/r/u/rudi_vindimian_mulle...   
5        PRACLA16  var3=10,var4=/p/r/pracla16_podere_prad...   
..            ...                                                ...   
175        WALLIM  var3=25,var4=/w/a/walcher_limoncello_w...   
239       SMROS20  var3=10,var4=/s/e/sella_e_mosca_rosato...   
288     SAELAMB19  var3=10,var6=Modena,bottleml=750,box_size=1...   
343        DILABB  var3=40,var4=/d/i/dilabb_distillerie_l...   
357       VANER19  var3=10,var4=/v/a/valdibella_kerasos_v...   

     var4  ...  var9  var10  var11  
1          NaN  ...   NaN           NaN            NaN  
2          NaN  ...   NaN           NaN            NaN  
3          NaN  ...   NaN           NaN            NaN  
4          NaN  ...   NaN           NaN            NaN  
5          NaN  ...   NaN           NaN            NaN  
..         ...  ...   ...           ...            ...  
175        NaN  ...   NaN           NaN            NaN  
239        NaN  ...   NaN           NaN            NaN  
288        NaN  ...   NaN           NaN            NaN  
343        NaN  ...   NaN           NaN            NaN  
357        NaN  ...   NaN           NaN            NaN  


</pre>
</div>

I took the second column as separate new data and split it with ,. But I can't separate the VariableName1=Variable1 data into VariableName columns.

When I do it with String Contains, I get stuck on the =... part.

Please, help me. I'm in trouble with this CSV. What I want is to have that value under each column name.

var1          var2         var3         var4
ZAMffFIA19     10           2         /a/n/anghelo_rujo_edited...
VINMUfgvL18    25           1         /r/u/rudi_vindimian_mulle...

Solution

  • Assuming you have a file like so:

    123     A=2,B=asdjhf,C=jhdkfhskdf,D=1254
    54878754    A=45786,D=asgfd,C=1234
    

    and your file is not huge, you can append to the dataframe iteratively:

    df = pd.DataFrame(columns=["sku", "A", "B", "C", "D"])
    
    with open("data_mangled.csv") as f:
        for line in f:
            d = {}
            col1, col2 = line.split()
            d["sku"] = col1
            cols = col2.split(",")
            for item in cols:
                k,v = item.split("=")
                d[k] = v
            for col in df.columns:   # add potentially missing columns as None
                if col not in d:
                    d[col] = None
            df = df.append(d, ignore_index=True)
    print(df)
    

    This would also deal with a situation where some of the column names are missing in the second place or are switched.

    Output:

            sku      A       B           C      D
    0       123      2  asdjhf  jhdkfhskdf   1254
    1  54878754  45786    None        1234  asgfd
    

    EDIT: For your specific data:

    with open("data_real.txt") as f:
        # use the first line as column names in the dataframe
        col_names = f.readline()
        df = pd.DataFrame(columns=col_names.split(","))
        print(col_names)
    
        for line in f:
            d = {}
            # lines have more than 2 columns, but the trailing values are empty
            # so the format is col1,large_col2,,,,,,,
            col1, *col2 = line.split(",")
            d["sku"] = col1
            for item in col2:
                try:
                    if item.strip(): # disregard the empty trailing columns
                        k,v = item.split("=")
                        # we split on comma, so have occasional "
                        k = k.strip('"') 
                        v = v.strip('"')
                        d[k] = v
                except ValueError as e:
                    # there is a column value with missing key
                    print("Could not assign to column:", d["sku"], item)
            for col in df.columns:
                if col not in d:
                    d[col] = None
            df = df.append(d, ignore_index=True)
    
        print(df)
        df.to_csv("data_parsed.csv") # save
    

    One of the columns was not in the key=value format: Could not assign to column: PRACLA16 16 months on less

    Note: newer Python versions will complain that append is deprecated, I chose to ignore this here, can be solved by converting the dict to a dataframe and joining the two dataframes.