pythonpandassplitconditional-statements

Why is my column split not working correctly?


Why is my python3 code to split column 4 at the bases of "," to the column 7, 8, 9 with condition if the column 5 value is "LUB" not working?

source data: source data screenshot

20240801    CASH MAN1   120.00  Z20/1   0.5 , Z20/1 ; 120   1   LUB
20240801    PAYTM   15720.00    CASH MAN1   0   1   JRNLCR
20240801    PAYTM   81343.00    CASH MAN2   0   2   JRNLCR

my code:

import numpy as np
import pandas as pd

var001_sf = input("Enter file name to import : ")
var002_sf = var001_sf + ".csv"
var002_ar1 = pd.read_csv(var002_sf, sep='\t', header=None)
var002_ar1 = var002_ar1.dropna(axis=0)  # remove rows with na value
var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]] = var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]].map(str)
var002_ar1.loc[var002_ar1.iloc[:, 6] == 'LUB', [7, 8, 9]] = var002_ar1.iloc[:, 4].str.split(', ', n=2, expand=True)
var002_ar1.to_csv("t250221.csv", sep='\t')

output data: output data screenshot

20240801    CASH MAN1   120.0   Z20/1   0.5 , Z20/1 ; 120   1   LUB         
1   20240801    PAYTM   15720.0 CASH MAN1   0   1   JRNLCR          
2   20240801    PAYTM   81343.0 CASH MAN2   0   2   JRNLCR          

I can't get it to split "0.5 , Z20/1 ; 120" in column 7, 8, 9

What should I do for the desired result?


Solution

  • Input data and code issues

    There's two problems with your example.

    1. Your input has an issue with it. In column 4, row 1: "0.5 , Z20/1 ; 120" the second delimiter is a semicolon (";") instead of a comma (",").
      If your raw data has these issues, then you may want to clean or preprocess it initially; for instance, converting semicolons to commas.

    2. Your split dataframe has columns [0, 1, 2]. You'll want to modify the columns to match your output of [7, 8, 9].
      You can do this with the set_axis method. As in:

      splitDf = var002_ar1.iloc[:, 4].str.split(', ', n=2, expand=True).set_axis([7, 8, 9], axis=1)
      

      Now you can set it as you did before:

      var002_ar1.loc[var002_ar1.iloc[:, 6] == 'LUB', [7, 8, 9]] = splitDf
      

      Note: with your input issue in 1., set_axis will error as the semicolon will result in the split dataframe having only two columns instead of three (Length mismatch). You will have to fix that first.

    Fixed results

    Putting the new input and new code all together:

    input:

    20240801    CASH MAN1   120.00  Z20/1   0.5 , Z20/1 , 120   1   LUB
    20240801    PAYTM   15720.00    CASH MAN1   0   1   JRNLCR
    20240801    PAYTM   81343.00    CASH MAN2   0   2   JRNLCR
    

    code:

    import pandas as pd
    
    var001_sf = input("Enter file name to import : ")
    var002_sf = var001_sf + ".csv"
    var002_ar1 = pd.read_csv(var002_sf, sep='\t', header=None)
    
    var002_ar1 = var002_ar1.dropna(axis=0)  # remove rows with na value
    var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]] = var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]].map(str)
    
    splitDf = var002_ar1.iloc[:, 4].str.split(', ', n=2, expand=True).set_axis([7, 8, 9], axis=1)
    var002_ar1.loc[var002_ar1.iloc[:, 6] == 'LUB', [7, 8, 9]] = splitDf
    
    var002_ar1.to_csv("t250221.csv", sep='\t')
    

    output:

        0   1   2   3   4   5   6   7   8   9
    0   20240801    CASH MAN1   120.0   Z20/1   0.5 , Z20/1 , 120   1   LUB 0.5     Z20/1   120
    1   20240801    PAYTM   15720.0 CASH MAN1   0   1   JRNLCR          
    2   20240801    PAYTM   81343.0 CASH MAN2   0   2   JRNLCR