Why is my python3 code to split column 4 at the bases of "," to the column 7, 8, 9 with condition if the column 5 value is "LUB" not working?
source data:
20240801 CASH MAN1 120.00 Z20/1 0.5 , Z20/1 ; 120 1 LUB
20240801 PAYTM 15720.00 CASH MAN1 0 1 JRNLCR
20240801 PAYTM 81343.00 CASH MAN2 0 2 JRNLCR
my code:
import numpy as np
import pandas as pd
var001_sf = input("Enter file name to import : ")
var002_sf = var001_sf + ".csv"
var002_ar1 = pd.read_csv(var002_sf, sep='\t', header=None)
var002_ar1 = var002_ar1.dropna(axis=0) # remove rows with na value
var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]] = var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]].map(str)
var002_ar1.loc[var002_ar1.iloc[:, 6] == 'LUB', [7, 8, 9]] = var002_ar1.iloc[:, 4].str.split(', ', n=2, expand=True)
var002_ar1.to_csv("t250221.csv", sep='\t')
output data:
20240801 CASH MAN1 120.0 Z20/1 0.5 , Z20/1 ; 120 1 LUB
1 20240801 PAYTM 15720.0 CASH MAN1 0 1 JRNLCR
2 20240801 PAYTM 81343.0 CASH MAN2 0 2 JRNLCR
I can't get it to split "0.5 , Z20/1 ; 120" in column 7, 8, 9
What should I do for the desired result?
There's two problems with your example.
Your input has an issue with it. In column 4, row 1: "0.5 , Z20/1 ; 120"
the second delimiter is a semicolon (";") instead of a comma (",").
If your raw data has these issues, then you may want to clean or
preprocess it initially; for instance, converting semicolons to
commas.
Your split dataframe has columns [0, 1, 2]
. You'll want to modify the columns to match your output of [7, 8, 9]
.
You can do this with the set_axis
method. As in:
splitDf = var002_ar1.iloc[:, 4].str.split(', ', n=2, expand=True).set_axis([7, 8, 9], axis=1)
Now you can set it as you did before:
var002_ar1.loc[var002_ar1.iloc[:, 6] == 'LUB', [7, 8, 9]] = splitDf
Note: with your input issue in 1., set_axis
will error as the semicolon will result in the split dataframe having only two columns instead of three (Length mismatch
). You will have to fix that first.
Putting the new input and new code all together:
input:
20240801 CASH MAN1 120.00 Z20/1 0.5 , Z20/1 , 120 1 LUB
20240801 PAYTM 15720.00 CASH MAN1 0 1 JRNLCR
20240801 PAYTM 81343.00 CASH MAN2 0 2 JRNLCR
code:
import pandas as pd
var001_sf = input("Enter file name to import : ")
var002_sf = var001_sf + ".csv"
var002_ar1 = pd.read_csv(var002_sf, sep='\t', header=None)
var002_ar1 = var002_ar1.dropna(axis=0) # remove rows with na value
var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]] = var002_ar1.iloc[:, [0, 1, 2, 3, 4, 5, 6]].map(str)
splitDf = var002_ar1.iloc[:, 4].str.split(', ', n=2, expand=True).set_axis([7, 8, 9], axis=1)
var002_ar1.loc[var002_ar1.iloc[:, 6] == 'LUB', [7, 8, 9]] = splitDf
var002_ar1.to_csv("t250221.csv", sep='\t')
output:
0 1 2 3 4 5 6 7 8 9
0 20240801 CASH MAN1 120.0 Z20/1 0.5 , Z20/1 , 120 1 LUB 0.5 Z20/1 120
1 20240801 PAYTM 15720.0 CASH MAN1 0 1 JRNLCR
2 20240801 PAYTM 81343.0 CASH MAN2 0 2 JRNLCR