python pandas dataframe pyspark

Break DataFrame values and add them as new records


I have a DataFrame df with a single column (Column_1) containing the following 5 records:

Column_1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1

I want to split each value on the '|break|' separator and insert the part after it as a new record in the same column, directly following the original row, making 10 records in sequential order:

Column_1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1

Solution

  • If you want this result with PySpark:

    +------------------------------------------------------------------+
    |Column_1                                                          |
    +------------------------------------------------------------------+
    |0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW   |
    |006||0000099598|000000|0000099598-1                               |
    |0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW   |
    |006||0000099598|000000|0000099598-1                               |
    |0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|
    |006||0000099598|000000|0000099598-1                               |
    |0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW |
    |006||0000099598|000000|0000099598-1                               |
    |0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|
    |006||0000099598|000000|0000099598-1                               |
    +------------------------------------------------------------------+
    

    you can try this:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode
    
    # a SparkSession named `spark` already exists in the pyspark shell;
    # in a standalone script, build one explicitly
    spark = SparkSession.builder.getOrCreate()
    
    # input data
    data = [
        ("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
        ("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
        ("0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
        ("0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
        ("0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",)
    ]
    
    df = spark.createDataFrame(data, ["Column_1"])
    
    # split() takes a regular expression, so the literal pipes around
    # "break" have to be escaped
    df_split = df.withColumn("Split_Column", split(df["Column_1"], r"\|break\|"))
    
    # explode() turns each element of the split array into its own row
    df_exploded = df_split.withColumn("Column_1", explode("Split_Column")).select("Column_1")
    
    # show all 10 rows; truncate=False keeps the full strings visible
    df_exploded.show(10, False)
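
    Since the question is also tagged pandas, here is a minimal sketch of the
    same split-and-explode idea in plain pandas. It assumes pandas >= 1.4 (for
    the regex=False keyword of str.split), and the frame below is just a
    two-row sample mirroring the question's data:

    import pandas as pd
    
    # sample single-column input, abbreviated to two rows
    df = pd.DataFrame({
        "Column_1": [
            "0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
            "0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
        ]
    })
    
    # regex=False makes "|break|" a literal separator rather than a regex,
    # so no escaping is needed; explode() then puts each piece on its own row
    df_exploded = (
        df["Column_1"]
        .str.split("|break|", regex=False)
        .explode()
        .reset_index(drop=True)
        .to_frame()
    )
    
    print(df_exploded)

    In pandas, explode keeps the pieces in order, so each "006|..." tail lands
    directly under the row it was split from, matching the layout shown above.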