I have a dataframe df with a single column (Column_1) containing the following 5 records:
Column_1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
I want to split each value on the '|break|' separator and insert the part after the separator as a new record immediately following the original one, in the same column, so the dataframe ends up with 10 records in sequence:
Column_1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
If you want this result with PySpark:
+------------------------------------------------------------------+
|Column_1 |
+------------------------------------------------------------------+
|0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW |
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW |
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW |
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|
|006||0000099598|000000|0000099598-1 |
+------------------------------------------------------------------+
you can try this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

# create or reuse a SparkSession (in spark-shell / notebooks, `spark` already exists)
spark = SparkSession.builder.getOrCreate()

# input data
data = [
    ("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
]
df = spark.createDataFrame(data, ["Column_1"])

# split takes a Java regex, so the pipes in the separator must be escaped;
# each row becomes an array of two strings
df_split = df.withColumn("Split_Column", split(df["Column_1"], r"\|break\|"))

# explode turns each array element into its own row, doubling the record count
# while keeping the fragments of each record in their original order
df_exploded = df_split.withColumn("Column_1", explode("Split_Column")).select("Column_1")

df_exploded.show(10, False)
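If your df is a plain pandas DataFrame rather than a Spark one, the same split-then-explode idea applies. A minimal sketch, assuming only the Column_1 name from the question (the variable names are illustrative):

import pandas as pd

# two of the records from the question, for illustration
df = pd.DataFrame({"Column_1": [
    "0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
    "0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
]})

# split on the literal separator (regex=False, so no escaping is needed),
# then explode each list into consecutive rows and renumber the index
out = (
    df["Column_1"]
    .str.split("|break|", regex=False)
    .explode()
    .reset_index(drop=True)
    .to_frame()
)
print(out)

Both versions use the same two-step pattern: turn each string into a list of fragments, then flatten the lists into rows, which keeps each record's fragments adjacent and in order.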