I have data like the following:
ID ID1 ID2
32336741 ["32361087"] ["36013040"]
32290433 ["32223150-32223653"] ["36003347-36003348"]
32299856 ["32361087","32299991","32223653"] ["36013040","36013029","36013040"]
In the DataFrame I'm trying to explode the comma-separated values into multiple rows. Code:
fulldf = (df
    .withColumn('ID1', F.explode(F.split('ID1', '-')))
    .withColumn('ID1', F.regexp_replace('ID1', r'\[|\]|"', ''))
)
fulldf = fulldf.dropna()
fulldf.display()
Result:
ID ID1
32336741 36013040
32290433 36003347
32290433 36003348
32290825 36013045
32290825 36013046
32290825 36013338
But when I add column ID2 with a second explode in the same chain, it gives me duplicated records.
Expected output:
ID ID1 ID2
32336741 32361087 36013040
32290433 32223150 36003347
32290433 32223653 36003348
32290825 32361087 36013045
32290825 32299991 36013046
32290825 32223653 36013338
Instead of splitting and cleaning, just extract the numbers, if what you show in the example is representative.
>>> df = spark.createDataFrame([
(32336741,'["32361087"]','["36013040"]'),
(32290433,'["32223150-32223653"]','["36003347-36003348"]'),
(32299856,'["32361087","32299991","32223653"]','["36013040","36013029","36013040"]'),
], schema='id: int, id1: string, id2: string')
>>> df.select(F.regexp_extract_all(df.id1, F.lit(r'(\d+)'))).show(truncate=False)
+---------------------------------+
|regexp_extract_all(id1, (\d+), 1)|
+---------------------------------+
|[32361087] |
|[32223150, 32223653] |
|[32361087, 32299991, 32223653] |
+---------------------------------+
>>> df.select(F.explode(F.regexp_extract_all(df.id1, F.lit(r'(\d+)'))).alias('id1_split')).show()
+---------+
|id1_split|
+---------+
| 32361087|
| 32223150|
| 32223653|
| 32361087|
| 32299991|
| 32223653|
+---------+
Then, depending on what you want, you just need to join/cross-join the original df with this exploded df to get the result.