Tags: python, dataframe, pyspark

How to explode comma-separated values in a DataFrame using PySpark


I have data like the following:

ID               ID1                                  ID2
32336741        ["32361087"]                        ["36013040"]
32290433        ["32223150-32223653"]               ["36003347-36003348"]
32299856        ["32361087","32299991","32223653"]  ["36013040","36013029","36013040"]

In the DataFrame I'm trying to explode the comma-separated values into multiple rows. Code:

from pyspark.sql import functions as F

fulldf = (df
    .withColumn('ID1', F.explode(F.split('ID1', '-')))           # split on '-' and explode into rows
    .withColumn('ID1', F.regexp_replace('ID1', r'[\[\]"]', ''))  # strip brackets and quotes
)
fulldf = fulldf.dropna()
fulldf.display()

Result:

ID              ID1
32336741        36013040
32290433        36003347
32290433        36003348
32290825        36013045
32290825        36013046
32290825        36013338

But when I add column ID2 to the same chain, it gives me duplicated records, roughly double the rows, instead of pairing the values up.
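Roughly what I tried (a reconstructed sketch; my exact code may have differed slightly):

# Reconstructed attempt: exploding ID1 and ID2 independently.
# Each explode multiplies the rows, so two chained explodes yield
# a cross product (len(ID1) * len(ID2) rows per input row), not pairs.
fulldf = (df
    .withColumn('ID1', F.explode(F.split('ID1', '-')))
    .withColumn('ID1', F.regexp_replace('ID1', r'[\[\]"]', ''))
    .withColumn('ID2', F.explode(F.split('ID2', '-')))
    .withColumn('ID2', F.regexp_replace('ID2', r'[\[\]"]', ''))
)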

Expected output:

ID                ID1       ID2
32336741        32361087  36013040
32290433        32223150  36003347
32290433        32223653  36003348
32290825        32361087  36013045
32290825        32299991  36013046
32290825        32223653  36013338

Solution

  • Instead of splitting and stripping brackets, just extract the numbers with a regex, assuming your real data is as consistent as the example shows.

    >>> df = spark.createDataFrame([
            (32336741,'["32361087"]','["36013040"]'),
            (32290433,'["32223150-32223653"]','["36003347-36003348"]'),
            (32299856,'["32361087","32299991","32223653"]','["36013040","36013029","36013040"]'),
        ], schema='id: int, id1: string, id2: string')
    
    
    >>> df.select(F.regexp_extract_all(df.id1, F.lit(r'(\d+)'))).show(truncate=False)
    +---------------------------------+
    |regexp_extract_all(id1, (\d+), 1)|
    +---------------------------------+
    |[32361087]                       |
    |[32223150, 32223653]             |
    |[32361087, 32299991, 32223653]   |
    +---------------------------------+
    
    >>> df.select(F.explode(F.regexp_extract_all(df.id1, F.lit(r'(\d+)'))).alias('id1_split')).show()
    +---------+
    |id1_split|
    +---------+
    | 32361087|
    | 32223150|
    | 32223653|
    | 32361087|
    | 32299991|
    | 32223653|
    +---------+
    
    

    Then, depending on the result you need, you can join or cross-join the original df with this exploded df. For the pairwise rows in your expected output, you can skip the join entirely and zip the two extracted arrays before a single explode, as sketched below.
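
    A minimal sketch of that pairwise approach, assuming both columns always hold the same number of values (as in your sample). arrays_zip pairs element i of one array with element i of the other, so a single explode keeps the values aligned instead of cross-multiplying them:

    from pyspark.sql import functions as F

    result = (df
        .withColumn('id1_arr', F.regexp_extract_all(df.id1, F.lit(r'(\d+)')))
        .withColumn('id2_arr', F.regexp_extract_all(df.id2, F.lit(r'(\d+)')))
        # arrays_zip builds an array of structs pairing id1_arr[i] with id2_arr[i]
        .withColumn('pair', F.explode(F.arrays_zip('id1_arr', 'id2_arr')))
        .select('id',
                F.col('pair.id1_arr').alias('id1'),
                F.col('pair.id2_arr').alias('id2'))
    )
    result.show()

    Note that arrays_zip pads the shorter array with nulls if the lengths ever differ, so filter out null ids afterwards if your real data is less regular than the sample.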