pythonpysparkdatabricks

How can I group these related rows using PySpark?


I need to parse some text data in Python PySpark on Databricks. The data look like this:

df = spark.createDataFrame([("new entry", 1, 123), 
    ("acct", 2, None), 
    ("cust ID", 3, None),
    ("new entry", 4, 456),
    ("acct", 5, None),
    ("more text", 6, None),
    ("cust ID", 7, None)], 
    ("value", "line num", "tracking ID"))

Here I manually added the "need grouping" column to illustrate - rows from "new entry" to "cust ID" are one group, followed by another. They are not all the same length.

enter image description here

I need to match up cust ID with the tracking ID a few lines before, so something like this:

enter image description here

How can I match cust ID with tracking ID? I thought of a window function but I'm not sure how to create the needed grouping.


Solution

  • To resolve your issue usewindow.orderBy function along with last() to fill the forward values.

    code:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, last
    
    df = spark.createDataFrame([
    ("new entry", 1, 123),
    ("acct", 2, None),
    ("cust ID", 3, None),
    ("new entry", 4, 456),
    ("acct", 5, None),
    ("more text", 6, None),
    ("cust ID", 7, None)
    ], ["value", "line num", "tracking ID"])
    
    # use window function
    window_fun = Window.orderBy("line num")
    df_filled = df.withColumn("tracking_ID_fil", last("tracking ID", True).over(window_fun))
    display(df_filled)
    
    # Use Filter
    res1 = df_filled.filter(col("value") == "cust ID").select("value", "tracking_ID_fil")
    
    # Rename column values
    dff1 = res1.withColumnRenamed("tracking_ID_fil", "tracking ID")
    display(dff1)
    

    output: enter image description here