I need to parse some text data in Python PySpark on Databricks. The data look like this:
df = spark.createDataFrame([("new entry", 1, 123),
("acct", 2, None),
("cust ID", 3, None),
("new entry", 4, 456),
("acct", 5, None),
("more text", 6, None),
("cust ID", 7, None)],
("value", "line num", "tracking ID"))
Here I manually added the "need grouping" column to illustrate - rows from "new entry" to "cust ID" are one group, followed by another. They are not all the same length.
I need to match up cust ID with the tracking ID a few lines before, so something like this:
How can I match cust ID with tracking ID? I thought of a window function but I'm not sure how to create the needed grouping.
To resolve your issue usewindow.orderBy
function along with last()
to fill the forward values.
code:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, last
df = spark.createDataFrame([
("new entry", 1, 123),
("acct", 2, None),
("cust ID", 3, None),
("new entry", 4, 456),
("acct", 5, None),
("more text", 6, None),
("cust ID", 7, None)
], ["value", "line num", "tracking ID"])
# use window function
window_fun = Window.orderBy("line num")
df_filled = df.withColumn("tracking_ID_fil", last("tracking ID", True).over(window_fun))
display(df_filled)
# Use Filter
res1 = df_filled.filter(col("value") == "cust ID").select("value", "tracking_ID_fil")
# Rename column values
dff1 = res1.withColumnRenamed("tracking_ID_fil", "tracking ID")
display(dff1)
output: