I need to parse some text data in Python PySpark on Databricks. The data look like this:
df = spark.createDataFrame([("new entry", 1, 123),
("acct", 2, None),
("cust ID", 3, None),
("new entry", 4, 456),
("acct", 5, None),
("more text", 6, None),
("cust ID", 7, None)],
("value", "line num", "tracking ID"))
Here I manually added the "need grouping" column to illustrate - rows from "new entry" to "cust ID" are one group, followed by another. They are not all the same length.
I need to match up cust ID with the tracking ID a few lines before, so something like this:
How can I match cust ID with tracking ID? I thought of a window function but I'm not sure how to create the needed grouping.
To resolve your issue, use a window built with Window.orderBy together with last() to forward-fill the values.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, last
df = spark.createDataFrame([
("new entry", 1, 123),
("acct", 2, None),
("cust ID", 3, None),
("new entry", 4, 456),
("acct", 5, None),
("more text", 6, None),
("cust ID", 7, None)
], ["value", "line num", "tracking ID"])
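# Order all rows by line number; with orderBy, the default frame runs from the first row to the current row.
# Note: without partitionBy, Spark moves all rows into a single partition.
window_fun = Window.orderBy("line num")
# last(..., ignorenulls=True) returns the most recent non-null tracking ID, i.e. a forward fill
df_filled = df.withColumn("tracking_ID_fil", last("tracking ID", ignorenulls=True).over(window_fun))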
# Keep only the "cust ID" rows, each now carrying the forward-filled tracking ID
res1 = df_filled.filter(col("value") == "cust ID").select("value", "tracking_ID_fil")
# Rename the filled column back to "tracking ID"
dff1 = res1.withColumnRenamed("tracking_ID_fil", "tracking ID")