apache-sparkpysparkapache-spark-sql

Spark: fill spec value between flag values


I'm trying to figure out how to put a specific value between two flag values, for example:

there is

id date flg
123 01.01.2024 1
432 02.01.2024
5234 03.01.2024
12342 06.01.2024 1
1256 07.01.2024
246 08.01.2024
846 09.01.2024
834 15.01.2024 1
343 11.01.2024
364 12.01.2024

But it should be:

id date flg hash
123 01.01.2024 1 afaef
432 02.01.2024 afaef
5234 03.01.2024 afaef
12342 06.01.2024 1 jdrjd
1256 07.01.2024 jdrjd
246 08.01.2024 jdrjd
846 09.01.2024 jdrjd
834 15.01.2024 1 qwwa
343 11.01.2024 qwwa
364 12.01.2024 qwwa

Desc: We define a range of lines between the first flag=1(inclusive) and the next one and create a new attribute with the value hash from id where the first flag is


Solution

  • You can use udf(), hashlib, and Window with rowsBetween():

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, when, last, udf
    from pyspark.sql.window import Window
    from pyspark.sql.types import StringType
    import hashlib
    
    
    def _hash(df):
        sha = lambda id: hashlib.sha256(str(id).encode('utf-8')).hexdigest()[:5]
        U = udf(sha, StringType())
        df = df.withColumn("hash", when(col("flg") == 1, U(col("id"))))
        w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
        df = df.withColumn("hash", last("hash", True).over(w))
        df.show()
    
    
    A = [
        (123, "01.01.2024", 1),
        (432, "02.01.2024", None),
        (5234, "03.01.2024", None),
        (12342, "06.01.2024", 1),
        (1256, "07.01.2024", None),
        (246, "08.01.2024", None),
        (846, "09.01.2024", None),
        (834, "15.01.2024", 1),
        (343, "11.01.2024", None),
        (364, "12.01.2024", None)
    ]
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(A, ["id", "date", "flg"])
    _hash(df)
    
    

    Prints

    +-----+----------+----+-----+
    |   id|      date| flg| hash|
    +-----+----------+----+-----+
    |  123|01.01.2024|   1|a665a|
    |  432|02.01.2024|NULL|a665a|
    | 5234|03.01.2024|NULL|a665a|
    |12342|06.01.2024|   1|307fd|
    | 1256|07.01.2024|NULL|307fd|
    |  246|08.01.2024|NULL|307fd|
    |  846|09.01.2024|NULL|307fd|
    |  343|11.01.2024|NULL|307fd|
    |  364|12.01.2024|NULL|307fd|
    |  834|15.01.2024|   1|5c344|
    +-----+----------+----+-----+