I'm trying to figure out how to put a specific value between two flag values, for example:
there is
id | date | flg |
---|---|---|
123 | 01.01.2024 | 1 |
432 | 02.01.2024 | |
5234 | 03.01.2024 | |
12342 | 06.01.2024 | 1 |
1256 | 07.01.2024 | |
246 | 08.01.2024 | |
846 | 09.01.2024 | |
834 | 15.01.2024 | 1 |
343 | 11.01.2024 | |
364 | 12.01.2024 |
But it should be:
id | date | flg | hash |
---|---|---|---|
123 | 01.01.2024 | 1 | afaef |
432 | 02.01.2024 | afaef | |
5234 | 03.01.2024 | afaef | |
12342 | 06.01.2024 | 1 | jdrjd |
1256 | 07.01.2024 | jdrjd | |
246 | 08.01.2024 | jdrjd | |
846 | 09.01.2024 | jdrjd | |
834 | 15.01.2024 | 1 | qwwa |
343 | 11.01.2024 | qwwa | |
364 | 12.01.2024 | qwwa |
Desc: We define a range of lines between the first flag=1(inclusive) and the next one and create a new attribute with the value hash from id where the first flag is
You can use udf()
, hashlib, and Window with rowsBetween()
:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, last, udf
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
import hashlib
def _hash(df):
sha = lambda id: hashlib.sha256(str(id).encode('utf-8')).hexdigest()[:5]
U = udf(sha, StringType())
df = df.withColumn("hash", when(col("flg") == 1, U(col("id"))))
w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
df = df.withColumn("hash", last("hash", True).over(w))
df.show()
A = [
(123, "01.01.2024", 1),
(432, "02.01.2024", None),
(5234, "03.01.2024", None),
(12342, "06.01.2024", 1),
(1256, "07.01.2024", None),
(246, "08.01.2024", None),
(846, "09.01.2024", None),
(834, "15.01.2024", 1),
(343, "11.01.2024", None),
(364, "12.01.2024", None)
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(A, ["id", "date", "flg"])
_hash(df)
+-----+----------+----+-----+
| id| date| flg| hash|
+-----+----------+----+-----+
| 123|01.01.2024| 1|a665a|
| 432|02.01.2024|NULL|a665a|
| 5234|03.01.2024|NULL|a665a|
|12342|06.01.2024| 1|307fd|
| 1256|07.01.2024|NULL|307fd|
| 246|08.01.2024|NULL|307fd|
| 846|09.01.2024|NULL|307fd|
| 343|11.01.2024|NULL|307fd|
| 364|12.01.2024|NULL|307fd|
| 834|15.01.2024| 1|5c344|
+-----+----------+----+-----+