I have a PySpark dataframe (columns A, B, and C) with a column 'Actual' holding the values my code produces and a column 'Expecting' holding the values I want.
My code:
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col

W = Window.partitionBy("A").orderBy(col("C"))
main_df = main_df.withColumn("cnt", F.count("B").over(W))
Is there something wrong with how I've used the count function? What can I do so the values in column 'Actual' match those in 'Expecting'? I see two issues with my output. Can anyone help me resolve them?
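If it helps, here is a minimal sketch (with made-up two-row data) of what I think is happening: when a window has an orderBy, Spark's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so count(...).over(W) is a running count, and rows tied on the ordering column all receive the same inflated total.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical two-row example: both rows share the same date in C.
demo = spark.createDataFrame(
    [('king', 'b1', '2023-04-21'), ('king', 'b2', '2023-04-21')],
    ['A', 'B', 'C'])

W = Window.partitionBy("A").orderBy(col("C"))

# The default RANGE frame includes ties, so each row counts the other:
# cnt is 2 for both rows instead of a 0-based per-date index.
demo.withColumn("cnt", F.count("B").over(W)).show()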
Try the dense_rank window function and subtract 1 from the dense_rank value.
Example:
from pyspark.sql import Window
from pyspark.sql.functions import col, to_date, dense_rank

df = spark.createDataFrame([('0','king','4/30/2022'),('1','king','5/17/2022'),
    ('2','king','5/26/2022'),('3','king','6/5/2022'),('4','king','7/8/2022'),
    ('5','king','7/9/2022'),('6','king','4/21/2023'),('7','king','4/21/2023'),
    ('8','king','4/22/2023'),('9','king','8/22/2023')], ['#','A','C'])

W = Window.partitionBy("A").orderBy(col("C"))

# to_date() replaces the string column C (column names are case-insensitive by
# default), so the window orders by real dates; dense_rank() - 1 then gives a
# 0-based rank in which rows with the same date share a value.
df.withColumn("c", to_date(col("C"), "M/dd/yyyy")) \
  .withColumn("Expecting", dense_rank().over(W) - 1) \
  .show()
#+---+----+----------+---------+
#| #| A| c|Expecting|
#+---+----+----------+---------+
#| 0|king|2022-04-30| 0|
#| 1|king|2022-05-17| 1|
#| 2|king|2022-05-26| 2|
#| 3|king|2022-06-05| 3|
#| 4|king|2022-07-08| 4|
#| 5|king|2022-07-09| 5|
#| 6|king|2023-04-21| 6|
#| 7|king|2023-04-21| 6|
#| 8|king|2023-04-22| 7|
#| 9|king|2023-08-22| 8|
#+---+----+----------+---------+
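Note that rows 6 and 7 both get 6: unlike count or row_number, dense_rank() assigns tied dates the same rank and leaves no gap afterwards, which is exactly what makes the subtract-1 values line up with the expected output.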