I am currently trying to practice with some data manipulation procedures and have faced with the problem of how to make subset based on special condition.
Let's assume that the dataframe looks like this:
Name ID ContractDate LoanSum DurationOfDelay
A ID1 2023-01-01 10 10
A ID1 2023-01-03 15 15
A ID1 2022-12-29 20 0
A ID1 2022-12-28 40 0
B ID2 2023-01-05 15 19
B ID2 2023-01-10 30 0
B ID2 2023-01-07 35 25
B ID2 2023-01-06 35 0
My goal is to display for each unique ID (or Name) the index number of the loan issued first with DurationOfDelay > 0
Expected result:
Name ID IndexNum
A ID1 3
B ID2 1
Explanation: For ID1 four loans were issued: on 2022-12-28, 2022-12-29, 2023-01-01 and 2023-01-03. We can identify the existence of DurationOfDelay > 0 first on 2023-01-01, and this is the third loan issued to the borrower.
For ID2 also four loans were issued: on 2023-01-05, 2023-01-06, 2023-01-07 and 2023-01-10. We can identify the existence of DurationOfDelay > 0 first on 2023-01-05, and this is the firstloan issued to the borrower.
What I have done so far:
window_spec_subset = Window.partitionBy('ID').orderBy('ContractDate')
subset = df.filter(F.col('DurationOfDelay') > 0) \
.withColumn('row_num', F.row_number().over(window_spec_subset)) \
.filter(F.col('row_num') == 1) \
.drop('row_num')
subset.show()
+----+---+------------+-------+---------------+
|Name| ID|ContractDate|LoanSum|DurationOfDelay|
+----+---+------------+-------+---------------+
| A|ID1| 2023-01-01| 10| 10|
| B|ID2| 2023-01-05| 15| 19|
+----+---+------------+-------+---------------+
This code allows me to group the data in such a way that for each borrower only the loan issued first with DurationOfDelay > 0 is returned.
But I'm stacked to display the index number of the loan issued first with DurationOfDelay > 0 instead.
Would you be so kind to help me achieve these results? Any kind of help is highly appreciated!
Probably not the most efficient but it should work:
subset = (df.withColumn('IndexNum', F.row_number().over(window_spec_subset))
.filter(F.col('DurationOfDelay') > 0)
.withColumn('row_num', F.row_number().over(window_spec_subset))
.filter(F.col('row_num') == 1)
.drop('row_num'))
Output:
>>> subset.show()
+----+---+------------+-------+---------------+--------+
|Name| ID|ContractDate|LoanSum|DurationOfDelay|IndexNum|
+----+---+------------+-------+---------------+--------+
| A|ID1| 2023-01-01| 10| 10| 3|
| B|ID2| 2023-01-05| 15| 19| 1|
+----+---+------------+-------+---------------+--------+