python, date, apache-spark, pyspark, apache-spark-sql

Last day of quarter


Is there a way to natively select the last day of each quarter in PySpark? For example, in a df containing two columns, yyyy_mm_dd and sum, how could I return sum for the last day of each quarter? For the current / in-progress quarter it would be good to show the max date instead.

I looked at this solution, Get First Date and Last Date of Current Quarter in Python?, and it works, but I was wondering if there is a solution using PySpark syntax rather than a udf.


Solution

  • Using a similar approach to the one in this answer:

    from pyspark.sql import functions as F

    # Roll forward to the start of the next quarter, then step back one day
    df2 = df.withColumn(
        'last_day',
        F.expr("""
            to_date(
                date_trunc('quarter', to_date(input_date) + interval 3 months)
            ) - interval 1 day
        """)
    )
    
    df2.show()
    +----------+----------+
    |input_date|  last_day|
    +----------+----------+
    |2020-01-21|2020-03-31|
    |2020-02-06|2020-03-31|
    |2020-04-15|2020-06-30|
    |2020-07-10|2020-09-30|
    |2020-10-20|2020-12-31|
    |2021-02-04|2021-03-31|
    +----------+----------+
    

    Then you can filter the rows where input_date == last_day.
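
    A minimal sketch of that filter (assuming the df2 built above, with an input_date column; adjust the names to your own df):

    # Keep only the rows that fall on the last day of their quarter
    last_days = df2.filter(F.col('input_date') == F.col('last_day'))
    last_days.show()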


    Edit: I might have misunderstood the question. You can try this approach instead, using a window partitioned by year and quarter and selecting the last row in each quarter:

    from pyspark.sql import functions as F, Window
    
    # Rank rows within each (year, quarter), latest date first
    df2 = df.withColumn(
        'rn',
        F.row_number().over(
            Window.partitionBy(F.year('input_date'), F.quarter('input_date'))
                  .orderBy(F.desc('input_date'))
        )
    )
    
    df2.show()
    +----------+---+
    |input_date| rn|
    +----------+---+
    |2021-02-04|  1|
    |2020-10-20|  1|
    |2020-07-10|  1|
    |2020-02-06|  1|
    |2020-01-21|  2|
    |2020-04-15|  1|
    +----------+---+
    

    Then filter the rows with rn = 1; each of those is the last available day in its quarter, which for the current, in-progress quarter is the max date seen so far.
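
    For example, a sketch of that final filter (again assuming the df2 with the rn column from above; with the question's original columns this would keep yyyy_mm_dd and sum for the last available day of each quarter):

    # One row per (year, quarter): the latest date present in that quarter
    result = df2.filter(F.col('rn') == 1).drop('rn')
    result.show()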