scalauser-defined-aggregate

Compute difference between date and its previvous date


I have the following Dataframe:

   ---------+--------+----------+-----------+--------------------+--------------------+-------+-----+------------
|     id|groupid||              field|   oldstring|    newstring|               created|        pkey|   project     
+-------+-------+---------+--------------------+--------+-------------+--------+-------------+-------+-------+
|1451923| 594128|               Team1|    [RLA N1]|   [N1-UO-SB]|  2013-03-29 13:31:...|DSTECH-55770|   10120|
|1451958| 594140|               Team1|    [SEP N2]|     [SEP N2]|  2013-03-29 13:34:...|DSTECH-56998|   10120|   
|1452282| 594308|               Team1|  [N1-UO-SE]|     [SEP N2]|  2013-03-29 14:09:...|DSTECH-57900|   10120|    
|1492252| 610736|               Team1|  [N1-UO-SE]|     [SEP N2]|  2013-04-17 08:48:...|DSTECH-59560|   10120|
|5105082|2304145|               Team1|     [Aucun]|[SEP-SUPPORT]|  2017-09-01 09:46:...|    ECO-9781|   10280|
|5105084|2304145|               Team2|        null|  SEP-SUPPORT|  2017-09-01 09:46:...|    ECO-9781|   10280|
|5105084|2304145|               Team1|    [ISR N2]| SEP-SUPPORT | 2013-03-29 13:31:... |DSTECH-57895|   10120|
|1451926|594129 |               Team1|  [N1-UO-SE]|   [ISR N2]  |2013-03-29 13:55:...  |DSTECH-57895|   10120|
|1452182|594273 |               Team1|  [N1-UO-SE]| [SEPN1-ENV] |2013-03-29 13:43:...  |DSTECH-57895|   10120|

I would like compute the treatment date/time of the [pkey] For exemple, I have these two lines:

|     id|groupid||              field|   oldstring|    newstring|               created|        pkey|        
+-------+-------+---------+--------------------+--------+-------------+--------+-------------+-------+-------+
|1451923| 594128|               Team1|    [RLA N1]|   [N1-UO-SB]|  2013-03-29 13:31:...|DSTECH-55770|   
|1451958| 594140|               Team1|    [SEP N2]|     [SEP N2]|  2013-03-29 13:34:...|DSTECH-56998| 

The treatment date/time of [DSTECH-55770] = [2013-03-29 13:34:...] - [2013-03-29 13:31:...]

How can I to compute this difference with the previous date, I found that I can do it using User Defined Aggregating Functions UDAF. But I do not if this solution is useful to display the difference between two dates in a number (for example: 8h:30min), I do not mean by 8H is at clock 8H but number of hours is 8.

If someone can help me how can I do it using UDAF or if you have another solution ? Thank you


Solution

  • Might be a case for SQL window functions. You can find more details here

    I suspect the resulting code might look something like

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    
    val sparkSession = ...  // Create as do
    import sparkSession.implicits._
    // For the same project, order rows by `created` column
    val partitionWindow = Window.partitionBy("project").orderBy("created".asc)
    // Get me the value of `created` column in next row in a new column called datediff
    val createdTimeNextRowSameProject = lead($"created",
                                             1,                   // 1 = next_row, 2 = 2 rows after, so on 
                                             "CURRENT_TIMESTAMP"  // default if next is null
                                            ).over(partitionWindow)
    val dfWithTimeDiffInSeconds = df.withColumn("datediff", unix_timestamp(leadDate) - unix_timestamp($"created"))
    dfWithTimeDiffInSeconds.show(10)