I have the following DataFrame:
+-------+-------+------+-----------+-------------+--------------------+------------+-------+
|     id|groupid| field|  oldstring|    newstring|             created|        pkey|project|
+-------+-------+------+-----------+-------------+--------------------+------------+-------+
|1451923| 594128| Team1|   [RLA N1]|   [N1-UO-SB]|2013-03-29 13:31:...|DSTECH-55770|  10120|
|1451958| 594140| Team1|   [SEP N2]|     [SEP N2]|2013-03-29 13:34:...|DSTECH-56998|  10120|
|1452282| 594308| Team1| [N1-UO-SE]|     [SEP N2]|2013-03-29 14:09:...|DSTECH-57900|  10120|
|1492252| 610736| Team1| [N1-UO-SE]|     [SEP N2]|2013-04-17 08:48:...|DSTECH-59560|  10120|
|5105082|2304145| Team1|    [Aucun]|[SEP-SUPPORT]|2017-09-01 09:46:...|    ECO-9781|  10280|
|5105084|2304145| Team2|       null|  SEP-SUPPORT|2017-09-01 09:46:...|    ECO-9781|  10280|
|5105084|2304145| Team1|   [ISR N2]|  SEP-SUPPORT|2013-03-29 13:31:...|DSTECH-57895|  10120|
|1451926| 594129| Team1| [N1-UO-SE]|    [ISR N2]|2013-03-29 13:55:...|DSTECH-57895|  10120|
|1452182| 594273| Team1| [N1-UO-SE]| [SEPN1-ENV]|2013-03-29 13:43:...|DSTECH-57895|  10120|
+-------+-------+------+-----------+-------------+--------------------+------------+-------+
I would like to compute the treatment date/time of each [pkey].
For example, I have these two lines:
+-------+-------+------+----------+-----------+--------------------+------------+
|     id|groupid| field| oldstring|  newstring|             created|        pkey|
+-------+-------+------+----------+-----------+--------------------+------------+
|1451923| 594128| Team1|  [RLA N1]| [N1-UO-SB]|2013-03-29 13:31:...|DSTECH-55770|
|1451958| 594140| Team1|  [SEP N2]|    [SEP N2]|2013-03-29 13:34:...|DSTECH-56998|
+-------+-------+------+----------+-----------+--------------------+------------+
The treatment date/time of [DSTECH-55770] = [2013-03-29 13:34:...] - [2013-03-29 13:31:...]
How can I compute this difference with the previous row's date? I found that I could do it with a User Defined Aggregate Function (UDAF), but I don't know whether that approach can display the difference between two dates as a duration (for example 8h:30min; by 8h I don't mean 8 o'clock, but a total of 8 elapsed hours).
If someone can show me how to do it with a UDAF, or suggest another solution, thank you.
Might be a case for SQL window functions. You can find more details here
I suspect the resulting code might look something like
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val sparkSession = ... // create the session as usual
import sparkSession.implicits._
// For the same project, order rows by `created` column
val partitionWindow = Window.partitionBy("project").orderBy($"created".asc)

// Get the value of the `created` column from the next row, as a new column
val createdTimeNextRowSameProject = lead($"created",
  1 // offset: 1 = next row, 2 = two rows after, and so on
).over(partitionWindow) // null for the last row of each partition

// Difference in seconds between the next row's `created` and this row's
val dfWithTimeDiffInSeconds = df.withColumn(
  "datediff",
  unix_timestamp(createdTimeNextRowSameProject) - unix_timestamp($"created")
)
dfWithTimeDiffInSeconds.show(10)
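To get the 8h:30min style of output you asked about (elapsed hours, not a clock time), the seconds value can be formatted with a small helper. A sketch, where `formatDuration` is a hypothetical name and the commented-out UDF usage assumes the `dfWithTimeDiffInSeconds` DataFrame from the code above:

```scala
// Format a duration given in seconds as "<hours>h:<minutes>min".
// The hours are total elapsed hours, not a time of day:
// 30600 seconds is 8 hours and 30 minutes.
def formatDuration(seconds: Long): String = {
  val hours   = seconds / 3600
  val minutes = (seconds % 3600) / 60
  s"${hours}h:${minutes}min"
}

// Applied to the DataFrame via a plain UDF (no UDAF needed, since each
// row is formatted independently of the others):
// val formatDurationUdf = udf(formatDuration _)
// dfWithTimeDiffInSeconds.withColumn("duration", formatDurationUdf($"datediff"))
```

Note that a UDAF is not required here: the aggregation-like step (looking at the next row) is already handled by the window function, and the formatting is a per-row operation.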