apache-spark, apache-spark-sql, databricks

Correct way to get the last value for a field in Apache Spark or Databricks using SQL (correct behavior of last and last_value)?


What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like they should return the last value of whatever is in the expression.

So if I have a select statement that does something like

select 
  person,
  last(team)
from
  (select * from person_team order by date_joined)
group by person

I should get the last team a person joined, yes/no?
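
For concreteness, here's a made-up miniature of the kind of table I mean, and what I'd expect back (column names match the query above):

-- hypothetical sample data
create or replace temporary view person_team as
select * from values
  ('sam', 'Reds',    '2018-04-01'),
  ('sam', 'Zebras',  '2019-07-15'),
  ('sam', 'Yankees', '2021-02-01'),
  ('kim', 'A''s',    '2020-09-30')
  as person_team(person, team, date_joined);

-- expectation: last(team) = 'Yankees' for sam, 'A''s' for kim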

The actual query I'm running is shown below. It is returning a different number each time I execute the query.

select count(distinct patient_id) from (
  select
    patient_id,
    org_patient_id,
    last_value(data_lot) data_lot
  from
    (select * from my_table order by data_lot)
  where 1=1
    and org = 'my_org'
  group by 1,2
  order by 1,2
)
where data_lot in ('2021-01','2021-02')
;

What is the correct way to get the last value for a given field (for either the team example or my specific example)?

--- EDIT -------------------

I'm thinking collect_set might be useful here, but I get the error shown below when I try to run this:

select
  patient_id,
  last_value(collect_set(data_lot)) data_lot
from
  covid.demo
group by patient_id
;


Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;; Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848] +- SubqueryAlias spark_catalog.covid.demo
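
Reading that message, presumably it wants the collect_set moved into a sub-query, something like the sketch below (unverified, and it assumes sorting the lot strings gives chronological order, which holds for the 'YYYY-MM' format):

select
  patient_id,
  element_at(lots, -1) as data_lot  -- last element of the sorted array
from (
  select
    patient_id,
    sort_array(collect_set(data_lot)) as lots
  from covid.demo
  group by patient_id
);

That sidesteps the error, but it still feels like a workaround rather than the intended way.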

The posts linked below discuss how to get max values, which is not the same as the last value in a list ordered by a different field: I want the last team a player joined. If a player joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, I'm looking for the Yankees. Those posts also reach the solution procedurally in Python/R; I'd like to do this in SQL.

Getting last value of group in Spark

Find maximum row per group in Spark DataFrame

--- SECOND EDIT -------------------

I ended up using something like this based upon the accepted answer.

  select 
    row_number() over (order by provided_date, data_lot) as row_num,
    demo.*
  from demo

Solution

  • You can assign row numbers based on an ordering on data_lot if you want to get its last value:

    select count(distinct patient_id) from (
        select * from (
            select *,
                row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
            from my_table
            where org = 'my_org'
        )
        where rn = 1
    ) 
    where data_lot in ('2021-01','2021-02');
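
  • The same pattern answers the original person/team question. On Spark 3.0+ (and recent Databricks runtimes) max_by expresses it even more directly; both sketches below assume date_joined sorts chronologically (a date or an ISO-formatted string):

    -- last team per person, via row_number
    select person, team as last_team
    from (
        select *,
            row_number() over (partition by person order by date_joined desc) as rn
        from person_team
    )
    where rn = 1;

    -- equivalent on Spark 3.0+: max_by returns the team paired with the max date_joined
    select person, max_by(team, date_joined) as last_team
    from person_team
    group by person;

    Either form avoids last/last_value as plain aggregates, which are non-deterministic in Spark because row order after a shuffle is not guaranteed.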