sqlhivepartition-by

Using Partition By to compare values based on date in Hive SQL


I have a table called referrals

CREATE TABLE [dataproduct].[referrals]
(
      [person_id] DOUBLE,
      [medical_service_cd] DOUBLE,
      [refer_from_organization_id] 
      [referral_org_name] STRING, 
      [refer_from_provider_id] DOUBLE, 
      [provider_from_name] STRING,  
      [refer_to_provider_id] DOUBLE,
      [provider_to_name] STRING,  
      [outbound_encntr_id DOUBLE,
      [order_id] DOUBLE,
      [referral_written_dt_tm] STRING,
      [requested_start_dt_tm] STRING,  
      [medical_service] STRING,  
      [referral_status] STRING, 
      [loc_code] STRING,  
      [service_type] STRING,
      [referral_reason] STRING,  
      [treatment_text] STRING

)

I want to find patients, by person_id, who were referred to Neurology FOLLOWED BY Medical Genetics (as listed under medical_service) using referral_written_dt_tm. I confirmed that some patients had both medical services, but want to find the ones who had them in the desired order. so I tried this:

  SELECT r.person_id, 
         COUNT(DISTINCT r.medical_service) as count,
         row_number() OVER (PARTITION BY person_id ORDER BY referral_written_dt_tm ASC) AS row_num 
    FROM matt_dataproduct.referrals r
   WHERE r.medical_service IN ('Medical Genetics',
                               'Neurology'
                               )
GROUP BY r.person_id,
  HAVING count > 1;

I got no results. I just want a list of patients who meet the criteria (medical genetics followed by neurology). Is there some way I should revise my query? Something about it seems off.


Solution

  • One way is probably something like the following:

    SELECT r.person_id 
    FROM matt_dataproduct.referrals rmg INNER JOIN matt_dataproduct.referrals rn
     ON rmg.person_id = rn.person_id
    WHERE rmg.medical_service ='Medical Genetics'
     AND rn.medical_service = 'Neurology'
     AND rn.eferral_written_dt_tm < rmg.eferral_written_dt_tm
    GROUP BY r.person_id
    

    This simply joins the table to itself (once for Neurology and once for Medical Genetics) and finds rows where the same patient has Neurology before Medical Genetics. The group by is being used to get distinct values (could use distinct in the select part instead).

    You said the type of eferral_written_dt_tm is a string (not an actual date). I'm assuming it's directly comparable without conversion (since you used it in your window function's order by), but if not you would replace rn.eferral_written_dt_tm < rmg.eferral_written_dt_tm with the appropriate functions or conversions so that it's comparable.

    You could use the row_number window function as you tried (or maybe a variation with lag) by wrapping it in another query to compare values. But in this case I don't see any big advantage over the simple self-join.