sqlsql-serversql-server-2012

Multiple rows aggregated to single row


Business:
I have one manager tied to multiple employees/single employees across different business and Manager and employees are in different rows. We have to map on ID and also on zip codes (both zip5 and zip3).If there is no match on ID, map it on zip5 and if there is no match then on zip3. Manager is main table.

Code:
I tried to segregate manager and employee information into two different tables and then try to left join employee with manager. I have to join on ID Or zip5 or zip3 and business. I tried below logic but the join conditions are messing up the data because of nulls. I also tried max, but it wont work when there are multiple employees for same manager.

DROP TABLE IF EXISTS #table;
create table #table
(ID int,
zip5 int,
zip3 int,
Manager_Region varchar(50),
manager_Name varchar(100),
Employee_name varchar(50),
Employee_region varchar(100),
business varchar(20),
source varchar(20)
)

Insert into #table
select 1234,null,null,'East','Mike',null,null,'Healthcare','Source1'
UNION ALL
select 1234,null,null,null,null,'John','NorthEast','Healthcare','Source1'
UNION ALL
select 1234,null,null,null,null,'Patty','EastCentral','Healthcare','Source1'
UNION ALL
select Null,34589,null,'West','Kat',Null,Null,'dental','Source2'
UNION ALL
select Null,34589,null,null,Null,'Kim','NorthWestern','dental','Source2'
UNION ALL
select Null,null,345,'South','rita',Null,Null,'Healthcare','Source3'
UNION ALL
select Null,null,345,null,Null,'Sam','southCentral','Healthcare','Source3'


DROP TABLE IF EXISTS #Manager_Is_Null;
select * into #Manager_Is_Null
from #table 
where  Manager_Region is null


DROP TABLE IF EXISTS #Employee_is_Null;
select * into #Employee_is_Null
from #table 
where  Employee_Region is null 


select s.ID,s.zip5,s.zip3,
s.business,
s.Manager_Region,
S.Manager_Name,
a.employee_Region,
a.employee_Name
from #Employee_is_Null s
left join #Manager_Is_Null a
on ((isnull(a.ID,'') = isnull(s.ID,'')) OR
        (isnull(a.ID,'') <> isnull(s.ID,'') AND isnull(a.zip5,'') = isnull(s.zip5,'')) OR
        (isnull(a.ID,'') <> isnull(s.ID,'') AND isnull(a.zip5,'') <> isnull(s.zip5,'') AND isnull(a.zip3,'') = isnull(a.zip3,'')))
        and a.business=s.business

Expected Output:

ID zip5 zip3 Manager_Region Manager_Name Employee_Name Employee_Region Business source
1234 null null East Mike John NorthEast Healthcare Source1
1234 null null East Mike Patty EastCentral Healthcare Source1
null 34589 null West Kat Kim NorthWestern dental Source2
null null 345 South rita Sam southCentral Healthcare Source3

How can I acheive this resultset?


Solution

  • Using window functions

    This kind of problem (getting only one "best match" per row to output) can be solved by using window functions, and in particular row_number() over (…) (introduced on SQL Server 2005).

    row_number() over (partition by criteria1, criteria2 order by criteria3, criteria4, criteria5) will compute on the server (and return to you), for each row of the result set (after joins and where have been applied), what its position would be in a sub-resultset consisting only of rows with the same criteria1 and criteria2, ordered by criteria3, criteria4, criteria5.

    Thus in your case, after joining employees to managers,
    if criteria1, criteria2 was your employee's identifier (ID and name?),
    and order by criteria3, criteria4, criteria5 was an order by [do IDs match?], [do zip5s match?], [do zip3s match?],
    then potential employee - manager combinations would be sorted (within each employee) with first employers having the same ID, then employers with different ID but same zip5, then those with different ID and zip5 but same zip3.
    And your row_number() would be the position of the employee - employer combination within the combinations for this employee: thus each employee would have its best matching employer with row_number() = 1.

    However you can't then filter on the results of row_number() through the where (remember I said earlier that window functions where computed after the where?),
    so you'll have to first ask for the full resultset, with its row_number()-generated column, in a subquery (or Common Table Expression), then have a wrapping query to keep only the rows with row_number() = 1.

    Last hint: you don't need *_Is_Not_Null tables, you can use a self-join of #table (#table joined to itself, but each occurrence with a different alias: #table as employee left join #table as manager) and use on and where to ensure the employee alias only contains employees, and employers only employers.

    Your query would then be:

    with
      scored_matches as
      (
        select
          e.ID,e.zip5,e.zip3,
          e.business,
          m.Manager_Region,
          m.Manager_Name,
          e.employee_Region,
          e.employee_Name,
          row_number() over
          (
            partition by e.ID, e.employee_Name
            order by
              case when m.ID = e.ID then 0 end,
              case when m.zip5 = e.zip5 then 0 end,
              case when m.zip3 = e.zip3 then 0 end
          ) match_pos
        from #table e
        left join #table m on m.Manager_name is not null and
        (
             m.ID = e.ID
          or m.zip5 = e.zip5
          or m.zip3 = e.zip3
        )
        where e.Employee_name is not null
      )
    select * from scored_matches
    where match_pos = 1;
    

    which returns:

    ID zip5 zip3 business Manager_Region Manager_Name employee_Region employee_Name match_pos
    null 34589 null dental West Kat NorthWestern Kim 1
    null null 345 Healthcare South rita southCentral Sam 1
    1234 null null Healthcare East Mike NorthEast John 1
    1234 null null Healthcare East Mike EastCentral Patty 1

    (as generated as the last query of this fiddle, on SQL Server 2014 (2012 is really too old, I can't test on it))

    With top 1

    Another similar solution that I won't expand here is to use a select top 1 subquery:

    select …
    from employees
    cross apply (select top 1 … from employers where … order by …) as employers;
    

    which in essence means: for each employee, join to employers but keep only the first employer row (after the order by, thus the best match) matching the employee (the employee-centered selection being done through the where, with clauses like employer.ID = employee.ID or …).

    By carefully joining

    While applying the general method above (which makes sure there is only one manager per employee), I noticed that every employee - manager combination was ranked 1.
    Which means that in your dataset you already have no ambiguity (there are no 2 "manager candidates" for any employee, for example 1 matching the ID and 1 super-manager matching the zip5, with a need to select the most direct manager (the ID-matching one) over the other).

    In that case where there is at most 1 manager matching ID or zip5 or zip3 you don't need the window function and its post-filtering in a wrapping query, you simply can take the inner query:

        select
          e.ID,e.zip5,e.zip3,
          e.business,
          m.Manager_Region,
          m.Manager_Name,
          e.employee_Region,
          e.employee_Name
        from #table e
        left join #table m on m.Manager_name is not null and
        (
             m.ID = e.ID
          or m.zip5 = e.zip5
          or m.zip3 = e.zip3
        )
        where e.Employee_name is not null;
    

    (see it added at the end of this fiddle)

    But I would not recommend it, as it relies heavily on your business constraints (no 2 possible managers per employee) being maintained in your database; on the contrary, row_number() robustly ensures that, even if an employee accidentally has 2 managers, only 1 will get through the filter (even if they have the same level of matching, for example both have the same ID, row_number() will return different values, arbitrarily choosing one as 1 and the other as 2).