group-bymaxmariadb-10.3

Get max id from grouping two columns as pair


i have search a lot to find a solution to get the max id using group by of two columns as a pair in a data set but none of the queries i have found and used worked as expected. Below is an example data set:

id tour_id p1 stage rnd assoc1 p2 assoc2 winner
996057 5277 107028 Main Draw 32 GER 110673 IRI 107028
996101 5277 107028 Main Draw 16 GER 105136 FRA 107028
996126 5277 107028 Main Draw 8 GER 112074 SWE 107028
996133 5277 107028 Main Draw 4 GER 123980 JPN 107028
996139 5277 107028 Main Draw 2 GER 121582 TPE 107028
996037 5277 116620 Main Draw 32 GER 121582 TPE 121582
996037 5277 121582 Main Draw 32 TPE 116620 GER 121582
996097 5277 121582 Main Draw 16 TPE 104314 IND 121582
996121 5277 121582 Main Draw 8 TPE 112092 NGR 121582
996132 5277 121582 Main Draw 4 TPE 112062 FRA 121582
996139 5277 121582 Main Draw 2 TPE 107028 GER 107028
996324 5278 107028 Main Draw 32 GER 100439 EGY 107028
996362 5278 107028 Main Draw 16 GER 104314 IND 107028
996379 5278 107028 Main Draw 8 GER 116853 SWE 107028
996390 5278 107028 Main Draw 4 GER 123980 JPN 123980
996283 5278 116620 Main Draw 64 GER 121514 KOR 121514
996313 5278 121582 Main Draw 32 TPE 106296 POR 121582
996357 5278 121582 Main Draw 16 TPE 102968 AUT 121582
996380 5278 121582 Main Draw 8 TPE 102761 GER 102761
998765 5299 101222 Main Draw 64 GER 118671 DEN 101222
998788 5299 101222 Main Draw 32 GER 102380 ENG 101222
998801 5299 101222 Main Draw 16 GER 116620 GER 101222
998807 5299 101222 Main Draw 8 GER 116853 SWE 101222
998810 5299 101222 Main Draw 4 GER 112074 SWE 101222
998812 5299 101222 Main Draw 2 GER 107028 GER 101222
998773 5299 107028 Main Draw 64 GER 120168 TUR 107028
998797 5299 107028 Main Draw 32 GER 102891 CRO 107028
998805 5299 107028 Main Draw 16 GER 104379 SWE 107028
998809 5299 107028 Main Draw 8 GER 104036 CZE 107028
998811 5299 107028 Main Draw 4 GER 102841 POR 107028
998812 5299 107028 Main Draw 2 GER 101222 GER 101222
998757 5299 116620 Main Draw 64 GER 101192 ITA 116620
998794 5299 116620 Main Draw 32 GER 115449 AUT 116620
998801 5299 116620 Main Draw 16 GER 101222 GER 101222

What I would like to get is the following output which is basically the max(id) of the grouping of p1 and tour_id

id tour_id p1 stage rnd assoc1 p2 assoc2 winner
996139 5277 107028 Main Draw 2 GER 121582 TPE 107028
996037 5277 116620 Main Draw 32 GER 121582 TPE 121582
996139 5277 121582 Main Draw 2 TPE 107028 GER 107028
996390 5278 107028 Main Draw 4 GER 123980 JPN 123980
996283 5278 116620 Main Draw 64 GER 121514 KOR 121514
996380 5278 121582 Main Draw 8 TPE 102761 GER 102761
998812 5299 101222 Main Draw 2 GER 107028 GER 101222
998812 5299 107028 Main Draw 2 GER 101222 GER 101222
998801 5299 116620 Main Draw 16 GER 101222 GER 101222

Any help is appreciated.


Solution

  • Generally, I would take a simple query to get the max id for the conditions, and then either use it as a subquery or join, depending on the use case. Take a look at this fiddle:

    https://dbfiddle.uk/K1wM0gEK

    I've inserted your data and then a series of queries. Here's the first one, just to get the maxID for each combination of tour_id and p1:

    select tour_id, p1, max(id) as maxID 
    from t group by tour_id, p1;
    

    which you can then use in a subquery to retrieve any rows that match those IDs like so:

    select * from t
    where id in (
      select max(id)
      from t group by tour_id, p1
    );
    

    or as a JOIN:

    select t.* from t
    join (
      select max(id) as maxID
      from t group by tour_id, p1
    ) ids on t.id = ids.maxID;
    

    JOINs are usually more performant than IN for larger data sets, but that is not a hard and fast rule and the line really isn't well defined. I've included it here just for reference.

    Now, these queries SHOULD be returning the same results, but it seems that the ID you're fetching the max value for isn't a unique ID, so they aren't, and it really depends on what you are trying to accomplish as to which answer is right. Here's one more option using window functions, which are really overkill for this, but let's look:

    select tour_id, p1, 
      first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
      first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
      first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
      first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
      first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
      first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
      first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
    from t 
    

    Now this returns a LOT more rows, but a lot of them are duplicates, so let's add DISTINCT to just get the uniques:

    select DISTINCT tour_id, p1, 
      first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
      first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
      first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
      first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
      first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
      first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
      first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
    from t 
    

    and now we're down to something that looks a little more like what you were after. For comparison, I have included the three queries side by side, ordered by id and with the columns all in the same order:

    select DISTINCT 
      first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
      tour_id, p1, 
      first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
      first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
      first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
      first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
      first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
      first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
    from t order by 1;
      
    select * from t
    where id in (
      select max(id)
      from t group by tour_id, p1
    ) order by id;
    
    select t.* from t
    join (
      select max(id) as maxID
      from t group by tour_id, p1
    ) ids on t.id = ids.maxID
    order by t.id;
    

    The result set using the window functions seems to have the same output as you're looking for, but let me say that it seems like window functions are overkill for a case this simple, so I'm wondering if you need some unique ID instead. If you don't have a unique primary (autoincrementing) ID in your table(s), you should. It will save you a lot of headache at some point down the road. If you do, I wonder why we aren't using that instead of the non-unique one.

    Let me know if this helps, or if anything is unclear.