mysqldatetimejoinapproximate

MySQL join 1-to-1 by closest datetime


I need to join two tables by approximate date in such a way that every row from table old gets matched to only the closest date in table new - one-to-one. No copies of new rows are allowed - only match once for the smallest difference.

Here is some example to try:

CREATE TABLE `new` (`ID` int(2), `date` datetime, `new` varchar(1));
INSERT INTO `new` (`ID`, `date`, `new`) VALUES
(1, '2016-03-02 12:20:00', 't'),
(1, '2016-03-07 12:20:00', 'u'),
(1, '2016-04-02 12:20:00', 'v'),
(2, '2016-04-12 11:03:00', 'x');

CREATE TABLE `old` (`ID` int(2), `date` datetime, `old` varchar(1));
INSERT INTO `old` (`ID`, `date`, `old`) VALUES
(1, '2016-03-07 12:20:00', 'a'),
(1, '2016-04-02 12:20:00', 'b'),
(1, '2016-03-01 10:09:00', 'c'),
(1, '2015-04-12 10:09:00', 'd'),
(1, '2016-03-03 12:20:00', 'e');

The output I'm expecting is this:

   ID            old.date  old            new.date  new
    1 2016-03-07 12:20:00    a 2016-03-07 12:20:00    u
    1 2016-04-02 12:20:00    b 2016-04-02 12:20:00    v
    1 2016-03-01 10:09:00    c                NULL NULL
    1 2015-04-12 10:09:00    d                NULL NULL
    1 2016-03-03 12:20:00    e 2016-03-02 12:20:00    t
    2                NULL NULL 2016-04-12 11:03:00    x

I was able to get somewhat close to this with:

SELECT * FROM old A LEFT OUTER JOIN new B ON A.ID=B.ID AND ABS(TIMESTAMPDIFF(day, A.date, B.date))<2
UNION
SELECT * FROM old A RIGHT OUTER JOIN new B ON A.ID=B.ID AND ABS(TIMESTAMPDIFF(day, A.date, B.date))<2
ORDER BY old

but clearly this ends up matching multiple rows within the specified time window instead of only the best match. Playing with number of days is not a solution for me, as in practice I have two huge tables to join and this needs to work with a time window in which there would be multiple matches in many rows.


Solution

  • It seems the only way to achieve approximate 1-to-1 join is by using a cursor within the stored procedure.

    Thank you @Strawberry for pointing me in the right direction - you will see pieces of your code reused below. Here is the solution that eventually worked for me. It outputs records sorted differently, but at least it is truly 1-to-1 match.

    DROP PROCEDURE IF EXISTS amerge;
    
    DELIMITER //
    
    CREATE PROCEDURE amerge()
    
    BEGIN
      /* Necessary declarations */
      DECLARE o_ID INT DEFAULT 0;
      DECLARE o_date VARCHAR(30) DEFAULT 0;
      DECLARE o_old VARCHAR(2);
      DECLARE o_mdiff FLOAT;
      DECLARE ct INT DEFAULT 0;
      DECLARE done INT DEFAULT FALSE;
      DECLARE cursor1 CURSOR FOR SELECT ID, date, old, mdiff FROM t1;
      DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    
      /* Temporary tables */
      -- copy of 'old' with diff column = min difference
      CREATE TEMPORARY TABLE t1
        SELECT old.*,MIN(ABS(TIMESTAMPDIFF(hour, old.date, new.date))) AS mdiff
          FROM old JOIN new ON old.ID=new.ID
          GROUP BY old.ID, old.date
          ORDER BY mdiff ASC;
    
      -- cartesian join with abs(diff) column
        CREATE TEMPORARY TABLE t2
        SELECT old.ID AS ID_1, old.date AS date_1, new.ID as ID_2, new.date AS date_2, old, new,
            ABS(TIMESTAMPDIFF(hour, old.date, new.date)) AS diff
          FROM old CROSS JOIN new ON old.ID=new.ID
          ORDER BY diff ASC;
    
      -- empty table to fill in with the results
      CREATE TEMPORARY TABLE t3
        (id_1 INT, date_1 DATETIME, id_2 INT, date_2 DATETIME, old VARCHAR(2), new VARCHAR(2), diff FLOAT);
    
      /* Cursor */
      OPEN cursor1;
      getparams: LOOP
            FETCH cursor1 INTO o_ID, o_date, o_old, o_mdiff;
            IF done THEN
                LEAVE getparams;
            END IF;
          SELECT COUNT(*) FROM t2 WHERE t2.ID_1=o_ID AND t2.date_1=o_date AND t2.old=o_old AND t2.diff=o_mdiff INTO ct;
        CASE ct
            WHEN 0 THEN
              INSERT INTO t3 VALUES (o_ID, o_date, NULL, NULL, o_old, NULL, o_mdiff);
            ELSE
              INSERT INTO t3 SELECT * FROM t2 WHERE t2.ID_1=o_ID AND t2.date_1=o_date AND t2.old=o_old AND t2.diff=o_mdiff LIMIT 1;
        END CASE;
        DELETE FROM t2 WHERE t2.ID_2=o_ID AND t2.date_2 IN (SELECT date_2 FROM t3 WHERE t3.date_1=o_date);
        END LOOP getparams;
        CLOSE cursor1;
    
      /* Workaround for error of reopening temp tables in MySQL */
      DROP TEMPORARY TABLE t2;
      CREATE TEMPORARY TABLE t2
          SELECT * FROM t3;
    
      /* Output */
      SELECT * FROM t2
      UNION
      SELECT NULL AS ID_1, NULL AS date_1, new.ID as ID_2, new.date AS date_2, NULL AS old, new.new AS new, NULL AS diff
        FROM new LEFT JOIN t3 ON t3.ID_2=new.ID AND t3.date_2 = new.date WHERE t3.ID_2 IS NULL;
    
    
    END //
    
    DELIMITER ;
    
    CALL amerge();
    

    And the output is (using data from the above example, with PRIMARY key set to ID+date):

    id_1             date_1 id_2             date_2  old  new diff  
    1   2016-03-07 12:20:00 1   2016-03-07 12:20:00    a    u    0  
    1   2016-04-02 12:20:00 1   2016-04-02 12:20:00    b    v    0  
    1   2016-03-03 12:20:00 1   2016-03-02 12:20:00    e    t   24  
    1   2016-03-01 10:09:00                    NULL    c NULL   26  
    1   2015-04-12 10:09:00                    NULL    d NULL 7802  
                       NULL 2   2016-04-12 11:03:00 NULL    x NULL