I need to join two tables by approximate date in such a way that every row from table old
gets matched to only the closest date in table new
- one-to-one. No copies of new
rows are allowed - only match once for the smallest difference.
Here is some example to try:
CREATE TABLE `new` (`ID` int(2), `date` datetime, `new` varchar(1));
INSERT INTO `new` (`ID`, `date`, `new`) VALUES
(1, '2016-03-02 12:20:00', 't'),
(1, '2016-03-07 12:20:00', 'u'),
(1, '2016-04-02 12:20:00', 'v'),
(2, '2016-04-12 11:03:00', 'x');
CREATE TABLE `old` (`ID` int(2), `date` datetime, `old` varchar(1));
INSERT INTO `old` (`ID`, `date`, `old`) VALUES
(1, '2016-03-07 12:20:00', 'a'),
(1, '2016-04-02 12:20:00', 'b'),
(1, '2016-03-01 10:09:00', 'c'),
(1, '2015-04-12 10:09:00', 'd'),
(1, '2016-03-03 12:20:00', 'e');
The output I'm expecting is this:
ID old.date old new.date new
1 2016-03-07 12:20:00 a 2016-03-07 12:20:00 u
1 2016-04-02 12:20:00 b 2016-04-02 12:20:00 v
1 2016-03-01 10:09:00 c NULL NULL
1 2015-04-12 10:09:00 d NULL NULL
1 2016-03-03 12:20:00 e 2016-03-02 12:20:00 t
2 NULL NULL 2016-04-12 11:03:00 x
I was able to get somewhat close to this with:
SELECT * FROM old A LEFT OUTER JOIN new B ON A.ID=B.ID AND ABS(TIMESTAMPDIFF(day, A.date, B.date))<2
UNION
SELECT * FROM old A RIGHT OUTER JOIN new B ON A.ID=B.ID AND ABS(TIMESTAMPDIFF(day, A.date, B.date))<2
ORDER BY old
but clearly this ends up matching multiple rows within the specified time window instead of only the best match. Playing with number of days is not a solution for me, as in practice I have two huge tables to join and this needs to work with a time window in which there would be multiple matches in many rows.
It seems the only way to achieve approximate 1-to-1 join is by using a cursor
within the stored procedure.
Thank you @Strawberry for pointing me in the right direction - you will see pieces of your code reused below. Here is the solution that eventually worked for me. It outputs records sorted differently, but at least it is truly 1-to-1 match.
DROP PROCEDURE IF EXISTS amerge;
DELIMITER //
CREATE PROCEDURE amerge()
BEGIN
/* Necessary declarations */
DECLARE o_ID INT DEFAULT 0;
DECLARE o_date VARCHAR(30) DEFAULT 0;
DECLARE o_old VARCHAR(2);
DECLARE o_mdiff FLOAT;
DECLARE ct INT DEFAULT 0;
DECLARE done INT DEFAULT FALSE;
DECLARE cursor1 CURSOR FOR SELECT ID, date, old, mdiff FROM t1;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
/* Temporary tables */
-- copy of 'old' with diff column = min difference
CREATE TEMPORARY TABLE t1
SELECT old.*,MIN(ABS(TIMESTAMPDIFF(hour, old.date, new.date))) AS mdiff
FROM old JOIN new ON old.ID=new.ID
GROUP BY old.ID, old.date
ORDER BY mdiff ASC;
-- cartesian join with abs(diff) column
CREATE TEMPORARY TABLE t2
SELECT old.ID AS ID_1, old.date AS date_1, new.ID as ID_2, new.date AS date_2, old, new,
ABS(TIMESTAMPDIFF(hour, old.date, new.date)) AS diff
FROM old CROSS JOIN new ON old.ID=new.ID
ORDER BY diff ASC;
-- empty table to fill in with the results
CREATE TEMPORARY TABLE t3
(id_1 INT, date_1 DATETIME, id_2 INT, date_2 DATETIME, old VARCHAR(2), new VARCHAR(2), diff FLOAT);
/* Cursor */
OPEN cursor1;
getparams: LOOP
FETCH cursor1 INTO o_ID, o_date, o_old, o_mdiff;
IF done THEN
LEAVE getparams;
END IF;
SELECT COUNT(*) FROM t2 WHERE t2.ID_1=o_ID AND t2.date_1=o_date AND t2.old=o_old AND t2.diff=o_mdiff INTO ct;
CASE ct
WHEN 0 THEN
INSERT INTO t3 VALUES (o_ID, o_date, NULL, NULL, o_old, NULL, o_mdiff);
ELSE
INSERT INTO t3 SELECT * FROM t2 WHERE t2.ID_1=o_ID AND t2.date_1=o_date AND t2.old=o_old AND t2.diff=o_mdiff LIMIT 1;
END CASE;
DELETE FROM t2 WHERE t2.ID_2=o_ID AND t2.date_2 IN (SELECT date_2 FROM t3 WHERE t3.date_1=o_date);
END LOOP getparams;
CLOSE cursor1;
/* Workaround for error of reopening temp tables in MySQL */
DROP TEMPORARY TABLE t2;
CREATE TEMPORARY TABLE t2
SELECT * FROM t3;
/* Output */
SELECT * FROM t2
UNION
SELECT NULL AS ID_1, NULL AS date_1, new.ID as ID_2, new.date AS date_2, NULL AS old, new.new AS new, NULL AS diff
FROM new LEFT JOIN t3 ON t3.ID_2=new.ID AND t3.date_2 = new.date WHERE t3.ID_2 IS NULL;
END //
DELIMITER ;
CALL amerge();
And the output is (using data from the above example, with PRIMARY key set to ID+date):
id_1 date_1 id_2 date_2 old new diff
1 2016-03-07 12:20:00 1 2016-03-07 12:20:00 a u 0
1 2016-04-02 12:20:00 1 2016-04-02 12:20:00 b v 0
1 2016-03-03 12:20:00 1 2016-03-02 12:20:00 e t 24
1 2016-03-01 10:09:00 NULL c NULL 26
1 2015-04-12 10:09:00 NULL d NULL 7802
NULL 2 2016-04-12 11:03:00 NULL x NULL