mysqlquery-help

Calculate sum of absolute difference in rows by groups


I have a problem which I have been able to solve using Stata, but now my data has grown to a size that I cannot process it from memory any more. I hope to do this in MySQL instead. I am trying to calculate the Manhattan distance of items between n groups. I have manipulated the data so far that I hope it is ready to work for the computation:

SELECT * FROM exampleshares;

+----------+-------------+-------------+
| item     | group       | share       |
+----------+-------------+-------------+
| A        | group1      |  .3         |
| B        | group1      |  .7         |
| A        | group2      |  .2         |
| B        | group2      |  .6         |
| C        | group2      |  .2         |
| A        | group3      |  .3         |
| C        | group3      |  .6         |
+----------+-------------+-------------+

The Manhattan Distance of this example would be :

+----------+-------------+-------------+
| groupX   | groupY      | M distance  |
+----------+-------------+-------------+
| group1   | group1      | 0           |
| group1   | group2      |  .4         |
| group1   | group3      | 1.3         |
| group2   | group1      |  .4         |
| group2   | group2      | 0           |
| group2   | group3      | 1.1         |
| group3   | group1      | 1.3         |
| group3   | group2      | 1.1         |
| group3   | group3      | 0           |
+----------+-------------+-------------+

For example, the distance between group1 and group2 is computed as |.3-.2|+|.7-.6|+|0-.2|=0.4, ie. the sum of the absolute difference in shares. How do I do this in MySQL?

During my search if found a couple of solutions to calculating the difference to a previous row by group, but nothing to what I am looking for specifically.


Solution

  • I believe you will have to use a stored routine or some other script to accomplish this. Here is a stored routine that will do it:

    delimiter //
    drop procedure if exists manhattanDistance//
    create procedure manhattanDistance (in startGroup char(32), in endGroup char(32), out manhattanDistance decimal(2,1))
        not deterministic
        reads sql data
    begin
      drop table if exists tmp_items;
      create temporary table tmp_items as select distinct item from exampleshares;
    
      select sum(abs(ifnull(es1.share, 0) - ifnull(es2.share, 0))) into manhattanDistance
        from tmp_items ti
        left join exampleshares es1 on es1.item = ti.item and es1.group = startGroup
        left join exampleshares es2 on es2.item = ti.item and es2.group = endGroup;
    end//
    delimiter ;
    
    call manhattanDistance('group1', 'group2', @distanceBetweenGroup1And2);
    select @distanceBetweenGroup1And2;