mysqlsqlcircos

Calculating overlap in MySQL


I'm trying to find out which classes have the most overlap between them. The data are stored in MySQL and each student has a completely separate row in the database for each class that he/she takes (I didn't configure it and I can't change it). I pasted a simplified version of the table below. In reality there are around 20 different courses.

CREATE TABLE classes
(`student_id` int, `class` varchar(13));
INSERT INTO classes
(`student_id`, `class`)
VALUES
(55421, 'algebra'),
(27494, 'algebra'),
(64934, 'algebra'),
(65364, 'algebra'),
(21102, 'algebra'),
(90734, 'algebra'),
(20103, 'algebra'),
(57450, 'gym'),
(76411, 'gym'),
(24918, 'gym'),
(65364, 'gym'),
(55421, 'gym'),
(89607, 'world_history'),
(54522, 'world_history'),
(49581, 'world_history'),
(84155, 'world_history'),
(55421, 'world_history'),
(57450, 'world_history');

I eventually want to use Circos (background here) but I'd be happy with any method that allowed me to understand and show people where there is the most and least overlap. This is off the top of my head, but I was thinking that I could use an output table with one row and one column for each course and the number of overlaps listed where different classes intersect. Where each course intersected with itself could show the number of people who have no overlap with any other category.

Screenshot of a 3x3 matrix from Excel


Solution

  • You can do that by generating results to represent links : src -> dst = nb

    1) Get matrix

    select c1.class src_class, c2.class dst_class
    from (select distinct class from classes) c1
    join (select distinct class from classes) c2
    order by src_class, dst_class
    

    The "select distinct class" is not necessary to generate matrix, you can just directly select classes and GROUP BY. But, at step 2 we need that unique results.

    Result :

    src_class      dst_class
    -----------------------------
    algebra        algebra
    algebra        gym
    algebra        world_history
    gym            algebra
    gym            gym
    gym            world_history
    world_history  algebra
    world_history  gym
    world_history  world_history
    

    2) Join list of students that match the source and destination

    select c1.class src_class, c2.class dst_class, count(v.student_id) overlap
    from (select distinct class from classes) c1
    join (select distinct class from classes) c2
    left join classes v on
    (
        v.class = c1.class
        and v.student_id in (select student_id from classes
                             where class = c2.class)
    )
    group by src_class, dst_class
    order by src_class, dst_class
    

    The distinct values (step 1) allow us to get all classes, even if they are no links (and put 0 instead).

    Result :

    src_class      dst_class      overlap
    -------------------------------------
    algebra        algebra           7
    algebra        gym               2
    algebra        world_history     1
    gym            algebra           2
    gym            gym               5
    gym            world_history     2
    world_history  algebra           1
    world_history  gym               2
    world_history  world_history     6
    

    3 - Make a different calcul if classes are equals

    select c1.class src_class, c2.class dst_class, count(v.student_id) overlap
    from (select distinct class from classes) c1
    join (select distinct class from classes) c2
    left join classes v on
    (
        v.class = c1.class and
        (
            -- When classes are equals
            -- Students presents only in that class
            (c1.class = c2.class
             and 1 = (select count(*) from classes
                      where student_id = v.student_id))
        or
            -- When classes are differents
            -- Students present in both classes
            (c1.class != c2.class
             and v.student_id in (select student_id from classes
                                  where class = c2.class))
        )
    )
    group by src_class, dst_class
    order by src_class, dst_class
    

    Result :

    src_class      dst_class      overlap
    -------------------------------------
    algebra        algebra           5
    algebra        gym               2
    algebra        world_history     1
    gym            algebra           2
    gym            gym               2
    gym            world_history     2
    world_history  algebra           1
    world_history  gym               2
    world_history  world_history     4