I'm trying to find out which classes have the most overlap between them. The data are stored in MySQL and each student has a completely separate row in the database for each class that he/she takes (I didn't configure it and I can't change it). I pasted a simplified version of the table below. In reality there are around 20 different courses.
CREATE TABLE classes
(`student_id` int, `class` varchar(13));
INSERT INTO classes
(`student_id`, `class`)
VALUES
(55421, 'algebra'),
(27494, 'algebra'),
(64934, 'algebra'),
(65364, 'algebra'),
(21102, 'algebra'),
(90734, 'algebra'),
(20103, 'algebra'),
(57450, 'gym'),
(76411, 'gym'),
(24918, 'gym'),
(65364, 'gym'),
(55421, 'gym'),
(89607, 'world_history'),
(54522, 'world_history'),
(49581, 'world_history'),
(84155, 'world_history'),
(55421, 'world_history'),
(57450, 'world_history');
I eventually want to use Circos (background here) but I'd be happy with any method that allowed me to understand and show people where there is the most and least overlap. This is off the top of my head, but I was thinking that I could use an output table with one row and one column for each course and the number of overlaps listed where different classes intersect. Where each course intersected with itself could show the number of people who have no overlap with any other category.
You can do that by generating results to represent links : src -> dst = nb
select c1.class src_class, c2.class dst_class
from (select distinct class from classes) c1
join (select distinct class from classes) c2
order by src_class, dst_class
The "select distinct class" is not necessary to generate matrix, you can just directly select classes and GROUP BY. But, at step 2 we need that unique results.
Result :
src_class dst_class
-----------------------------
algebra algebra
algebra gym
algebra world_history
gym algebra
gym gym
gym world_history
world_history algebra
world_history gym
world_history world_history
select c1.class src_class, c2.class dst_class, count(v.student_id) overlap
from (select distinct class from classes) c1
join (select distinct class from classes) c2
left join classes v on
(
v.class = c1.class
and v.student_id in (select student_id from classes
where class = c2.class)
)
group by src_class, dst_class
order by src_class, dst_class
The distinct values (step 1) allow us to get all classes, even if they are no links (and put 0 instead).
Result :
src_class dst_class overlap
-------------------------------------
algebra algebra 7
algebra gym 2
algebra world_history 1
gym algebra 2
gym gym 5
gym world_history 2
world_history algebra 1
world_history gym 2
world_history world_history 6
select c1.class src_class, c2.class dst_class, count(v.student_id) overlap
from (select distinct class from classes) c1
join (select distinct class from classes) c2
left join classes v on
(
v.class = c1.class and
(
-- When classes are equals
-- Students presents only in that class
(c1.class = c2.class
and 1 = (select count(*) from classes
where student_id = v.student_id))
or
-- When classes are differents
-- Students present in both classes
(c1.class != c2.class
and v.student_id in (select student_id from classes
where class = c2.class))
)
)
group by src_class, dst_class
order by src_class, dst_class
Result :
src_class dst_class overlap
-------------------------------------
algebra algebra 5
algebra gym 2
algebra world_history 1
gym algebra 2
gym gym 2
gym world_history 2
world_history algebra 1
world_history gym 2
world_history world_history 4