sqlmariadbmariadb-10.1

mysql joining one table's data with other tables each row


I was trying to generate a report for a given date range from the following table.

table_columns =>   employee_id |date | status 

where status 1 = not_visited, 2 = visited, 3 = canceled, 4 = pending (for approval) The report should look like the following:

+-------------+------------+-------+-------------+---------+----------+---------+
| employee_id | date       | total | not_visited | visited | canceled | pending |
+-------------+------------+-------+-------------+---------+----------+---------+
|           3 | 2021-06-01 |    10 |          10 |       0 |        0 |       0 |
|           3 | 2021-06-02 |    22 |          10 |       2 |       10 |       0 |
|           3 | 2021-06-03 |    10 |          10 |       0 |        0 |       0 |
|           3 | 2021-06-05 |    11 |          10 |       1 |        0 |       0 |
|           4 | 2021-06-01 |    11 |           8 |       3 |        0 |       0 |
|           5 | 2021-06-01 |    10 |           1 |       9 |        0 |       0 |
+-------------+------------+-------+-------------+---------+----------+---------+

The query for this report is:

select va.employee_id, va.date,
       count(*) as total,
       sum(case when status = 1 then 1 else 0 end) as not_visited,
       sum(case when status = 2 then 1 else 0 end) as visited,
       sum(case when status = 3 then 1 else 0 end) as canceled,
       sum(case when status = 4 then 1 else 0 end) as pending
from visiting_addresses va
where va.date >= '2021-06-01'
  and va.date <= '2021-06-30'
group by va.employee_id, va.date;

If you look at the result, there is no entry for date 2021-06-04 for employee_id = 3. Also there is no data from 2021-06-06 to 2021-06-30 . I will have to include this dates on the result. So I tried to create another query that will generate dates between the given range. The following query will do that

SELECT gen_date
  FROM
    (SELECT v.gen_date
       FROM
         (SELECT ADDDATE('1970-01-01',t4 * 10000 + t3 * 1000 + t2 * 100 + t1 * 10 + t0) gen_date
            FROM
              (SELECT 0 t0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION 
       SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
           SELECT 8 UNION SELECT 9) t0,
          (SELECT 0 t1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
           SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
           SELECT 8 UNION SELECT 9) t1,
          (SELECT 0 t2 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
           SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
           SELECT 8 UNION SELECT 9) t2,
          (SELECT 0 t3 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
           SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
           SELECT 8 UNION SELECT 9) t3,
          (SELECT 0 t4 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
           SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
           SELECT 8 UNION SELECT 9) t4
          ) v
    WHERE v.gen_date BETWEEN '2021-06-01' AND '2021-06-30'
 ) calendar;

This query will generate dates like the following:

+------------+
| gen_date   |
+------------+
| 2021-06-01 |
| 2021-06-02 |
| 2021-06-03 |
| .......... |
| ...........|
| 2021-06-27 |
| 2021-06-28 |
| 2021-06-29 |
| 2021-06-30 |
+------------+

Now The question is, how do I join this above two queries in a way so that for each employee_id, all dates are present in the result? Or Is it even possible in this way? (The actual table contains 5 million rows. employee_id column has a cardinality of 3k++, date and employee_id columns are indexed)


Solution

  • You tagged both MySQL and MariaDB. These two DBMS are relatives, but they are still different DBMS. In MariaDB you can easily generate a series with the built-in seq:

    select date '2021-06-01' + interval seq day as date from seq_0_to_29
    

    In MySQL this is not avalable and you'd probably use a recursive query for this:

    with recursive dates (date) as
    (
      select date '2021-06-01'
      union all
      select date + interval 1 day
      from dates
      where date < date '2021-06-30'
    )
    

    In a recursive query you can of course generate the dates dynamically, e.g. for the last month in your table or, say, for the current and previous month.

    In any SQL dialect you can join queries. In your case you want all dates (generated as shown) combined with either all employees (by selecting from the employee table) or only with employees present in your visiting_addresses table. If you only want employees that have data in your table, use:

    select distinct employee_id from visiting_addresses
    

    In order to get all combinations you'll cross join the two data sets. Then you outer join the data from your table in order to also keep employees/dates without visits.

    The query format is:

    select
      employees.employee_id,
      dates.date,
      visits.total,
      visits.not_visited,
      ...
    from ( <date sequence query here> ) dates
    cross join ( <employee table query here> ) employees
    left outer join ( <visits table query here> ) visits
      on visits.date = dates.date
      and visits.employee_id = employees.employee_id
    order by employees.employee_id, dates.date;
    

    (If you want this for all employees, then simply replace ( <employee table query here> ) employees with the mere table name employees.

    For readability you may prefer WITH clauses:

    with recursive dates (date) as ( <date sequence query here> )
       , employees as ( <employee table query here> )
       , visits as ( <visits table query here> )
    select 
      employees.employee_id,
      dates.date,
      visits.total,
      visits.not_visited,
      ...
    from  dates
    cross join employees
    left outer join visits
      on visits.date = dates.date
      and visits.employee_id = employees.employee_id
    order by employees.employee_id, dates.date;
    

    You have mentioned that your table is quite big. I'd suggest the following index for this query:

    create index idx on visiting_addresses (date, employee_id, status);