I was trying to generate a report for a given date range from the following table.
table_columns => employee_id |date | status
where status 1 = not_visited, 2 = visited, 3 = canceled, 4 = pending (for approval) The report should look like the following:
+-------------+------------+-------+-------------+---------+----------+---------+
| employee_id | date | total | not_visited | visited | canceled | pending |
+-------------+------------+-------+-------------+---------+----------+---------+
| 3 | 2021-06-01 | 10 | 10 | 0 | 0 | 0 |
| 3 | 2021-06-02 | 22 | 10 | 2 | 10 | 0 |
| 3 | 2021-06-03 | 10 | 10 | 0 | 0 | 0 |
| 3 | 2021-06-05 | 11 | 10 | 1 | 0 | 0 |
| 4 | 2021-06-01 | 11 | 8 | 3 | 0 | 0 |
| 5 | 2021-06-01 | 10 | 1 | 9 | 0 | 0 |
+-------------+------------+-------+-------------+---------+----------+---------+
The query for this report is:
select va.employee_id, va.date,
count(*) as total,
sum(case when status = 1 then 1 else 0 end) as not_visited,
sum(case when status = 2 then 1 else 0 end) as visited,
sum(case when status = 3 then 1 else 0 end) as canceled,
sum(case when status = 4 then 1 else 0 end) as pending
from visiting_addresses va
where va.date >= '2021-06-01'
and va.date <= '2021-06-30'
group by va.employee_id, va.date;
If you look at the result, there is no entry for date 2021-06-04
for employee_id = 3. Also there is no data from 2021-06-06 to 2021-06-30 . I will have to include this dates on the result. So I tried to create another query that will generate dates between the given range. The following query will do that
SELECT gen_date
FROM
(SELECT v.gen_date
FROM
(SELECT ADDDATE('1970-01-01',t4 * 10000 + t3 * 1000 + t2 * 100 + t1 * 10 + t0) gen_date
FROM
(SELECT 0 t0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
SELECT 8 UNION SELECT 9) t0,
(SELECT 0 t1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
SELECT 8 UNION SELECT 9) t1,
(SELECT 0 t2 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
SELECT 8 UNION SELECT 9) t2,
(SELECT 0 t3 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
SELECT 8 UNION SELECT 9) t3,
(SELECT 0 t4 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION
SELECT 8 UNION SELECT 9) t4
) v
WHERE v.gen_date BETWEEN '2021-06-01' AND '2021-06-30'
) calendar;
This query will generate dates like the following:
+------------+
| gen_date |
+------------+
| 2021-06-01 |
| 2021-06-02 |
| 2021-06-03 |
| .......... |
| ...........|
| 2021-06-27 |
| 2021-06-28 |
| 2021-06-29 |
| 2021-06-30 |
+------------+
Now The question is, how do I join this above two queries in a way so that for each employee_id, all dates are present in the result? Or Is it even possible in this way? (The actual table contains 5 million rows. employee_id column has a cardinality of 3k++, date and employee_id columns are indexed)
You tagged both MySQL and MariaDB. These two DBMS are relatives, but they are still different DBMS. In MariaDB you can easily generate a series with the built-in seq
:
select date '2021-06-01' + interval seq day as date from seq_0_to_29
In MySQL this is not avalable and you'd probably use a recursive query for this:
with recursive dates (date) as
(
select date '2021-06-01'
union all
select date + interval 1 day
from dates
where date < date '2021-06-30'
)
In a recursive query you can of course generate the dates dynamically, e.g. for the last month in your table or, say, for the current and previous month.
In any SQL dialect you can join queries. In your case you want all dates (generated as shown) combined with either all employees (by selecting from the employee table) or only with employees present in your visiting_addresses table. If you only want employees that have data in your table, use:
select distinct employee_id from visiting_addresses
In order to get all combinations you'll cross join the two data sets. Then you outer join the data from your table in order to also keep employees/dates without visits.
The query format is:
select
employees.employee_id,
dates.date,
visits.total,
visits.not_visited,
...
from ( <date sequence query here> ) dates
cross join ( <employee table query here> ) employees
left outer join ( <visits table query here> ) visits
on visits.date = dates.date
and visits.employee_id = employees.employee_id
order by employees.employee_id, dates.date;
(If you want this for all employees, then simply replace ( <employee table query here> ) employees
with the mere table name employees
.
For readability you may prefer WITH
clauses:
with recursive dates (date) as ( <date sequence query here> )
, employees as ( <employee table query here> )
, visits as ( <visits table query here> )
select
employees.employee_id,
dates.date,
visits.total,
visits.not_visited,
...
from dates
cross join employees
left outer join visits
on visits.date = dates.date
and visits.employee_id = employees.employee_id
order by employees.employee_id, dates.date;
You have mentioned that your table is quite big. I'd suggest the following index for this query:
create index idx on visiting_addresses (date, employee_id, status);