I'm working with a dataset that contains hourly data, and I'm having trouble interpreting how the hours are represented. The dataset has two columns: one for the date and one for the hour. Here's an example of how the data is structured:
Date | Hour |
---|---|
2023-01-01 | 1 |
2023-01-01 | 2 |
2023-01-01 | 3 |
... | ... |
2023-01-01 | 24 |
I need to understand whether "2023-01-01 hour 1" refers to:
I haven't found the answer to this question anywhere, so I would like to start a discussion and find out what your interpretation and good practices are in this matter
Any insights or suggestions on how to approach this issue would be greatly appreciated!
Is there a standard convention for representing hourly data in such tables?
The idea that a value equals to itself is kinda standard. So if there is no contradictory evidence, then the null hypothesis is that 1 means 1:00, 2 means 2:00, ... 24 means 00:00
H0 or the null hypothesis is the hypothesis you are to make if you have to choose a hypothesis and you have no evidence that would debunk any of the possibilities, nor arguments that would change their likelihood.
So, keep in mind that we need to assume that 24 is by default meaning 00:00 unless something suggests otherwise. But you will need to investigate (and you already did it to some extent) in order to be sure.
As you also suggested, documentation is a great place to find any evidence that would shift your position towards a direction. You also stated that the documentation did not say what the 24 means. But you can look for secondary evidence. For example, if your documentation says that in order to sort the times you need to run something like
select `Date`, `Hour`
from t
order by `Date`, `Hour`
and you have Hour
values of 24, then, since 24 > 23, it will become clear that 24 is after 23, so 24 must mean the 24th hour, that is, 23:00 and consequently each hour means one less than itself.
Note that I'm not saying that this is the actual case, I'm just saying that if the hours are numerically sorted to sort by time, then 24 means 00:00.
Look into your application code, look how the hours are ordered, look at what comments are there, whether the code suggests that an hour is at the start of the day. Also look into the way data is generated. If you have something like:
insert into t(`Date`, `Hour`)
values
('2024-08-08', 1),
('2024-08-08', 2),
('2024-08-08', 3),
('2024-08-08', 4),
('2024-08-08', 5),
('2024-08-08', 6),
('2024-08-08', 7),
('2024-08-08', 8),
('2024-08-08', 9),
('2024-08-08', 10),
('2024-08-08', 11),
('2024-08-08', 12),
('2024-08-08', 13),
('2024-08-08', 14),
('2024-08-08', 15),
('2024-08-08', 16),
('2024-08-08', 17),
('2024-08-08', 18),
('2024-08-08', 19),
('2024-08-08', 20),
('2024-08-08', 21),
('2024-08-08', 22),
('2024-08-08', 23),
('2024-08-08', 24);
then the 24 most likely means 23:00. However, if you have:
insert into t(`Date`, `Hour`)
values
('2024-08-08', 24),
('2024-08-08', 1),
('2024-08-08', 2),
('2024-08-08', 3),
('2024-08-08', 4),
('2024-08-08', 5),
('2024-08-08', 6),
('2024-08-08', 7),
('2024-08-08', 8),
('2024-08-08', 9),
('2024-08-08', 10),
('2024-08-08', 11),
('2024-08-08', 12),
('2024-08-08', 13),
('2024-08-08', 14),
('2024-08-08', 15),
('2024-08-08', 16),
('2024-08-08', 17),
('2024-08-08', 18),
('2024-08-08', 19),
('2024-08-08', 20),
('2024-08-08', 21),
('2024-08-08', 22),
('2024-08-08', 23);
then 24 most likely means 00:00.
Ask around people who might know
If you have formulas that convert timezones, then you can often infer the meaning of the input from the formula.
If no further information, choose H0, which is that 1 = 1 is true. But look for contradictory evidence, try to debunk this hypothesis. If you fail to debunk it, then it is the best bet. If you manage to debunk it, then choose the better hypothesis instead.