data-modeling data-warehouse sql-data-warehouse

Data warehouse design for parking lot - date and time dimensions

I came across a data warehousing practice question. I tried to think of various ways to design this but I'm not sure what the best practices are. The question deals with designing a data warehouse for a parking lot and writing a SQL query for getting the parking fees.

The constraints are as follows:

Weekday hourly rates

Two wheeler - $1

Four wheeler - $2

Weekend hourly rates

Two wheeler - $2

Four wheeler - $3

A car is parked from Friday 9am until Saturday 10am. Design a data warehouse to store this data and write a SQL query to get the parking fees for a vehicle.

I could only think of the following two ways of representing it:

Approach 1

Store a date_id, a time_id and a type (in or out) for each event. Querying the parking fees can be difficult here because the data is not at the grain of an hour. The fees are harder to calculate, but the table uses less storage.

fact_parking_lot_data

fact_key	vehicle_id	date_id	time_id	type
1	1	20220506	9	in
2	1	20220507	10	out

Approach 2

Store a date_id and a time_id for each hour the vehicle is parked. This creates multiple fact table rows per vehicle: a vehicle parked for 2 days would have 48 records. The fees are easy to calculate, but the table uses a lot of storage.

fact_parking_lot_data

fact_key	vehicle_id	date_id	time_id
1	1	20220506	9
2	1	20220506	10
3	1	20220506	11
4	1	20220506	12
.	.	.	.
.	.	.	.
.	.	.	.
26	1	20220507	10

Any thoughts or suggestions would be really appreciated. Thank you!

Solution

Your model is a clear-cut example of an accumulating snapshot table: the foreign keys to dimensions would be vehicle_id, date_in_id, date_out_id, time_in_id and time_out_id, and the measure would be the duration of parking.

When a car comes in, date_in_id and time_in_id are populated, but date_out_id, time_out_id and duration are not. When the car leaves, date_out_id, time_out_id and duration are populated.
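As a rough illustration of that lifecycle, here is a minimal sketch using Python's built-in sqlite3 module. The table name fact_parking, the duration_hours column and the two helper functions are assumptions made for the example, not something from the question; the ids follow the question's YYYYMMDD and hour-of-day convention.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_parking (
        fact_key       INTEGER PRIMARY KEY,
        vehicle_id     INTEGER NOT NULL,
        date_in_id     INTEGER NOT NULL,
        time_in_id     INTEGER NOT NULL,
        date_out_id    INTEGER,  -- NULL while the car is still parked
        time_out_id    INTEGER,  -- NULL while the car is still parked
        duration_hours INTEGER   -- filled in on exit
    )
""")


def to_datetime(date_id, time_id):
    # date_id is YYYYMMDD, time_id is the hour of day (0-23).
    # Naive local time: this ignores daylight saving transitions.
    return datetime.strptime(f"{date_id}{time_id:02d}", "%Y%m%d%H")


def record_entry(vehicle_id, date_id, time_id):
    # "In" event: insert a row with the out columns left NULL.
    conn.execute(
        "INSERT INTO fact_parking (vehicle_id, date_in_id, time_in_id) "
        "VALUES (?, ?, ?)",
        (vehicle_id, date_id, time_id),
    )


def record_exit(vehicle_id, date_id, time_id):
    # "Out" event: look up the open row for this vehicle, then update it.
    fact_key, date_in_id, time_in_id = conn.execute(
        "SELECT fact_key, date_in_id, time_in_id FROM fact_parking "
        "WHERE vehicle_id = ? AND date_out_id IS NULL",
        (vehicle_id,),
    ).fetchone()
    elapsed = to_datetime(date_id, time_id) - to_datetime(date_in_id, time_in_id)
    conn.execute(
        "UPDATE fact_parking "
        "SET date_out_id = ?, time_out_id = ?, duration_hours = ? "
        "WHERE fact_key = ?",
        (date_id, time_id, int(elapsed.total_seconds() // 3600), fact_key),
    )


record_entry(1, 20220506, 9)   # Friday 9am
record_exit(1, 20220507, 10)   # Saturday 10am
print(conn.execute("SELECT * FROM fact_parking").fetchall())
# [(1, 1, 20220506, 9, 20220507, 10, 25)]
```

The lookup plus update in record_exit is exactly the extra work an accumulating snapshot asks for on every "out" event.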

That gives you a natural metric to calculate: the total duration across a day, or several days.

The disadvantage of an accumulating snapshot is that it requires lookups on the fact table and updates on "out" events, but I'm guessing your fact table won't be too large (you don't have a location_id in your model so I'm assuming we're talking a few hundred cars per day, maybe up to a couple thousand).

If you're not happy with the accumulating snapshot then my preference between the two models you suggest goes to the 2nd one, where each hour is populated while the car is parked.
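With one row per parked hour, the fee query from the question becomes a join and a SUM. The sketch below uses Python's built-in sqlite3 module; the tables dim_date (with an is_weekend flag), dim_vehicle, rate and fact_parking_hour are assumed names for illustration. It also assumes one row per started hour, so a car leaving at exactly 10am gets no row for hour 10.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, is_weekend INTEGER NOT NULL);
    CREATE TABLE dim_vehicle (vehicle_id INTEGER PRIMARY KEY, vehicle_class TEXT NOT NULL);
    CREATE TABLE rate (vehicle_class TEXT, is_weekend INTEGER, hourly_rate INTEGER);
    CREATE TABLE fact_parking_hour (vehicle_id INTEGER, date_id INTEGER, time_id INTEGER);

    INSERT INTO dim_date VALUES (20220506, 0), (20220507, 1);  -- Friday, Saturday
    INSERT INTO dim_vehicle VALUES (1, 'four_wheeler');
    INSERT INTO rate VALUES
        ('two_wheeler', 0, 1), ('four_wheeler', 0, 2),
        ('two_wheeler', 1, 2), ('four_wheeler', 1, 3);
""")

# Populate one fact row per started hour: Friday 9am up to (not including) Saturday 10am.
hour = datetime(2022, 5, 6, 9)
end = datetime(2022, 5, 7, 10)
while hour < end:
    conn.execute(
        "INSERT INTO fact_parking_hour VALUES (?, ?, ?)",
        (1, int(hour.strftime("%Y%m%d")), hour.hour),
    )
    hour += timedelta(hours=1)

# Each fact row is billed at the rate for its vehicle class and day type.
(fee,) = conn.execute("""
    SELECT SUM(r.hourly_rate)
    FROM fact_parking_hour f
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_vehicle v ON v.vehicle_id = f.vehicle_id
    JOIN rate r        ON r.vehicle_class = v.vehicle_class
                      AND r.is_weekend = d.is_weekend
    WHERE f.vehicle_id = ?
""", (1,)).fetchone()

print(fee)  # 60 = 15 Friday hours * $2 + 10 Saturday hours * $3
```

If the fact table also carries a row for hour 10 on Saturday, as in the 26-row example in the question, the same query bills one more weekend hour and returns 63, so it is worth deciding up front whether the exit hour is billable.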

Now, a few remarks:

Your model only takes hours into account. You may want the time keys to include minutes and seconds. Even if you're not using them now, that gives you room to grow if minute or second resolution is required later (of course, your second model gets far too large at that resolution, so the accumulating snapshot becomes the obvious solution).
Beware of daylight saving time and any other possible future timezone changes. If cars can be parked overnight and during the weekend, a parking event that starts on Saturday at 8pm and ends on Sunday at 8am in fact lasts 11 or 13 hours when it spans a DST transition. Choose a timezone and stick to it.
Add a location_id. Maybe all your parking events happen at a single location now, but in the future there may be others. It's better to include the dimension now and not use it than to have to add it later.
Add a vehicle class dimension. That dimension should be an SCD Type II with attributes such as the number of wheels and the unit price. When the price changes, a new row with the new unit price is inserted, and versioning ensures history is kept.
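The last remark, the Type II vehicle class dimension, could look roughly like the sketch below (Python's built-in sqlite3 again). The column names, the 99991231 open-ended sentinel, the 20220101 start date and the price increase on 20220601 are all hypothetical; only the $2 weekday and $3 weekend four wheeler rates come from the question.

```python
import sqlite3

OPEN_ENDED = 99991231  # sentinel date_id meaning "still valid" (assumption)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_vehicle_class (
        vehicle_class_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
        class_name        TEXT NOT NULL,        -- natural key
        wheels            INTEGER NOT NULL,
        weekday_rate      INTEGER NOT NULL,
        weekend_rate      INTEGER NOT NULL,
        valid_from        INTEGER NOT NULL,     -- date_id, inclusive
        valid_to          INTEGER NOT NULL      -- date_id, exclusive
    )
""")
conn.execute(
    "INSERT INTO dim_vehicle_class "
    "(class_name, wheels, weekday_rate, weekend_rate, valid_from, valid_to) "
    "VALUES ('four_wheeler', 4, 2, 3, 20220101, ?)",
    (OPEN_ENDED,),
)


def change_rates(class_name, weekday_rate, weekend_rate, effective_date_id):
    # Type II change: close the current version, then insert a new one.
    key, wheels = conn.execute(
        "SELECT vehicle_class_key, wheels FROM dim_vehicle_class "
        "WHERE class_name = ? AND valid_to = ?",
        (class_name, OPEN_ENDED),
    ).fetchone()
    conn.execute(
        "UPDATE dim_vehicle_class SET valid_to = ? WHERE vehicle_class_key = ?",
        (effective_date_id, key),
    )
    conn.execute(
        "INSERT INTO dim_vehicle_class "
        "(class_name, wheels, weekday_rate, weekend_rate, valid_from, valid_to) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (class_name, wheels, weekday_rate, weekend_rate,
         effective_date_id, OPEN_ENDED),
    )


def rates_on(class_name, date_id):
    # Return (weekday_rate, weekend_rate) from the version valid on date_id.
    return conn.execute(
        "SELECT weekday_rate, weekend_rate FROM dim_vehicle_class "
        "WHERE class_name = ? AND valid_from <= ? AND ? < valid_to",
        (class_name, date_id, date_id),
    ).fetchone()


change_rates("four_wheeler", 3, 4, 20220601)  # hypothetical price increase
print(rates_on("four_wheeler", 20220506))     # (2, 3)
print(rates_on("four_wheeler", 20220615))     # (3, 4)
```

Because the old row is closed rather than overwritten, a parking event from before the change is still billed at the rate that applied on that date.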

A disadvantage of the accumulating snapshot vs your 2nd model (with 1 row per time period): there's no easy way to count how many cars are parked at a given time. You'll have to count rows across the entire fact table where

(date_in_id < X or (date_in_id = X and time_in_id <= Y)) 
and 
(date_out_id > X or (date_out_id = X and time_out_id > Y))

Whereas your model allows you to quickly count how many cars are parked at any given moment by just counting all those with

date_id = X and time_id = Y
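To make the comparison concrete, here is a small sketch of the snapshot-side count using Python's built-in sqlite3 module. Vehicles 2 and 3 are hypothetical rows added for illustration, and the extra date_out_id IS NULL branch is an assumption on my part: without it, cars that are still parked (with no out keys yet) would be dropped from the count, because comparisons against NULL never evaluate to true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_parking (
        vehicle_id  INTEGER,
        date_in_id  INTEGER,
        time_in_id  INTEGER,
        date_out_id INTEGER,
        time_out_id INTEGER
    );
    INSERT INTO fact_parking VALUES
        (1, 20220506,  9, 20220507, 10),  -- the stay from the question
        (2, 20220506, 14, 20220506, 16),  -- hypothetical short stay
        (3, 20220506, 20, NULL, NULL);    -- hypothetical car still parked
""")


def cars_parked(date_id, time_id):
    # A car counts if it arrived at or before (X, Y) and has either not
    # left yet or left strictly after (X, Y).
    (count,) = conn.execute("""
        SELECT COUNT(*)
        FROM fact_parking
        WHERE (date_in_id < :x OR (date_in_id = :x AND time_in_id <= :y))
          AND (date_out_id IS NULL
               OR date_out_id > :x
               OR (date_out_id = :x AND time_out_id > :y))
    """, {"x": date_id, "y": time_id}).fetchone()
    return count


print(cars_parked(20220506, 15))  # 2 (vehicles 1 and 2)
print(cars_parked(20220506, 21))  # 2 (vehicles 1 and 3)
print(cars_parked(20220507, 10))  # 1 (vehicle 3 only; vehicle 1 left at 10)
```

Every call scans the whole snapshot table, whereas the hourly model answers the same question with a plain equality filter on date_id and time_id.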