I have a dataset that tracks user activity, and I need to identify consecutive rows where a specific condition is met. The table structure and sample data are as follows:
UserID | Activity | Timestamp |
---|---|---|
1 | Login | 2023-11-01 08:00:00 |
1 | PageView | 2023-11-01 08:10:00 |
1 | Login | 2023-11-01 08:15:00 |
1 | Logout | 2023-11-01 08:20:00 |
2 | Login | 2023-11-01 09:00:00 |
2 | Login | 2023-11-01 09:05:00 |
2 | PageView | 2023-11-01 09:10:00 |
I want to identify cases where consecutive rows for the same user have the same Activity, along with their timestamps. For example:
For UserID = 2, there are two consecutive Login activities, so the expected output is:
UserID | Activity | StartTimestamp | EndTimestamp |
---|---|---|---|
2 | Login | 2023-11-01 09:00:00 | 2023-11-01 09:05:00 |
I have attempted using a self-join to compare each row with the previous one, but it becomes computationally expensive for large datasets. I’ve also tried using the LAG and LEAD functions, but I’m struggling to group consecutive rows efficiently.
Here is a possible solution using LAG. It assumes no more than 2 consecutive duplicates per UserID + Activity; a longer run would come back as one row per adjacent pair rather than as a single row.
WITH CTE AS
(
    -- Pair each row with the previous row of the same user, ordered by time
    SELECT UserID, Activity,
        LAG(Activity) OVER (PARTITION BY UserID ORDER BY TimeStamp) AS PrevActivity,
        LAG(TimeStamp) OVER (PARTITION BY UserID ORDER BY TimeStamp) AS StartTimeStamp,
        TimeStamp AS EndTimeStamp
    FROM Example
)
-- Keep only rows that repeat the previous activity of that user
SELECT UserID, Activity, StartTimeStamp, EndTimeStamp
FROM CTE
WHERE Activity = PrevActivity
ORDER BY StartTimeStamp;
UserID | Activity | StartTimeStamp | EndTimeStamp |
---|---|---|---|
2 | Login | 2023-11-01 09:00:00 | 2023-11-01 09:05:00 |
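If runs longer than two rows are possible, one common alternative is the "gaps and islands" technique: the difference between two ROW_NUMBER() values stays constant within a run of identical activities, so it can serve as a grouping key. The sketch below is not part of the answer above. It wraps the query in Python's sqlite3 module only so it can be run as-is, and it assumes SQLite 3.25 or later for window functions, timestamps stored as ISO-formatted text, and unique timestamps per user. The names Numbered, RunID, and find_runs are made up for illustration; the table name Example is taken from the query above.

```python
import sqlite3

# Sample rows from the question: (UserID, Activity, Timestamp).
ROWS = [
    (1, "Login", "2023-11-01 08:00:00"),
    (1, "PageView", "2023-11-01 08:10:00"),
    (1, "Login", "2023-11-01 08:15:00"),
    (1, "Logout", "2023-11-01 08:20:00"),
    (2, "Login", "2023-11-01 09:00:00"),
    (2, "Login", "2023-11-01 09:05:00"),
    (2, "PageView", "2023-11-01 09:10:00"),
]

# RunID is constant for consecutive rows of one user with the same activity:
# both row numbers advance by 1 inside a run, so their difference is fixed.
QUERY = """
WITH Numbered AS (
    SELECT UserID, Activity, Timestamp,
           ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY Timestamp)
         - ROW_NUMBER() OVER (PARTITION BY UserID, Activity ORDER BY Timestamp)
           AS RunID
    FROM Example
)
SELECT UserID, Activity,
       MIN(Timestamp) AS StartTimestamp,
       MAX(Timestamp) AS EndTimestamp
FROM Numbered
GROUP BY UserID, Activity, RunID
HAVING COUNT(*) > 1          -- keep only runs of two or more rows
ORDER BY StartTimestamp
"""


def find_runs(rows):
    """Load rows into an in-memory table and return the repeated-activity runs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Example (UserID INTEGER, Activity TEXT, Timestamp TEXT)"
    )
    conn.executemany("INSERT INTO Example VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(find_runs(ROWS))
    # [(2, 'Login', '2023-11-01 09:00:00', '2023-11-01 09:05:00')]
```

With this approach a run of three or more identical activities collapses into a single row spanning the first and last timestamp. The same SQL should carry over to other engines with window functions, though that is untested here.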