I am relatively new to Stata and I currently have a Reddit dataset in cross-sectional format with each row representing a given Reddit post by a username, and with some usernames posting several times per day while others post only once/twice in the entire dataset.
* Example generated by -dataex-. For more info, type help dataex
clear
input float id str36 username int date
6 "(crash )" 19013
end
format %td date
I am interested in running a Heckman selection model, so I am trying to convert the data into a panel format. I created an ID variable per username as shown below:
egen id = group(username)
I then ran this to declare the data as a panel, following the guideline here:
xtset id date
I am receiving the following error: "repeated time values within panel". I am not sure how to solve this, because I believe it is not a problem in my case: it is typical for social media users to post several times within the same day, which is my time unit in this dataset.
If I run the same code without the date variable, it works without any errors, but my understanding is that I need to use both variables for a panel format.
You could use a timestamp to handle this. There is usually one available in session data. Just make sure to store it as a double:
. clear
. input byte id int date double ts
id date ts
1. 1 0 0
2. 1 0 1000
3. 1 0 2000
4. end
. format %td date
. format %tc ts
. list, clean noobs
id date ts
1 01jan1960 01jan1960 00:00:00
1 01jan1960 01jan1960 00:00:01
1 01jan1960 01jan1960 00:00:02
. xtset id ts
Panel variable: id (strongly balanced)
Time variable: ts, 01jan1960 00:00:00 to 01jan1960 00:00:02, but with gaps
Delta: .001 seconds
. xtset id date
repeated time values within panel
r(451);
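If the raw data carry a Unix epoch timestamp rather than a Stata datetime, the conversion might look like the sketch below. The variable name created_utc and the sample values are assumptions for illustration and are not taken from the posted dataset. The idea is that %tc values count milliseconds since 01jan1960, so epoch seconds need to be rescaled and shifted.

```stata
* Hypothetical input: created_utc holds each post's Unix epoch time in seconds
clear
input byte id double created_utc
1 1327104000
1 1327107600
2 1327104000
end

* %tc counts milliseconds since 01jan1960 00:00:00, so convert seconds to
* milliseconds and add the offset between the 1960 and 1970 origins
generate double ts = created_utc*1000 + clock("01jan1970 00:00:00", "DMYhms")
format %tc ts

* Posts by the same user at different seconds no longer collide
xtset id ts
```

If two posts by the same user share the exact same second, xtset will still report repeated time values, so this only helps when the timestamps are distinct within each user.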
Alternatively, collapse to user x date level if your analysis permits it.
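A minimal sketch of the collapse route, assuming a post count per user and day is an acceptable summary for your analysis. The name n_posts and the sample rows are made up for illustration.

```stata
* Toy data: user 1 posts twice on the same day, which is what breaks xtset
clear
input byte id int date
1 19013
1 19013
1 19014
2 19013
end
format %td date

* Reduce to one row per user-day; n_posts counts the posts collapsed into it
generate byte one = 1
collapse (sum) n_posts = one, by(id date)

* With unique id-date pairs, xtset no longer errors
xtset id date
```

Note that collapsing discards post-level variables unless you summarize them explicitly in the collapse call, so this is only suitable if the model does not need individual posts.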