I am trying to run the following model of GDP on population:
GDP_(i,t) = alpha + beta*Population_(i,t) + epsilon
Here, each variable is indexed by time (t) and country (i).
I have a panel dataset df1 in a wide format like this:
UK_gdp <- c(4.1, 4.2, 3.8, 4.0)
US_gdp <- c(4.1, 4.2, 3.8, 4.0)
US_pop <- c(220, 230, 240, 260)
UK_pop <- c(40, 45, 47, 49)
year <- c("1965-01-01", "1966-01-01", "1967-01-01", "1968-01-01")
df1 <- tibble(UK_gdp, US_gdp, US_pop, UK_pop, year)
I would like to run the above regression using the columns UK_gdp and US_gdp for the GDP_(i,t) variable, and the columns UK_pop and US_pop as data for the Population_(i,t) variable. Is there any way to use the data for both countries in one regression? I do not want to run separate regressions for each country, but instead include all the data in the model when I run the regression. I am not sure how to do this.
First, you want to reshape your data into long format.
> df1$year <- strftime(df1$year, '%Y')  ## keeps just the year from the date
> ## each vector in `varying` must list its columns in the same country order as `times`
> df1_l <- reshape(df1, varying=list(c("UK_gdp", "US_gdp"), c("UK_pop", "US_pop")),
+                  v.names=c('gdp', 'pop'), times=c('UK', 'US'), timevar='country',
+                  idvar='year', direction='long') |> `rownames<-`(NULL)
> df1_l
  year country gdp pop
1 1965      UK 4.1  40
2 1966      UK 4.2  45
3 1967      UK 3.8  47
4 1968      UK 4.0  49
5 1965      US 4.1 220
6 1966      US 4.2 230
7 1967      US 3.8 240
8 1968      US 4.0 260
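Since the question builds df1 as a tibble, the same reshape can also be sketched with tidyr::pivot_longer(). This assumes the tidyr package is installed and that every column other than year is named <country>_<variable>; the name df1_long is only illustrative. The special ".value" entry tells pivot_longer() to take the part after the underscore (gdp, pop) as the new column names, so the order of the columns in df1 does not matter.

```r
library(tidyr)

# Wide example data from the question
df1 <- data.frame(
  UK_gdp = c(4.1, 4.2, 3.8, 4.0),
  US_gdp = c(4.1, 4.2, 3.8, 4.0),
  US_pop = c(220, 230, 240, 260),
  UK_pop = c(40, 45, 47, 49),
  year   = c("1965-01-01", "1966-01-01", "1967-01-01", "1968-01-01")
)

# Keep only the year part of the date string
df1$year <- substr(df1$year, 1, 4)

# Split each column name at "_": the first piece becomes `country`,
# the second piece (".value") names the output columns gdp and pop
df1_long <- pivot_longer(
  df1,
  cols      = -year,
  names_to  = c("country", ".value"),
  names_sep = "_"
)

print(df1_long)
```

The result has one row per country and year, with columns year, country, gdp and pop, and can be passed to lm() or felm() in place of df1_l.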
The equation you show is actually just an OLS regression, pooled across all entities and time periods:
GDP(i,t) = alpha + beta * Population(i,t) + epsilon
> fit1 <- lm(gdp ~ pop, df1_l)
> summary(fit1)$coefficients
                 Estimate   Std. Error    t value     Pr(>|t|)
(Intercept)  4.0338312297 0.1068459675 37.7537059 2.305458e-08
pop         -0.0000624667 0.0006237552 -0.1001462 9.234907e-01
However, a better idea might be to include "country" as a fixed effect, i.e. to give each country its own intercept alpha_i:
GDP(i,t) = alpha_i + beta * Population(i,t) + epsilon(i,t)
> fit2 <- lfe::felm(gdp ~ pop | country, df1_l)
> summary(fit2)$coefficients
        Estimate  Std. Error    t value  Pr(>|t|)
pop -0.005082903 0.005734687 -0.8863435 0.4160214
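To see what the country fixed effect does, the same slope can be reproduced in base R, either by adding country dummies to lm() or by subtracting each country's mean from gdp and pop before regressing (the "within" transformation). This is only a sketch on the example data; the names long, fit_lsdv and fit_within are made up for the illustration.

```r
# Long-format example data (same values as in the question)
long <- data.frame(
  year    = rep(1965:1968, times = 2),
  country = rep(c("UK", "US"), each = 4),
  gdp     = rep(c(4.1, 4.2, 3.8, 4.0), times = 2),
  pop     = c(40, 45, 47, 49, 220, 230, 240, 260)
)

# (a) Dummy-variable version: one intercept per country
fit_lsdv <- lm(gdp ~ pop + factor(country), data = long)

# (b) Within version: demean gdp and pop by country,
#     then regress without an intercept
long$gdp_dm <- long$gdp - ave(long$gdp, long$country)
long$pop_dm <- long$pop - ave(long$pop, long$country)
fit_within <- lm(gdp_dm ~ 0 + pop_dm, data = long)

# Both slopes agree with the felm() estimate for pop
coef(fit_lsdv)["pop"]
coef(fit_within)["pop_dm"]
```

The standard error reported by version (b) is too small, because lm() does not know that the two country means were estimated first; version (a) and felm() account for those degrees of freedom.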
Since the error terms within each country are probably correlated over time, you should probably use standard errors clustered by country:
> fit3 <- lfe::felm(gdp ~ pop | country | 0 | country, df1_l)
> summary(fit3)$coefficients
        Estimate Cluster s.e.  t value  Pr(>|t|)
pop -0.005082903  0.001638335 -3.10248 0.1985035
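As a rough illustration of what clustering by country does, the sketch below computes the cluster-robust standard error of the within estimator by hand in base R. The small-sample factor G/(G-1) * (N-1)/(N-K) with K = 2 is an assumption, chosen because it reproduces the value printed above; other software may scale differently. All object names are hypothetical.

```r
# Long-format example data (same values as in the question)
long <- data.frame(
  country = rep(c("UK", "US"), each = 4),
  gdp     = rep(c(4.1, 4.2, 3.8, 4.0), times = 2),
  pop     = c(40, 45, 47, 49, 220, 230, 240, 260)
)

# Within transformation: remove country means
y <- long$gdp - ave(long$gdp, long$country)
x <- long$pop - ave(long$pop, long$country)

beta <- sum(x * y) / sum(x^2)   # fixed-effects slope
e    <- y - beta * x            # residuals

# Sum the scores x * e within each country, then square and add up
score_by_country <- tapply(x * e, long$country, sum)
se_raw <- sqrt(sum(score_by_country^2)) / sum(x^2)

# Assumed small-sample correction (not taken from the felm documentation)
N <- nrow(long)
G <- length(score_by_country)
K <- 2
se_cluster <- se_raw * sqrt(G / (G - 1) * (N - 1) / (N - K))

c(beta = beta, se_raw = se_raw, se_cluster = se_cluster)
```

With only two clusters, any clustered standard error is very imprecise, so the numbers here only demonstrate the mechanics.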
Finally, there might also be time effects common to all countries (year fixed effects), which would look like this (this does not work with this small example dataset):
GDP(i,t) = alpha_i + gamma_t + beta * Population(i,t) + epsilon(i,t)
> fit4 <- lfe::felm(gdp ~ pop | country + year | 0 | country, df1_l)
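One plausible reason the two-way model is uninformative here is that UK_gdp and US_gdp are identical in the example data, so year effects alone already explain gdp perfectly and nothing is left for pop to explain. The base-R check below illustrates this; it is a diagnostic sketch rather than part of the felm() workflow, and the object names are hypothetical.

```r
# Long-format example data (same values as in the question)
long <- data.frame(
  year    = rep(1965:1968, times = 2),
  country = rep(c("UK", "US"), each = 4),
  gdp     = rep(c(4.1, 4.2, 3.8, 4.0), times = 2),
  pop     = c(40, 45, 47, 49, 220, 230, 240, 260)
)

# Regress gdp on year dummies only
fit_year <- lm(gdp ~ factor(year), data = long)

# Residuals are zero up to floating-point error:
# the year effects absorb all variation in gdp
max_abs_resid <- max(abs(residuals(fit_year)))
max_abs_resid < 1e-10
```

With real data, where GDP differs across countries within a year, the year effects would not absorb everything and the model could be estimated.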
Data:
> dput(df1)
structure(list(UK_gdp = c(4.1, 4.2, 3.8, 4), US_gdp = c(4.1,
4.2, 3.8, 4), US_pop = c(220, 230, 240, 260), UK_pop = c(40,
45, 47, 49), year = c("1965-01-01", "1966-01-01", "1967-01-01",
"1968-01-01")), class = "data.frame", row.names = c(NA, -4L))