I have a large (circa 9500 rows) untidy dataset containing country names along with several variables and numerical output. I've made an example of the data frame as such:
country1 <- c("Arab World", "Caribbean small states", "Central Europe and the Baltics", "Australia", "Brazil", "Sweden")
indicator1 <- c("Age at first marriage, female", "Age at first marriage, male", "Birth rate, crude (per 1,000 people)", "Death rate, crude (per 1,000 people)", "Fertility rate, total (births per woman)", "Hospital beds (per 1,000 people)")
year1 <- c(1960,1961,1962,1963,1964,1965)
test <- data.frame(country=country1, indicator=indicator1, year=year1)
I need to extract a smaller data frame from this, that is filtered by only country names, e.g. "Sweden" and does not include agglomerations of countries, e.g. "Central Europe".
Would appreciate any assistance in this matter. I am quite new to R so not really sure where to begin, but I would imagine that I would first need to create a new data frame containing rows of all possible country names and then do a left join with my above test data frame. How would I go about getting that initial df of all countries?
Thanks.
You can either create your own list of valid country names or try extracting one from the {maps} package:
library(maps)
x <- map("world", plot = FALSE)
country_list <- x$names
I'd recommend manually inspecting to see if this list is up to date enough for your data.
Then subset based on this list of countries:
test_countries <- test[test$country %in% country_list, ]
which gives:
country indicator year
4 Australia Death rate, crude (per 1,000 people) 1963
5 Brazil Fertility rate, total (births per woman) 1964
6 Sweden Hospital beds (per 1,000 people) 1965