I have a dataframe mydf like this:
| Country | Year |
| ---------- | ---- |
| Bahamas | 1982 |
| Chile | 1817 |
| Cuba | 1960 |
| Finland | 1918 |
| Kazakhstan | 1993 |
etc., with many more rows.
Is there an easy way to plot the cumulative number of unique countries over time? In other words, the x-axis would be Year (a timeline), and the y-axis would be the cumulative count of countries. I tried stat_ecdf(), but the y-axis does not show the absolute count of countries:
ggplot(mydf, aes(x = Year)) + stat_ecdf()
This is an example of mydf:
> dput(mydf)
structure(list(Country = c("Moldova", "Aragon", "Abu Dhabi",
"Uzbekistan", "Sweden", "Anhalt", "Saudi Arabia", "Montenegro",
"Central African Republic", "Bulgaria", "Argentina", "Senegal",
"Sri Lanka", "Cambodia", "Benin", "Colombia", "Algeria", "Iraq",
"DPRK", "Italy"), Year = c(1992L, 1223L, 1966L, 1993L, 1748L,
1835L, 1955L, 1841L, 1959L, 1993L, 1806L, 1960L, 1955L, 1995L,
1892L, 1914L, 1981L, 1958L, 1948L, 1900L)), row.names = c(NA,
-20L), class = c("data.table", "data.frame"))
Give the countries an ID number based on first appearance, and then the cumulative count is the same as the cumulative max of that ID:
mydf = mydf[order(mydf$Year, mydf$Country), ]
mydf$country_id = as.integer(factor(mydf$Country, levels = unique(mydf$Country)))
mydf$cum_n_country = cummax(mydf$country_id)
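To see why the cumulative max works, here is a small sketch on a made-up data frame in which one country appears twice. The `toy` name and its values are illustrative only and are not from the question.

```r
# Hypothetical toy data: "Chile" appears in two different years
toy <- data.frame(
  Country = c("Cuba", "Chile", "Chile", "Finland"),
  Year    = c(1960L, 1817L, 1918L, 1918L)
)

# Sort by year so IDs are assigned in order of first appearance
toy <- toy[order(toy$Year, toy$Country), ]

# unique() keeps first-appearance order, so a repeated country reuses its earlier ID
toy$country_id <- as.integer(factor(toy$Country, levels = unique(toy$Country)))

# The running max only increases when a previously unseen country shows up
toy$cum_n_country <- cummax(toy$country_id)

print(toy)
```

The second Chile row reuses ID 1, so the running max stays flat there and only rises for Finland and Cuba.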
If years are repeated, you'll need to aggregate/summarize the max cum_n_country by year.
library(dplyr)
library(ggplot2)
mydf %>%
  group_by(Year) %>%
  summarize(cum_n_country = max(cum_n_country)) %>%
  ggplot(aes(x = Year, y = cum_n_country)) +
  geom_line()
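An alternative that should give the same counts, offered as a sketch rather than as part of the approach above, is to flag each country's first appearance with `duplicated()` and take a running sum. The `toy` data and the `by_year` name are made up for illustration, and `geom_step()` is just one possible choice of geom.

```r
library(ggplot2)

# Hypothetical toy data with a repeated country and a repeated year
toy <- data.frame(
  Country = c("Cuba", "Chile", "Chile", "Finland"),
  Year    = c(1960L, 1817L, 1918L, 1918L)
)
toy <- toy[order(toy$Year, toy$Country), ]

# TRUE only the first time a country is seen; the running sum counts unique countries
toy$cum_n_country <- cumsum(!duplicated(toy$Country))

# Keep one value per year: the largest running count reached in that year
by_year <- aggregate(cum_n_country ~ Year, data = toy, FUN = max)

# A step line reflects that the count stays flat between years
ggplot(by_year, aes(x = Year, y = cum_n_country)) +
  geom_step()
```

This skips the intermediate ID column; whether it reads more clearly than the cumulative max is a matter of taste.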