I have a huge data in this form and with more other columns. So I have a list of people working in a country in 2011 and moved to another one in 2012.
Name Work_{2011} Work_{2012} Wage_{2011} Wage_{2012}
Jack US UK 5387 35353
Bill US UK 43534 5343
Emma US FRANCE 34534 53455
Brand US FRANCE 64545 1343
Luigui US FRANCE 15343 3144
Ella US FRANCE 64545 1343
Lucie France SPAIN 84545 1343
Maria France SPAIN 984545 1343
Grec Italy US 4545 1343
I want to keep the observations having the biggest share of destination for each departure. I want:
Name Work_{2011} Work_{2012} Wage_{2011} Wage_{2012}
Emma US FRANCE 34534 53455
Brand US FRANCE 64545 1343
Luigui US FRANCE 15343 3144
Ella US FRANCE 64545 1343
Lucie France SPAIN 84545 1343
Maria France SPAIN 984545 1343
Grec Italy US 4545 1343
I'm not 100% sure this will meet your needs, but perhaps it will be helpful for you. It might help to know more details about your data, including how large your dataset is, how your columns are organized by year, etc.
In this example, you can use dplyr
from tidyverse
. First, you can group_by
Work_2011
(I removed the braces from column names), and filter
where the number of distinct values for Work_2012
is greater than 1. This would imply multiple destinations.
Second, you can group_by
both Work_2011
and Work_2012
to determine the number of countries for each destination. This will be helpful in a second filter
.
Again, please let me know if this is the direction you were interested in.
library(dplyr)
df %>%
group_by(Work_2011) %>%
filter(n_distinct(Work_2012) > 1) %>%
group_by(Work_2011, Work_2012) %>%
mutate(numctry = n()) %>%
group_by(Work_2011) %>%
filter(numctry == max(numctry))
Output
Name Work_2011 Work_2012 numctry
<chr> <chr> <chr> <int>
1 Emma US FRANCE 4
2 Brand US FRANCE 4
3 Luigui US FRANCE 4
4 Ella US FRANCE 4
Edit (1/13/21): Based on edited question, we can simplify the code further.
Start by calculating the number of destinations per country, we'll call this dest_per_cntry
. This will be a new column. For Jack and Bill, it will be 2. For Emma, Brand, Luigui, and Ella, it will be 4.
Then, you can group_by
to consider the 2011 country only. For each country in the Work_2011
column, keep (or filter
) only those where the dest_per_cntry
is the same as the maximum number of destinations for that country. Note that if there are "ties", all rows with the maximum destinations will still be kept.
library(tidyverse)
df %>%
group_by(Work_2011, Work_2012) %>%
mutate(dest_per_cntry = n()) %>%
group_by(Work_2011) %>%
filter(dest_per_cntry == max(dest_per_cntry))
Output
Name Work_2011 Work_2012 Wage_2011 Wage_2012 dest_per_cntry
<chr> <chr> <chr> <int> <int> <int>
1 Emma US FRANCE 34534 53455 4
2 Brand US FRANCE 64545 1343 4
3 Luigui US FRANCE 15343 3144 4
4 Ella US FRANCE 64545 1343 4
5 Lucie France SPAIN 84545 1343 2
6 Maria France SPAIN 984545 1343 2
7 Grec Italy US 4545 1343 1
Data
df <- structure(list(Name = c("Jack", "Bill", "Emma", "Brand", "Luigui",
"Ella", "Lucie", "Maria", "Grec"), Work_2011 = c("US", "US",
"US", "US", "US", "US", "France", "France", "Italy"), Work_2012 = c("UK",
"UK", "FRANCE", "FRANCE", "FRANCE", "FRANCE", "SPAIN", "SPAIN",
"US"), Wage_2011 = c(5387L, 43534L, 34534L, 64545L, 15343L, 64545L,
84545L, 984545L, 4545L), Wage_2012 = c(35353L, 5343L, 53455L,
1343L, 3144L, 1343L, 1343L, 1343L, 1343L)), class = "data.frame", row.names = c(NA,
-9L))