I have a dataframe ‘dfa’ of affiliations that contains city names, for which the country is sometimes missing, e.g. like rows 4 (BAGHDAD) and 7 (BERLIN):
dfa <- data.frame(affiliation=c("DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
"DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.",
"DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES",
"COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
"DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.",
"LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.",
"DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
"INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.",
"DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.",
"DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN",
"DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.",
"LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA."))
I have now a second dataframe ‘dfb’ that contains a list of cities and corresponding country, some of which are present in 'dfa':
dfb <- data.frame(city=c("AGRI","AMSTERDAM","ATHENS","AUCKLAND","BUENOS AIRES","BEIJING","BAGHDAD","BANGKOK","BERLIN","BUDAPEST"),
country=c("TURKEY","NETHERLANDS","GREECE","NEW ZEALAND","ARGENTINA","CHINA","IRAQ","THAILAND","GERMANY","HUNGARY"))
How can I add cities and corresponding countries in two new columns only for cities that are present in both ‘dfa’ and ‘dfb’ (and even when the country is missing, as for BAGHDAD and BERLIN)?
NB: the goal is to add full city names but not part of them. Below in row 7, an example of what is not wanted: the AGRI city of TURKEY is inappropriately associated with BERLIN because this row includes the 'AGRICULTURE' word.
Is there a simple way to do that, ideally using dplyr?
affiliation city country
1 DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN AGRI TURKEY
8 INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA. BEIJING CHINA
This approach accounts for the possibility that an affiliation could match more than one city name.
library(tidyverse)
dfa %>%
mutate(city = map(affiliation, ~ str_extract(.x, dfb$city))) %>%
unnest(cols = c(city)) %>%
group_by(affiliation) %>%
mutate(nmatches = sum(!is.na(city))) %>%
filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
ungroup() %>%
left_join(dfb, by = "city") %>%
mutate(country_match = str_detect(affiliation, country))
# A tibble: 12 x 5
affiliation city nmatches country country_match
<chr> <chr> <int> <chr> <lgl>
1 DEPARTMENT OF PHARMACY,… AMSTE… 1 NETHER… TRUE
2 DEPARTMENT OF BIOCHEMIS… NA 0 NA NA
3 DEPARTMENT OF PATHOLOGY… NA 0 NA NA
4 COLLEGE OF EDUCATION FO… BAGHD… 1 IRAQ FALSE
5 DEPARTMENT OF CLINICAL … BEIJI… 1 CHINA TRUE
6 LABORATORY OF MOLECULAR… NA 0 NA NA
7 BERLIN INSTITUTE OF HEA… BERLIN 1 GERMANY FALSE
8 INSTITUTE OF LABORATORY… NA 0 NA NA
9 DEPARTMENT OF CLINICAL … BANGK… 1 THAILA… TRUE
10 DEPARTMENT OF BIOLOGY, … NA 0 NA NA
11 DEPARTMENT OF MOLECULAR… NA 0 NA NA
12 LABORATORY OF CARDIOVAS… BEIJI… 1 CHINA TRUE
You could then double-check cases with 1 nmatches
but country_match == F
, and when there are 2 or more nmatches
you can keep the one with country_match == T
.