rdplyrtext-miningcountrycity

Find city names within affiliations and add them with their corresponding countries in new columns of a dataframe


I have a dataframe ‘dfa’ of affiliations that contains city names, for which the country is sometimes missing, e.g. like rows 4 (BAGHDAD) and 7 (BERLIN):

dfa <- data.frame(affiliation=c("DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS",
                                "DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.",
                                "DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES",
                                "COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.",
                                "DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.",
                                "LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.",
                                "DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN",
                                "INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.",
                                "DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.",
                                "DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN",
                                "DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.",
                                "LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA."))

I have now a second dataframe ‘dfb’ that contains a list of cities and corresponding country, some of which are present in 'dfa':

dfb <- data.frame(city=c("AGRI","AMSTERDAM","ATHENS","AUCKLAND","BUENOS AIRES","BEIJING","BAGHDAD","BANGKOK","BERLIN","BUDAPEST"),
                  country=c("TURKEY","NETHERLANDS","GREECE","NEW ZEALAND","ARGENTINA","CHINA","IRAQ","THAILAND","GERMANY","HUNGARY"))

How can I add cities and corresponding countries in two new columns only for cities that are present in both ‘dfa’ and ‘dfb’ (and even when the country is missing, as for BAGHDAD and BERLIN)?

NB: the goal is to add full city names but not part of them. Below in row 7, an example of what is not wanted: the AGRI city of TURKEY is inappropriately associated with BERLIN because this row includes the 'AGRICULTURE' word.

Is there a simple way to do that, ideally using dplyr?

    affiliation      city     country
1      DEPARTMENT OF PHARMACY, AMSTERDAM UNIVERSITY, AMSTERDAM, THE NETHERLANDS AMSTERDAM NETHERLANDS
2  DEPARTMENT OF BIOCHEMISTRY, LADY HARDINGE MEDICAL COLLEGE, NEW DELHI, INDIA.      <NA>        <NA>
3      DEPARTMENT OF PATHOLOGY, CHILDREN'S HOSPITAL, LOS ANGELES, UNITED STATES      <NA>        <NA>
4                 COLLEGE OF EDUCATION FOR PURE SCIENCE, UNIVERSITY OF BAGHDAD.   BAGHDAD        IRAQ
5  DEPARTMENT OF CLINICAL LABORATORY, BEIJING GENERAL HOSPITAL, BEIJING, CHINA.   BEIJING       CHINA
6           LABORATORY OF MOLECULAR BIOLOGY, ISTITUTO ORTOPEDICO, MILAN, ITALY.      <NA>        <NA>
7                 DEPARTMENT OF AGRICULTURE, BERLIN INSTITUTE OF HEALTH, BERLIN      AGRI      TURKEY
8       INSTITUTE OF LABORATORY MEDICINE, UNIVERSITY HOSPITAL, MUNICH, GERMANY.      <NA>        <NA>
9      DEPARTMENT OF CLINICAL PATHOLOGY, MAHIDOL UNIVERSITY, BANGKOK, THAILAND.   BANGKOK    THAILAND
10                       DEPARTMENT OF BIOLOGY, WASEDA UNIVERSITY, TOKYO, JAPAN      <NA>        <NA>
11           DEPARTMENT OF MOLECULAR BIOLOGY, MINISTRY OF HEALTH, TEHRAN, IRAN.      <NA>        <NA>
12        LABORATORY OF CARDIOVASCULAR DISEASE, FUWAI HOSPITAL, BEIJING, CHINA.   BEIJING       CHINA

Solution

  • This approach accounts for the possibility that an affiliation could match more than one city name.

    library(tidyverse)
    
    dfa %>% 
      mutate(city = map(affiliation, ~ str_extract(.x, dfb$city))) %>% 
      unnest(cols = c(city)) %>% 
      group_by(affiliation) %>% 
      mutate(nmatches = sum(!is.na(city))) %>% 
      filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
      ungroup() %>% 
      left_join(dfb, by = "city") %>% 
      mutate(country_match = str_detect(affiliation, country))
    
    # A tibble: 12 x 5
       affiliation              city   nmatches country country_match
       <chr>                    <chr>     <int> <chr>   <lgl>        
     1 DEPARTMENT OF PHARMACY,… AMSTE…        1 NETHER… TRUE         
     2 DEPARTMENT OF BIOCHEMIS… NA            0 NA      NA           
     3 DEPARTMENT OF PATHOLOGY… NA            0 NA      NA           
     4 COLLEGE OF EDUCATION FO… BAGHD…        1 IRAQ    FALSE        
     5 DEPARTMENT OF CLINICAL … BEIJI…        1 CHINA   TRUE         
     6 LABORATORY OF MOLECULAR… NA            0 NA      NA           
     7 BERLIN INSTITUTE OF HEA… BERLIN        1 GERMANY FALSE        
     8 INSTITUTE OF LABORATORY… NA            0 NA      NA           
     9 DEPARTMENT OF CLINICAL … BANGK…        1 THAILA… TRUE         
    10 DEPARTMENT OF BIOLOGY, … NA            0 NA      NA           
    11 DEPARTMENT OF MOLECULAR… NA            0 NA      NA           
    12 LABORATORY OF CARDIOVAS… BEIJI…        1 CHINA   TRUE   
    

    You could then double-check cases with 1 nmatches but country_match == F, and when there are 2 or more nmatches you can keep the one with country_match == T.