
Pattern matching in a dataframe


I am having trouble with pattern matching within a data frame. I am using the grepl() function in R.

I have a data frame of 5 local districts over two years (2000 and 2001). I want to check whether each district's ruling party or parties align with the party or parties ruling nationally.

My model should assume that there is alignment when at least one of the nationally ruling parties also rules at the district level.

My initial data looks like this:

df.1 <- data.frame(district= rep(c(1000:1004), times=2),
                  district.party= rep(c("PartyA-PartyB", "PartyA", "PartyB", "PartyC", "PartyA-PartyC"), times=2),
                  year= rep(2000:2001, each=5),
                  national.party= rep(c("PartyA|PartyB", "PartyA"), each=5))

> df.1
   district district.party year national.party
1      1000  PartyA-PartyB 2000  PartyA|PartyB
2      1001         PartyA 2000  PartyA|PartyB
3      1002         PartyB 2000  PartyA|PartyB
4      1003         PartyC 2000  PartyA|PartyB
5      1004  PartyA-PartyC 2000  PartyA|PartyB
6      1000  PartyA-PartyB 2001         PartyA
7      1001         PartyA 2001         PartyA
8      1002         PartyB 2001         PartyA
9      1003         PartyC 2001         PartyA
10     1004  PartyA-PartyC 2001         PartyA

Ideally, I want my new data frame to look like this:

df.1.neat <- data.frame(district= rep(c(1000:1004), times=2),
                  district.party= rep(c("PartyA-PartyB", "PartyA", "PartyB", "PartyC", "PartyA-PartyC"), times=2),
                  year= rep(2000:2001, each=5),
                  national.party= rep(c("PartyA|PartyB", "PartyA"), each=5),
                  alignment= c("TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE"))

> df.1.neat
   district district.party year national.party alignment
1      1000  PartyA-PartyB 2000  PartyA|PartyB      TRUE
2      1001         PartyA 2000  PartyA|PartyB      TRUE
3      1002         PartyB 2000  PartyA|PartyB      TRUE
4      1003         PartyC 2000  PartyA|PartyB     FALSE
5      1004  PartyA-PartyC 2000  PartyA|PartyB      TRUE
6      1000  PartyA-PartyB 2001         PartyA      TRUE
7      1001         PartyA 2001         PartyA      TRUE
8      1002         PartyB 2001         PartyA     FALSE
9      1003         PartyC 2001         PartyA     FALSE
10     1004  PartyA-PartyC 2001         PartyA      TRUE

I am using grepl and dplyr:

df.1.neat.OP <- df.1 %>% 
  mutate(alignment = grepl(national.party, district.party))

> df.1.neat.OP
   district district.party year national.party alignment
1      1000  PartyA-PartyB 2000  PartyA|PartyB      TRUE
2      1001         PartyA 2000  PartyA|PartyB      TRUE
3      1002         PartyB 2000  PartyA|PartyB      TRUE
4      1003         PartyC 2000  PartyA|PartyB     FALSE
5      1004  PartyA-PartyC 2000  PartyA|PartyB      TRUE
6      1000  PartyA-PartyB 2001         PartyA      TRUE
7      1001         PartyA 2001         PartyA      TRUE
8      1002         PartyB 2001         PartyA      TRUE
9      1003         PartyC 2001         PartyA     FALSE
10     1004  PartyA-PartyC 2001         PartyA      TRUE

Note how my command works well for the year 2000 but computes the wrong outcome for district 1002 in 2001. There are loads of mistakes like this in my wider data frame.

Any suggestions?


Solution

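  • grepl() is not the right function for this use case: it is not vectorized over its pattern argument, so when given a column of patterns it uses only the first element ("PartyA|PartyB" here) for every row. A tidyverse solution using stringr::str_detect(), which is vectorized over both the string and the pattern: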

    library(dplyr)
    library(stringr)
    
    df.1 <- data.frame(district = rep(c(1000:1004), times=2),
                       district.party = rep(c("PartyA-PartyB", "PartyA", "PartyB", "PartyC", "PartyA-PartyC"), times=2),
                       year = rep(2000:2001, each=5),
                       national.party = rep(c("PartyA|PartyB", "PartyA"), each=5))
    
    df.1.neat <- df.1 %>%
      mutate(alignment = str_detect(district.party, national.party))
    
    df.1.neat
    #    district district.party year national.party alignment
    # 1      1000  PartyA-PartyB 2000  PartyA|PartyB      TRUE
    # 2      1001         PartyA 2000  PartyA|PartyB      TRUE
    # 3      1002         PartyB 2000  PartyA|PartyB      TRUE
    # 4      1003         PartyC 2000  PartyA|PartyB     FALSE
    # 5      1004  PartyA-PartyC 2000  PartyA|PartyB      TRUE
    # 6      1000  PartyA-PartyB 2001         PartyA      TRUE
    # 7      1001         PartyA 2001         PartyA      TRUE
    # 8      1002         PartyB 2001         PartyA     FALSE
    # 9      1003         PartyC 2001         PartyA     FALSE
    # 10     1004  PartyA-PartyC 2001         PartyA      TRUE
    

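    Alternatively, to make grepl() work, apply it one row at a time with rowwise(), so that each call receives a single pattern: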

    df.1.neat <- df.1 |>
      rowwise() |>
      mutate(alignment = grepl(national.party, district.party)) |>
      ungroup()
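
To see why the original attempt went wrong, and as a base R alternative to the two approaches above, here is a minimal sketch. It assumes the same toy data as in the question; the names first.only and alignment.base are made up for illustration. The first call reproduces what grepl() effectively does with a column of patterns (only the first pattern is used), and the second pairs each pattern with its own row via mapply().

```r
# Same toy data as in the question (stringsAsFactors = FALSE keeps the
# party columns as character vectors on older R versions too).
df.1 <- data.frame(
  district       = rep(1000:1004, times = 2),
  district.party = rep(c("PartyA-PartyB", "PartyA", "PartyB", "PartyC", "PartyA-PartyC"), times = 2),
  year           = rep(2000:2001, each = 5),
  national.party = rep(c("PartyA|PartyB", "PartyA"), each = 5),
  stringsAsFactors = FALSE
)

# What the original mutate() call effectively computed: every row is
# matched against the FIRST pattern only, "PartyA|PartyB".
first.only <- grepl(df.1$national.party[1], df.1$district.party)

# mapply() calls grepl(pattern, x) once per row, pairing the i-th
# national.party with the i-th district.party.
alignment.base <- mapply(grepl, df.1$national.party, df.1$district.party,
                         USE.NAMES = FALSE)

df.1$alignment <- alignment.base
```

Here first.only[8] is TRUE (district 1002 in 2001, the wrong result from the question), while alignment.base[8] is FALSE because that row is matched against "PartyA" only.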