r regex string counting

Count elements in strings separated by commas, "y" (and), and "y/o" (and/or) in R, ignoring text inside parentheses


I have a dataframe that has a column that contains multiple Spanish words. What I want is to count the total number of elements that each row contains. I have the following dataframe as an example:

bd_universal <- data.frame(
  cartel = c(
    "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco",  
    "Cártel Beltran Leyva, Cártel del Pacífico",                  
    "Cártel de Sinaloa y/o Pacífico",                               
    "Leyva y/o Grupo",                                           
    "A, B, C y D",                                                 
    "Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix", 
    "A (B y C), D",                                                
    "Leyva, Mayo y Junio Agosto",                                         
    "R (T y P), S, H y/o L"
  )
)

The elements in each row are delimited by three separators: the comma (","), the word "y" before the last element ("y" is "and" in English), and "y/o" ("y/o" is "and/or" in English). I want to create a new column called "total" that counts the number of elements delimited by these separators, ignoring any separators inside parentheses. The resulting data frame would look like this:

cartel                                                                                                 total
-----------------------------------------------------------------------------------------------------  -----
Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco                                            2
Cártel Beltran Leyva, Cártel del Pacífico                                                              2
Cártel de Sinaloa y/o Pacífico                                                                         2
Leyva y/o Grupo                                                                                        2
A, B, C y D                                                                                            4
Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix  3
A (B y C), D                                                                                           2
Leyva, Mayo y Junio Agosto                                                                             3
R (T y P), S, H y/o L                                                                                  4

Does anyone know how to do this?

I have tried the following code, but it did not count the correct number of elements for each row:

bd_universal$total <- sapply(as.character(bd_universal$cartel), function(x) {

  x <- gsub("\\(.*?\\)", "", x)

  x <- gsub("y/o", ",y_o,", x)

  x <- gsub("-", " ", x)
  
  x <- gsub("(?<=\\w)\\s*y\\s*(?=\\w)", ",y", x, perl = TRUE)

  x <- gsub(",y_o,", "y/o", x)
  
  elementos <- unlist(strsplit(x, ","))

  elementos <- trimws(elementos) 
  elementos <- elementos[elementos != "Sin registro" & !is.na(elementos) & elementos != ""]
  
  elementos <- gsub("\\s*-\\s*", "", elementos)

  return(length(elementos))
})

With this code, values like "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco" are counted as 3, even though by the rules above they contain only 2 elements.
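One likely cause of the miscount: the lookaround pattern only requires a word character on each side of "y", with optional whitespace, so it also matches the "y" inside "Mayo". The sketch below reproduces this on the first row and compares it with a stricter pattern that requires whitespace around "y". The stricter pattern and the variable names are illustrative assumptions, not part of the original code.

```r
x <- "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco"

# Original pattern: "y" flanked by word characters, whitespace optional,
# so the "y" inside "Mayo" matches and a comma is inserted there.
loose <- gsub("(?<=\\w)\\s*y\\s*(?=\\w)", ",y", x, perl = TRUE)
print(loose)
n_loose <- lengths(strsplit(loose, ","))

# Stricter variant (assumption): "y" must have whitespace on both sides,
# so it only matches "y" as a standalone word.
strict <- gsub("\\s+y\\s+", ",", x)
n_strict <- lengths(strsplit(strict, ","))

print(c(loose = n_loose, strict = n_strict))
```

On this row the loose pattern yields 3 pieces and the stricter one yields 2.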

Does anybody know how to solve this problem? Thanks!


Solution

  • An approach with minimal regex: first remove the parenthesized part (...) with sub(), relying on the parentheses always being closed. Then pass all separators to strsplit() as a single alternation pattern. Finally, take the length of each resulting vector with lengths().

    transform(bd_universal, total = 
      lengths(strsplit(sub("\\(.*\\)", "", cartel), ",|y/o| y ")))
    

    output

                                                                                                     cartel
    1                                           Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco
    2                                                             Cártel Beltran Leyva, Cártel del Pacífico
    3                                                                        Cártel de Sinaloa y/o Pacífico
    4                                                                                       Leyva y/o Grupo
    5                                                                                           A, B, C y D
    6 Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix
    7                                                                                          A (B y C), D
    8                                                                            Leyva, Mayo y Junio Agosto
    9                                                                                 R (T y P), S, H y/o L
      total
    1     2
    2     2
    3     2
    4     2
    5     4
    6     3
    7     2
    8     3
    9     4
    

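    Note: if a single string can contain more than one (...) group, replace sub(...) with gsub("\\([ [:alnum:]/]*\\)", "", cartel). The greedy .* in the first pattern would otherwise remove everything from the first "(" to the last ")".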
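The steps of this answer can be wrapped in a small helper and checked against the totals the question asks for. This is a minimal sketch: `count_elements`, `expected`, and the string `two_groups` are hypothetical names and data introduced here for illustration. The helper uses the gsub() variant from the note so that it also handles several parenthesized groups.

```r
# Hypothetical helper wrapping the steps described above.
count_elements <- function(x) {
  # 1. Drop every parenthesized group (gsub handles more than one per string).
  no_parens <- gsub("\\([ [:alnum:]/]*\\)", "", x)
  # 2. Split on the separators: a comma, "y/o", or a standalone " y ".
  pieces <- strsplit(no_parens, ",|y/o| y ")
  # 3. Count the pieces of each string.
  lengths(pieces)
}

cartel <- c(
  "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco",
  "Cártel Beltran Leyva, Cártel del Pacífico",
  "Cártel de Sinaloa y/o Pacífico",
  "Leyva y/o Grupo",
  "A, B, C y D",
  "Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix",
  "A (B y C), D",
  "Leyva, Mayo y Junio Agosto",
  "R (T y P), S, H y/o L"
)

# Totals requested in the question, in row order.
expected <- c(2, 2, 2, 2, 4, 3, 2, 3, 4)
totals <- count_elements(cartel)
print(data.frame(total = totals, expected = expected))

# Made-up string with two parenthesized groups: the greedy sub() pattern
# removes "(B y C), D (E y F)" in one go and therefore loses element D.
two_groups <- "A (B y C), D (E y F), G"
greedy <- lengths(strsplit(sub("\\(.*\\)", "", two_groups), ",|y/o| y "))
per_group <- count_elements(two_groups)
print(c(greedy = greedy, per_group = per_group))
```

Pieces that consist only of whitespace are still counted, so strings that end in a separator may need trimws() and a filter on empty strings before counting.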