I have a data frame with a column that contains multiple Spanish words. I want to count the total number of elements that each row contains. Here is an example data frame:
bd_universal <- data.frame(
cartel = c(
"Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco",
"Cártel Beltran Leyva, Cártel del Pacífico",
"Cártel de Sinaloa y/o Pacífico",
"Leyva y/o Grupo",
"A, B, C y D",
"Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix",
"A (B y C), D",
"Leyva, Mayo y Junio Agosto",
"R (T y P), S, H y/o L"))
The elements in each row are separated by three things: the "y" before the last element ("y" is "and" in English), the ",", and the "y/o" ("y/o" is "and/or" in English). I want to create a new column called "total" that counts the number of elements separated by these delimiters, except when the delimiters are inside parentheses. So, the resulting data frame would look like this:
cartel | total |
---|---|
Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco | 2 |
Cártel Beltran Leyva, Cártel del Pacífico | 2 |
Cártel de Sinaloa y/o Pacífico | 2 |
Leyva y/o Grupo | 2 |
A, B, C y D | 4 |
Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix | 3 |
A (B y C), D | 2 |
Leyva, Mayo y Junio Agosto | 3 |
R (T y P), S, H y/o L | 4 |
Does anyone know how to do this?
I have tried the following code, but it did not count the correct number of elements for each row:
bd_universal$total <- sapply(as.character(bd_universal$cartel), function(x) {
x <- gsub("\\(.*?\\)", "", x)
x <- gsub("y/o", ",y_o,", x)
x <- gsub("-", " ", x)
x <- gsub("(?<=\\w)\\s*y\\s*(?=\\w)", ",y", x, perl = TRUE)
x <- gsub(",y_o,", "y/o", x)
elementos <- unlist(strsplit(x, ","))
elementos <- trimws(elementos)
elementos <- elementos[elementos != "Sin registro" & !is.na(elementos) & elementos != ""]
elementos <- gsub("\\s*-\\s*", "", elementos)
return(length(elementos))
})
With this code, values like "Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco" are counted as 3, even though they should count as only 2 under the rules above.
Does anybody know how to solve this problem? Thanks!
An approach with minimal regex: first remove the parentheses `(...)`, relying on the fact that these are always closed. Then give `strsplit` all the split patterns at once. Finally, get the vector `lengths`.
transform(bd_universal, total =
lengths(strsplit(sub("\\(.*\\)", "", cartel), ",|y/o| y ")))
output
cartel
1 Cártel del Pacífico - Fracción Mayo Zambada, Cártel Jalisco
2 Cártel Beltran Leyva, Cártel del Pacífico
3 Cártel de Sinaloa y/o Pacífico
4 Leyva y/o Grupo
5 A, B, C y D
6 Cártel del Pacífico - Fracción Los Menores, Cártel Jalisco Nueva Generación, Cártel de Arellano Félix
7 A (B y C), D
8 Leyva, Mayo y Junio Agosto
9 R (T y P), S, H y/o L
total
1 2
2 2
3 2
4 2
5 4
6 3
7 2
8 3
9 4
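To see what each step does, here is a small sketch that runs the same three steps on a single string. The variable names (`x`, `no_paren`, `parts`, `n`) are only illustrative and are not part of the original answer. The last line shows why the pattern is `" y "` with spaces: the `\\s*y\\s*` pattern from the question also matches the "y" inside "Mayo", which is most likely where the extra element came from.

```r
x <- "R (T y P), S, H y/o L"

# Step 1: drop the parenthesised part so its "y" is never seen by the split
no_paren <- sub("\\(.*\\)", "", x)
no_paren
#> [1] "R , S, H y/o L"

# Step 2: split on any of the three delimiters: ",", "y/o", or " y "
parts <- strsplit(no_paren, ",|y/o| y ")[[1]]
trimws(parts)
#> [1] "R" "S" "H" "L"

# Step 3: count the pieces
n <- length(parts)
n
#> [1] 4

# The question's pattern allows zero spaces around "y",
# so it also splits inside words such as "Mayo"
mayo_split <- gsub("(?<=\\w)\\s*y\\s*(?=\\w)", ",y", "Mayo", perl = TRUE)
mayo_split
#> [1] "Ma,yo"
```

Requiring a space on both sides of the standalone "y" is what keeps words like "Mayo" and "Leyva" intact.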
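Note: if you have multiple `(...)` groups within one string, replace `sub(...)` with `gsub("\\([ [:alnum:]/]*\\)", "", cartel)`. The greedy `.*` would otherwise remove everything between the first `(` and the last `)`.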
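A quick sketch of the difference, using a made-up string with two parenthesised groups (the string and the variable names are assumptions for illustration, not taken from the data above):

```r
x <- "A (B y C), D (E y F), G"

# Greedy sub(): removes from the first "(" to the last ")", swallowing "D"
greedy <- sub("\\(.*\\)", "", x)
greedy
#> [1] "A , G"

# gsub() with a restricted character class: removes each group separately
each <- gsub("\\([ [:alnum:]/]*\\)", "", x)
each
#> [1] "A , D , G"

# Counting elements with both versions
n_greedy <- lengths(strsplit(greedy, ",|y/o| y "))
n_each   <- lengths(strsplit(each, ",|y/o| y "))
c(n_greedy, n_each)
#> [1] 2 3
```

The character class only allows spaces, alphanumerics, and "/" between the parentheses, so a group containing other characters (for example a comma) would not be removed and the class would need extending.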