I'm a newbie to R and currently working with collaboration data in the form of an edge list with 32 columns and around 200.000 rows in the following representation:
1 A A A B C A
2 A B B B C A
3 A B C C C C
4 B A B A B C
A, B, C represent the countries of the in the publication participating researchers' institutions. In the real data set "A" is e.g. a country name such as "England" or "China".
I want to keep unique records (A) and double records (A A) but remove triplets (A A A) and more occurence of the same record in a row. Collaborations are assigned to publications by ID so the order should stay the same to allow for later analysis. However, order within rows does not matter.
So, it should look like this eventually.
1 A A B C
2 A B B C A
3 A B C C
4 B A B A C
So far, I've tried a few things based on 1, 2, 3 and triplicated
from the tuple package.
df <- data.frame(CTR1 = c("A", "A", "A", "B"), CTR2 = c("A", "B", "B", "A"), CTR3 = c("A", "B", "B", "A"), CTR4 = c("B", "B", "C", "A"), CTR5 = c("C", "C", "C", "B"), CTR6 = c("A", "A", "C", "C"), ID = c(1,2,3,4))
# remember the ID
n <-df$ID
# transpose df (data frame)
dt <- as.data.frame(t(df[, -1]))
colnames(dt) <- n
library(tuple)
dt[!(triplicated(dt) | triplicated(dt, fromLast= TRUE)), ]
# Create new transposed variable
df2 <- as.data.frame(t(df))
However, I delete the complete rows using dt[!(triplicated(dt) | triplicated(dt, fromLast= TRUE)), ]
rather than just deleting the specific records that present a surplus by row what leads me to the following 4x4 table ...
CTR1 CTR3 CTR4 CTR5 CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
1 A A B C 1 A A B C
2 A B B C rather than 2 A B B C A
3 A B C C 3 A B C C
4 B A A B 4 B A B A C
I've also taken a look at dplyr
and the responses here and here but couldn't figure out a fitting approach so far.
Working Solution for original problem description
library(tidyr)
library(dplyr)
countrydf %>%
unite(concat,CTR1:CTR6, sep = "") %>%
mutate(concat = gsub("([a-zA-Z1-9])\\1{2,}", "\\1\\1-", concat)) %>%
separate(concat, paste0("CTR", 1:6), sep = "(?<=.)", remove = TRUE)
Edit1: Adjusted description for clarification: In the real data set "A" is e.g. a country name such as "England" or "China".
Edit2: Adding a more accurated reproducible example.
Added a more accurated reproducible example, incorporated correct answer for original problem description (see below):
countrydf <- data.frame(ID = c(1,2,3,4),
CTR1 = c("England", "England", "England", "China"),
CTR2 = c("England", "China", "China", "England"),
CTR3 = c("England", "China", "China", "England"),
CTR4 = c("China", "China", "USA", "England"),
CTR5 = c("USA", "USA", "USA", "China"),
CTR6 = c("England", "England", "USA", "USA"))
Aspired outcome
CTR1 CTR2 CTR3 CTR4 CTR5 CTR6 ID
1 England England China USA 1
2 England China China USA England 2
3 England China China USA USA 3
4 China England England China USA 4
We can use tidyr
functions (unite
and separate
) and with the help of regex achieve the desired output.
This should get very close to what you need.
library(tidyr)
library(dplyr)
df1 %>%
unite(concat,CTR1:CTR6, sep = "") %>%
mutate(concat = gsub("([a-zA-Z1-9])\\1{2,}", "\\1\\1-", concat)) %>%
separate(concat, paste0("CTR", 1:6), sep = "(?<=.)", remove = TRUE)
#> ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#> 1 1 A A - B C A
#> 2 2 A B B - C A
#> 3 3 A B B C C -
#> 4 4 B A A - B C
Thanks to @IceCreamToucan for the function:
library(tidyr)
library(dplyr)
ICT_fn <- function(x){
xsplit <- strsplit(x, '')[[1]]
xsplit[data.table::rowid(xsplit) >= 3] <- '-'
paste(xsplit, collapse = '')}
df1 %>%
unite(concat,CTR1:CTR6, sep = "") %>%
rowwise() %>%
mutate(concat = ICT_fn(concat)) %>%
separate(concat, paste0("CTR", 1:6), sep = "(?<=.)", remove = TRUE)
#> # A tibble: 4 x 7
#> ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 A A - B C -
#> 2 2 A B B - C A
#> 3 3 A B B C C -
#> 4 4 B A A - B C
This updated solution, with minimal tweaks, works for name of countries as well; look below:
df2 <- data.frame(ID = c(1,2,3,4),
CTR1 = c("England", "England", "England", "France"),
CTR2 = c("England", "France", "France", "England"),
CTR3 = c("England", "France", "France", "England"),
CTR4 = c("France", "France", "Germany", "England"),
CTR5 = c("Germany", "Germany", "Germany", "France"),
CTR6 = c("England", "England", "Germany", "Germany"))
library(tidyr)
library(dplyr)
ICT_fn <- function(x){ #Credits to IceCreamToucan
xsplit <- strsplit(x, ',')[[1]]
xsplit[data.table::rowid(xsplit) >= 3] <- '-'
paste(xsplit, collapse = ',')}
df2 %>%
unite(concat,CTR1:CTR6, sep = ",") %>%
rowwise() %>%
mutate(concat = ICT_fn(concat)) %>%
separate(concat, paste0("CTR", 1:6), sep = ",", remove = TRUE)
#> # A tibble: 4 x 7
#> ID CTR1 CTR2 CTR3 CTR4 CTR5 CTR6
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 England England - France Germany -
#> 2 2 England France France - Germany England
#> 3 3 England France France Germany Germany -
#> 4 4 France England England - France Germany