rdataframepairwiseinterleave

Convert pairwise distance table to list of distance for individuals in only two columns


I want to convert a pairwise distance table (observations in 2 columns) into a table with individuals listed instead (observations in 1 column). Essentially the info on pairwise relationships will be lost (which is irrelevant to my analysis anyways) and the distance value will need to be doubled for their respective rows.

I can separate the strings with this code:

pairwise_readout <- str_split_fixed(pairwise[,1], " ", 4) #splits strings apart
pairwise_readout <- data.frame(pairwise_readout,pairwise$dist) #places distance again

But have no idea how to continue with re-arranging the table into fewer columns. All search results bring up only pairwise table related solutions.

Here is an example dataset:

Important to note is that I'm also interested in the 'gr#' contained within the string for each observation.

pairwise <- data.frame(ind_comp = c("OP2645ii_d gr3 OP5048___g gr2","OP5046___e gr5 OP5048___g gr2","OP2413iiia gr1 OP5048___g gr2","OP5043___b gr1 OP5048___g gr2", "OP3088i___a gr1 OP5048___g gr2","OP5046___a gr5 OP5048___g gr2", "OP5048___b gr5 OP5048___g gr2", "OP5043___a gr3 OP5048___g gr2", "OP2645ii_d gr3 OP5048___g gr2", "OP2645ii_d gr3 OP5044___c gr2", "OP2413iiib gr4 OP5048___g gr2", "OP5046___c gr1 OP5048___g gr2"), dist = c(7.590363,6.449676,6.419955,6.349918,6.182623,6.162655,6.154232,6.140147,6.058633,5.962923,5.943956,5.863753))

Essentially I want a table that follows this form:

pairwise_table_less_columns <- data.frame(ind_comp = c("OP2645ii_d","OP5048___g","OP5046___e", "OP5048___g", "OP2413iiia", "OP5048___g", "OP5043___b", "OP5048___g", "OP3088i___a", "OP5048___g", "OP5046___a", "OP5048___g", "OP5048___b", "OP5048___g", "OP5043___a", "OP5048___g", "OP2645ii_d", "OP5048___g", "OP2645ii_d", "OP5044___c", "OP2413iiib", "OP5048___g", "OP5046___c", "OP5048___g"), gr = c("gr3","gr2","gr5", "gr2", "gr1", "gr2", "gr1", "gr2", "gr1", "gr2", "gr5", "gr2", "gr5", "gr2", "gr3", "gr2", "gr3", "gr2", "gr3", "gr2", "gr4", "gr2", "gr1", "gr2"), dist = c(7.590363,7.590363,6.449676,6.449676,6.419955,6.419955,6.349918,6.349918,6.182623,6.182623,6.162655,6.162655,6.154232,6.154232,6.140147,6.140147,6.058633,6.058633,5.962923,5.962923,5.943956,5.943956,5.863753,5.863753))

Solution

  • Another idea is to replace the second space with another delimeter, and split on that, i.e.

    library(dplyr)
    library(tidyr)
    
    pairwise %>% 
     mutate(ind_comp = gsub('([^ ]+ [^ ]+) ', '\\1|', ind_comp)) %>% 
     separate_rows(ind_comp, sep = '[|]')
    

    which gives,

              ind_comp     dist
    1   OP2645ii_d gr3 7.590363
    2   OP5048___g gr2 7.590363
    3   OP5046___e gr5 6.449676
    4   OP5048___g gr2 6.449676
    5   OP2413iiia gr1 6.419955
    6   OP5048___g gr2 6.419955
    7   OP5043___b gr1 6.349918
    8   OP5048___g gr2 6.349918
    9  OP3088i___a gr1 6.182623
    10  OP5048___g gr2 6.182623
    11  OP5046___a gr5 6.162655
    12  OP5048___g gr2 6.162655
    13  OP5048___b gr5 6.154232
    14  OP5048___g gr2 6.154232
    15  OP5043___a gr3 6.140147
    16  OP5048___g gr2 6.140147
    17  OP2645ii_d gr3 6.058633
    18  OP5048___g gr2 6.058633
    19  OP2645ii_d gr3 5.962923
    20  OP5044___c gr2 5.962923
    21  OP2413iiib gr4 5.943956
    22  OP5048___g gr2 5.943956
    23  OP5046___c gr1 5.863753
    24  OP5048___g gr2 5.863753