r dataframe select genetic-programming

R Grouping data from dataframes for data analysis


I need help with this because I do not know how to approach it. I have two data frames that look like this:

(df1) DataGenSample: the first column holds the gene symbol and every other column is a sample

(df2) Subtypes: a data frame with two columns, where the first column is the sample ID and the second is the cancer subtype

The first thing I'm looking for is to select only the samples in DataGenSample that also appear in Subtypes, and then separate them by subtype.

The data files can be found here

Any help is more than welcome, because I'm lost.

DataGenSample <- read.table("DataGenSample.txt",sep="\t", header=TRUE, check.names = FALSE)
Subtypes <- read.table("SamplesType.txt",sep="\t", header=TRUE, check.names = FALSE)

A little example: df1:

hugo_symbol   TCGA-3C-AAAU-01    TCGA-3C-AALI-01    TCGA-3C-AALJ-01 ... TCGA-3C-AALL-99
CDK11A               0                 -1                -1         ...     -1
HNRNPR               0                 -1                -1         ...     -1
SRSF10               0                 -1                -1         ...     -1

df2:

Sample_id            Subtype
TCGA-3C-AAAU-01        BRCA_LumA
TCGA-3C-AALI-01        BRCA_Her2
TCGA-3C-AALL-99        BRCA_Normal

Expected output:

-BRCA_LumA.df:

hugo_symbol   TCGA-3C-AAAU-01    
CDK11A               0              
HNRNPR               0              
SRSF10               0                

-BRCA_Her2.df:

hugo_symbol   TCGA-3C-AALI-01   
CDK11A               -1              
HNRNPR               -1              
SRSF10               -1   

-BRCA_Normal.df:

hugo_symbol   TCGA-3C-AALL-99   
CDK11A               -1              
HNRNPR               -1              
SRSF10               -1   

Solution

  • If I understand correctly, you want to select the subset of columns in DataGenSample that corresponds to a given subtype in Subtypes. You can do this by pivoting the sample columns into rows with pivot_longer() from the tidyr package (older versions of tidyr use gather() for this). After the pivot, you can join the two data frames on SAMPLE_ID.

    You can then filter on the subtype and pivot the remaining SAMPLE_IDs (now fewer) back into columns. To do this for every subtype separately, use a for loop with assign() to name each data frame after the subtype being filtered.

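    library(dplyr)
    library(tidyr)

    # The column names used here (Hugo_Symbol, SAMPLE_ID, SUBTYPE) are assumed
    # to match the headers of the data files; adjust them if yours differ.

    # Reshape to long format: one row per gene/sample pair.
    DataGenSample_long <- DataGenSample %>%
      pivot_longer(names_to = 'SAMPLE_ID', values_to = 'value', cols = -Hugo_Symbol)

    # Attach the subtype of each sample (NA for samples without a subtype).
    DataGenSample_long_join <- DataGenSample_long %>%
      left_join(Subtypes, by = 'SAMPLE_ID')

    # Build one wide data frame per subtype, named "<subtype>.df".
    for (Subtype in unique(Subtypes$SUBTYPE)) {
      assign(paste0(Subtype, '.df'),
             DataGenSample_long_join %>%
               filter(SUBTYPE == Subtype) %>%
               select(-SUBTYPE) %>%
               pivot_wider(names_from = SAMPLE_ID, values_from = value))
    }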
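
For comparison, the same grouping can be sketched in base R without reshaping, by splitting the sample IDs by subtype and subsetting the columns directly. This is a minimal sketch, not part of the original answer. It uses toy data copied from the question's example, so the column names (hugo_symbol, Sample_id, Subtype) and the object names df1, df2, matched, ids_by_subtype and by_subtype are assumptions for illustration. The values for TCGA-3C-AALJ-01 are taken from the question's df1. It collects the results in a named list instead of calling assign().

```r
# Toy data copied from the question's example.
# check.names = FALSE keeps the hyphens in the sample IDs.
df1 <- data.frame(
  hugo_symbol = c("CDK11A", "HNRNPR", "SRSF10"),
  `TCGA-3C-AAAU-01` = c(0, 0, 0),
  `TCGA-3C-AALI-01` = c(-1, -1, -1),
  `TCGA-3C-AALJ-01` = c(-1, -1, -1),  # this sample has no subtype in df2
  `TCGA-3C-AALL-99` = c(-1, -1, -1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Sample_id = c("TCGA-3C-AAAU-01", "TCGA-3C-AALI-01", "TCGA-3C-AALL-99"),
  Subtype   = c("BRCA_LumA", "BRCA_Her2", "BRCA_Normal"),
  stringsAsFactors = FALSE
)

# Keep only the samples that appear in both data frames.
matched <- df2[df2$Sample_id %in% names(df1), ]

# Group the sample IDs by subtype: a named list of character vectors.
ids_by_subtype <- split(matched$Sample_id, matched$Subtype)

# For each subtype, keep the gene column plus that subtype's sample columns.
by_subtype <- lapply(ids_by_subtype, function(ids) {
  df1[, c("hugo_symbol", ids), drop = FALSE]
})

names(by_subtype)
by_subtype[["BRCA_Her2"]]
```

Each subtype's data frame is then available as by_subtype[["BRCA_LumA"]] and so on, and samples without a subtype are left out. If separate objects are really needed, list2env(by_subtype, envir = globalenv()) should create one variable per list element, although keeping everything in a list is usually easier to loop over.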