r dplyr desctools

Selecting the top 5 values per group from a tab-delimited file and printing the adjacent column values with them


I have a tab-delimited file, abc.txt:

contig  score   guide
1:100-101   7   AAA
1:100-101   6   BBB
1:100-101   5   CCC
1:100-101   4   DDD
1:100-101   3   EEE
1:100-101   2   FFF
1:100-101   1   GGG
1:100-101   90  HHH
1:100-101   111 III
1:100-101   1111    JJJ
1:200-203   503.5333333 KKK
1:200-203   570.7212121 LLL
1:200-203   637.9090909 MMM
1:200-203   705.0969697 NNN
1:200-203   772.2848485 OOO
1:200-203   839.4727273 PPP
1:200-203   906.6606061 QQQ
1:200-203   973.8484848 RRR
2:300-301   1041.036364 SSS
2:300-301   1108.224242 TTT
2:300-301   1175.412121 UUU
2:300-301   1242.6  VVV
2:300-301   1309.787879 ABC
2:300-301   1376.975758 CGA
2:300-301   1444.163636 ACD

Column 1 (contig) has repeated values, column 2 has scores, and column 3 has the guide letters corresponding to the scores in column 2. For each distinct contig value in column 1, I need to select the top 5 scores and print their corresponding column 3 values.

The output should look like this: the first column holds the unique contig entry from column 1, and the next 10 columns hold the top 5 scores and their corresponding column 3 guide letters.

    Score-1 Guide-1 Score-2 Guide-2 Score-3 Guide-3 Score-4 Guide-4 Score-5 Guide-5
1:100-101   1111    JJJ 111 III 90  HHH 7   AAA 6   BBB
1:200-203   973.8484848 RRR 906.6606061 QQQ 839.4727273 PPP 772.2848485 OOO 705.0969697 NNN
2:300-301   1444.163636 ACD 1376.975758 CGA 1309.787879 ABC 1242.6  VVV 1175.412121 UUU

I used the "dplyr" and "DescTools" packages, but I am running into an error.

library(dplyr)
library(DescTools)
file <- "abc.txt"
x=read.table(file)
b <- Large(x, k=5, unique = FALSE, na.last=NA)

and I get this error:

Error in Large(x, k = 5, unique = FALSE, na.last = NA) : 
  Not compatible with requested type: [type=character; target=double].

I managed to do this in Excel using the SUMPRODUCT, LARGE, IFERROR and VLOOKUP formulas; however, for large datasets I want to do this in R.

Any help will be much appreciated


Solution

  • The problem is that Large() expects a numeric vector, not an entire data frame. This is just a guess, since I don't have a reproducible example, but you might want to do something along these lines:

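    library(dplyr)
    file <- "./abc.txt"
    # header = TRUE keeps score numeric instead of reading the header row as data
    x <- read.table(file, header = TRUE, stringsAsFactors = FALSE)
    colnames(x) <- c("contig", "score", "guide")

    # one data frame per contig
    groups <- split(x, f = x$contig)
    columntitles <- c()
    for (i in 1:5)
      columntitles <- c(columntitles, paste0("score-", i), paste0("guide-", i))
    # placeholder first row, dropped again after the loop
    result <- data.frame(matrix(NA, nrow = 1, ncol = 10))
    colnames(result) <- columntitles

    for (i in seq_along(groups)) {
      singlerow <- c()
      partialdata <- groups[[i]]
      # keep the 5 highest scores of this contig, largest first
      partialdata <- partialdata %>% top_n(5, score)
      partialdata <- partialdata[order(partialdata$score, decreasing = TRUE), ]
      for (j in 1:5) {
        singlerow <- c(singlerow, toString(partialdata$score[j]), toString(partialdata$guide[j]))
      }
      result <- rbind(result, singlerow)
    }
    result <- result[-1, ]
    rownames(result) <- names(groups)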
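
For comparison, here is a minimal base R sketch of the same idea that needs no extra packages. It makes a few assumptions: the data is pasted inline instead of being read from abc.txt, the helper name top_k_wide is made up for illustration, and ties are broken by row order. It first shows why the header matters (without header = TRUE the score column is not numeric), then builds one row per contig with the top 5 scores and their guides.

```r
# Inline copy of the question's data, standing in for abc.txt.
# read.table's default separator is any whitespace, so tabs work the same way.
sample_text <- "contig score guide
1:100-101 7 AAA
1:100-101 6 BBB
1:100-101 5 CCC
1:100-101 4 DDD
1:100-101 3 EEE
1:100-101 2 FFF
1:100-101 1 GGG
1:100-101 90 HHH
1:100-101 111 III
1:100-101 1111 JJJ
1:200-203 503.5333333 KKK
1:200-203 570.7212121 LLL
1:200-203 637.9090909 MMM
1:200-203 705.0969697 NNN
1:200-203 772.2848485 OOO
1:200-203 839.4727273 PPP
1:200-203 906.6606061 QQQ
1:200-203 973.8484848 RRR
2:300-301 1041.036364 SSS
2:300-301 1108.224242 TTT
2:300-301 1175.412121 UUU
2:300-301 1242.6 VVV
2:300-301 1309.787879 ABC
2:300-301 1376.975758 CGA
2:300-301 1444.163636 ACD
"

# Without header = TRUE the header line is read as data, so the score
# column is not numeric.
no_header   <- read.table(text = sample_text)
with_header <- read.table(text = sample_text, header = TRUE,
                          stringsAsFactors = FALSE)
is.numeric(no_header$V2)       # FALSE
is.numeric(with_header$score)  # TRUE

# top_k_wide is a hypothetical helper name, not part of any package.
top_k_wide <- function(df, k = 5) {
  per_contig <- lapply(split(df, df$contig), function(g) {
    # sort this contig's rows by score, largest first, and keep k of them
    g <- head(g[order(g$score, decreasing = TRUE), ], k)
    # interleave score and guide: score 1, guide 1, score 2, guide 2, ...
    row <- as.vector(rbind(as.character(g$score), g$guide))
    length(row) <- 2 * k  # pad with NA when a contig has fewer than k rows
    row
  })
  out <- as.data.frame(do.call(rbind, per_contig), stringsAsFactors = FALSE)
  colnames(out) <- as.vector(rbind(paste0("Score-", 1:k), paste0("Guide-", 1:k)))
  out
}

result <- top_k_wide(with_header)
print(result)
```

The contig values end up as row names of result. Writing it out with write.table(result, sep = "\t", quote = FALSE, col.names = NA) should give a layout close to the one requested in the question, with an empty header cell above the contig column.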