
User Defined Function not working in dplyr pipe


I have a dataset of protein accession numbers (DataGranulomeTidy). I have written a function (extractInfo) in R to scrape some information about those proteins from the NCBI website. The function works as expected when I run it in a short "for" loop.

library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # str_extract()

DataGranulomeTidy <- tibble(GIaccessionNumber = c("29436380", "4504165", "17318569"))

extractInfo <- function(GInumber){
    tempPage <- readLines(paste("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=", GInumber, "&db=protein&report=genpept&conwithfeat=on&withparts=on&show-cdd=on&retmode=html&withmarkup=on&tool=portal&log$=seqview&maxdownloadsize=1000000", sep = ""), skipNul = TRUE)
    tempPage  <- base::paste(tempPage, collapse = "")
    Accession <- str_extract(tempPage, "(?<=ACCESSION).{3,20}(?=VERSION)")
    Symbol    <- str_extract(tempPage, "(?<=gene=\").{1,20}(?=\")")
    GeneID    <- str_extract(tempPage, "(?<=gov/gene/).{1,20}(?=\">)")
    out       <- paste(Symbol, Accession, GeneID, sep = "---")
    return(out)
}


for(n in 1:3){
    print(extractInfo(GInumber = DataGranulomeTidy$GIaccessionNumber[n]))
}
 [1] "MYH9---   AAH49849---4627"
 [1] "GSN---   NP_000168---2934"
 [1] "KRT1---   NP_006112---3848"

When I use the same function in a dplyr pipe, it doesn't work and I can't figure out why.

 > DataGranulomeTidy %>% mutate(NewVar = extractInfo(.$GIaccessionNumber))
 Error in file(con, "r") : invalid 'description' argument

At this point I can make things work with the "for" loop instead of the pipe, but I would very much like to understand why the function does not work in the dplyr pipe.


Solution

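  • The cause is that your function cannot handle a vector. Inside mutate(), extractInfo() receives the whole GIaccessionNumber column at once, so paste() builds three URLs and readLines(), which expects a single connection, fails. Vectorize() wraps the function so that it is called once per element: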

    vectorized_extractInfo <- Vectorize(extractInfo, "GInumber")
    
    DataGranulomeTidy %>% 
      mutate(NewVar = vectorized_extractInfo(GIaccessionNumber))
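
  • As an alternative sketch, base R's vapply() also calls a function once per element and additionally checks that every result is a single string. The example below avoids the network by using scalar_only(), a hypothetical stand-in (not part of the question) that, like readLines(), accepts only one value at a time.

```r
# Hypothetical stand-in for extractInfo(): it only accepts a single value,
# mimicking the way readLines() accepts only one connection.
scalar_only <- function(id) {
  stopifnot(length(id) == 1)
  paste0("ID-", id)
}

ids <- c("29436380", "4504165", "17318569")

# Passing the whole vector fails, as extractInfo() does inside mutate().
failed <- inherits(try(scalar_only(ids), silent = TRUE), "try-error")

# vapply() applies the function element by element and requires each
# result to be a character vector of length 1.
res <- vapply(ids, scalar_only, character(1), USE.NAMES = FALSE)
print(res)
```

    In the pipe from the question, the same idea would presumably read mutate(NewVar = vapply(GIaccessionNumber, extractInfo, character(1), USE.NAMES = FALSE)); this still issues one web request per row.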