rtransformationncbi

R - Irregular metadata; create df from large single column


The title doesn't really do my question justice, because there are probably a few ways to skin this cat. But I picked one approach and went with it. This is what I'm working with:

I've pulled all the metadata for a particular study in the NCBI database using the "Send to:" option on their interface and downloading a .txt file.

In total, I have ~23k samples, each with up to 609 unique questions and answers from a questionnaire totaling 8M+ obs of 1 variable when read as a .csv. To my dismay, the metadata are irregular. Some samples have 140 associated key/value pairs. Others have 492. I've included a header of a sample below.

1: qiita_sid_10317:10317.BLANK1.6H.GUELPH
Identifiers: BioSample: SAMEA4790059; SRA: ERS2609990
Organism: metagenome
Attributes:
    /Alias="qiita_sid_10317:10317.BLANK1.6H.GUELPH"
    /description="American Gut control"
    /ENA checklist="ERC000011"
    /INSDC center alias="UCSDMI"
    /INSDC center name="University of California San Diego Microbiome Initiative"
    /INSDC first public="2018-07-13T17:03:10Z"
    /INSDC last update="2018-07-13T14:50:03Z"
    /INSDC status="public"
    /SRA accession="ERS2609990" 

I've tried (including but not limited to):

Can't seem to get past the cleaning steps:

samples <- read.csv("~/biosample_result_full.txt")
samples_split <- cSplit(samples, splitCols = sample$Colname, sep = "=")
samples_split$Attributes_1 <- gsub(" ", "_", samples_split$Attributes_1)
questions <- unique(samples_split$Attributes_1)

Ideally, each sample and associated metadata would be transformed into rows, with each "Attribute"/question as the column name.

This

Any help is greatly appreciated.


Solution

  • I see that the website you've linked to, allows fot the option to export data to xml. I strongly suggest to do so. R can hande/parse xml-files very efficient.

    When I download the first three results from that site to a file biosample_result.xml , it's easy to process using the xml2-package

    library( xml2 )
    library( magrittr )
    
    doc <- read_xml( "./biosample_result.xml")
    #gret all BioSample nodes
    BioSample.Nodes <- xml_find_all( doc, "//BioSample")
    #build a data.frame
    data.frame( 
      sample_name = xml_find_first( BioSample.Nodes , ".//Id[@db='SRA']") %>% xml_text(),
      stringsAsFactors = FALSE )
    
    #   sample_name
    # 1  ERS2609990
    # 2  ERS2609989
    # 3  ERS2609988
    

    So if you can use the XML, you will just have to use the right xpath-syntax to get the data/nodes you need, into the columns you want...
    In the exmaple above, I extracted (from each BioSample-node) the first ID-node with attribute db equals SRA, and stored the result in the co0lumn sample_name.

    Still assuming you can use the xml-data. If you are lokking for all attributes into one df, you need the functions from purrr, so just load the entire tidyverse

    library( tidyverse )
    df <- xml_find_all( doc, "//BioSample")  %>% 
      map_df(~{
        set_names(
          xml_find_all(.x, ".//Attribute") %>% xml_text(),
          xml_find_all(.x, ".//Attribute") %>% xml_attr( "attribute_name" )
    
        ) %>% 
          as.list() %>%  
          flatten_df() 
      })
    

    will result in a df like this enter image description here