rbioconductorbiomart

Minor alleles giving NA from R biomart getBM?


Does anyone know how to extract minor allele data from grch38 ? the following gives NAs: library("biomaRt"); snp.db3 <- useMart(host = "https://www.ensembl.org/", biomart = "ENSEMBL_MART_SNP", dataset = "hsapiens_snp"); nt.biomart3 <- getBM(attributes = c("refsnp_id", "minor_allele", "minor_allele_freq", "chrom_start", "chrom_strand", "associated_gene"), filters = c("snp_filter"), values = c("rs3762444", "rs284262", "rs655598", "rs12089815", "rs12140153", "rs788163", "rs1064213", "rs1106090", "rs7557796", "rs16825008"), mart = snp.db3, uniqueRows = TRUE); nt.biomart3 The data is apparently available on Ensembl (e.g. https://www.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=10:120308754-120309754;v=rs10788066;vdb=variation;vf=654619804) but the data extraction method only works if I use host = "https://grch37.ensembl.org" - otherwise it gives NAs for minor allele and minor allele frequency - is there something else I should do ? Or is there a better method ? Or is the minor allele data from grch37 actually up-to-date ?


Solution

  • Online Biomart (Martview) from Ensembl (https://www.ensembl.org/) doesn't show either this information :

    ensemble

    That explains why you're getting "NA" with "biomaRt".

    Online Biomart (Martview) from Ensembl GRCh37 (https://grch37.ensembl.org/) displays it : grch37

    Grch37 seems up to date (Release 112 (May 2024)).

    To produce the preceding screenshots :

    If you want data from Grch38, you can try to scrape the data directly from "Ensembl.org".

    ### Packages
    library(rvest)
    library(purrr)
    library(dplyr)
    
    ### SNPs to look for
    look=c("rs3762444", "rs284262", "rs655598", "rs12089815", "rs12140153", "rs788163", "rs1064213", "rs1106090", "rs7557796", "rs16825008")
    
    ### Function to get the data
    biomart=function(x){
      z=read_html(paste0("http://www.ensembl.org/homo_sapiens/Variation/Explore?v=",x))
      a= z %>% 
        html_element(xpath = '//span[.="Highest population MAF"]/following-sibling::span/b') %>%
        html_text2()
      if (is.na(a)) {b<-NA_character_}else{
        b = z %>% 
        html_element(xpath = '//span[.="Highest population MAF"]/following-sibling::span') %>%
        html_attr("title") %>%
        read_html() %>% html_text2()}
      data.frame(SNP=x,MAF=a,source=b)
      }
    
    ### Map operation
    out=map(look,biomart,.progress = TRUE)
    
    ### Build the dataframe
    output=bind_rows(out)
    

    Output :

             SNP  MAF                       source
    1   rs3762444 <NA>                         <NA>
    2    rs284262 0.49 T in 1000GENOMES:phase_3:SAS
    3    rs655598 0.49 A in 1000GENOMES:phase_3:FIN
    4  rs12089815 0.50 G in 1000GENOMES:phase_3:EUR
    5  rs12140153 0.18             T in gnomADg:ami
    6    rs788163 0.42 C in 1000GENOMES:phase_3:LWK
    7   rs1064213 0.50             A in gnomADe:nfe
    8   rs1106090 0.50             A in gnomADg:amr
    9   rs7557796 0.48 T in 1000GENOMES:phase_3:SAS
    10 rs16825008 0.49 A in 1000GENOMES:phase_3:ITU
    

    Notes :