htmlrdataframehtml-tablehtml-tableextract

Extract html table into R but skip specific elements (<br>) in one certain column that has three lines/rows, in a multi-page html


I want to load the table in a link into a dataframe in R.

The following script successfully extracts the table:

install.packages("htmltab")
library(htmltabl)
url <- "http://www.hmdb.ca/metabolites?utf8=✓&filter=true&toxin=1&filter=true"
hm_ext <- htmltab(url)

But here is the problem:

> head(hm_ext)
     HMDB ID CAS Number        Name FormulaAverage Mass Monoisotopic Mass                                           Biospecimen Location
2    HMDB000014264-18-6 Formic acid              CH2O246.025446.005479308 BloodBreast MilkCerebrospinal Fluid (CSF)FecesSalivaSweatUrine
3   HMDB0000228108-95-2      Phenol              C6H6O94.111294.041864814                                     BloodFecesSalivaSweatUrine
4 HMDB000059818496-25-8     Sulfide                    S32.06531.97207069                                                          Blood
5  HMDB00005997440-47-3    Chromium                 Cr51.996151.940511904                      BloodCerebrospinal Fluid (CSF)SalivaUrine
6  HMDB00006577440-50-8      Copper                  Cu63.54662.929601079                      BloodCerebrospinal Fluid (CSF)SalivaUrine
7  HMDB00006627782-41-4    Fluoride                  F18.998418.998403205                                               BloodSalivaUrine

The third column, i.e. Formula Average Mass Monoisotopic Mass, has three rows/values/lines and they are all mixed together and thus showing up as one continuous string. I only want to extract the first line/row in this column OR somehow separate the three values from one another.

Here is how the first cell of the third column looks like in the rendered html page:

third column screenshot

Same thing happens if I use XML::readHTMLTable.

When I click on Inspect Element in Chrome, I can see such a structure for a cell in the Formula Average Mass Monoisotopic Mass column:

<td class="weight-value">CH<sub>2</sub>O<sub>2</sub><br><br>46.0254<br>46.005479308</td>

image -- please embed it. I do not have enough reputation to do that

However, there are also other times when the second and third lines in this column are empty. Example:

<td class="weight-value">(C<sub>12</sub>H<sub>19</sub>NO<sub>19</sub>S<sub>3</sub>)nH<sub>2</sub>O<br><br><span class="wishart wishart-not-available">Not Available</span><br></td>

image -- Inspect Element screenshot

So how can I extract the table from the given link but keep the structure of the third column readable and not mixed up? And furthermore, is it possible to extract the tables in all pages without looping through the links for each separate page?


Solution

  • One of the approach could be

    library(rvest)
    library(qdapRegex)
    library(XML)
    
    #read webpage
    htm_data <- read_html("http://www.hmdb.ca/metabolites?tf8=%E2%9C%93&filter=true&toxin=1&filter=true") 
    
    #convert above webpage's table into a dataframe
    df <- html_table(html_nodes(htm_data, "table"))[[1]]
    
    #cleanup data in the required column
    df[, 4] <- unlist(lapply(rm_between(xml_find_all(htm_data, "//table/tbody/tr/td[4]"), 
                                        ">", 
                                        "<br><br>", extract=TRUE), 
                             function(x) gsub("<.*?>", "", x[[1]])))
    

    which gives

    > head(df)
        HMDB ID  CAS Number        Name Structure FormulaAverage Mass Monoisotopic Mass
    1    HMDB000014264-18-6 Formic acid        NA                                 CH2O2
    2   HMDB0000228108-95-2      Phenol        NA                                 C6H6O
    3 HMDB000059818496-25-8     Sulfide        NA                                     S
    4  HMDB00005997440-47-3    Chromium        NA                                    Cr
    5  HMDB00006577440-50-8      Copper        NA                                    Cu
    6  HMDB00006627782-41-4    Fluoride        NA                                     F
                                                Biospecimen Location
    1 BloodBreast MilkCerebrospinal Fluid (CSF)FecesSalivaSweatUrine
    2                                     BloodFecesSalivaSweatUrine
    3                                                          Blood
    4                      BloodCerebrospinal Fluid (CSF)SalivaUrine
    5                      BloodCerebrospinal Fluid (CSF)SalivaUrine
    6                                               BloodSalivaUrine