r qdapregex

Grep html code between html tags containing a keyword in R


Within a file, I would like to use grep, or perhaps the rm_between function from the qdapRegex package, to extract every whole section of HTML code that contains a keyword, let's say "discount rate" for this example. Specifically, I want results that look like these code snippets:

<P>This is a paragraph containing the words discount rate including other things.</P>

and

<TABLE width="400">
  <tr>
    <th>Month</th>
    <th>Savings</th>
  </tr>
  <tr>
    <td>Discount Rate</td>
    <td>10.0%</td>
  </tr>
  <tr>
    <td>February</td>
    <td>$80</td>
  </tr>
</TABLE>

  1. The trick here is that it must find "discount rate" first and then pull out the rest of the enclosing section.
  2. The section is always going to be between <P> and </P> or between <TABLE and </TABLE>, and no other HTML tags.

A good sample .txt file for this can be found here:

https://www.sec.gov/Archives/edgar/data/66740/0000897101-04-000425.txt


Solution

  • You can treat the file as HTML and explore it as if you were scraping it with rvest:

    library(rvest)
    library(stringr)
    
    # Parse the file as HTML
    html <- read_html('~/Downloads/0000897101-04-000425.txt')
    
    # Get all the 'p' nodes (you can do the same for 'table')
    p_nodes <- html %>% html_nodes('p')
    
    # Get the text from each node
    p_nodes_text <- p_nodes %>% html_text()
    
    # Flag the nodes whose text contains the term (case-insensitive)
    match_indices <- str_detect(p_nodes_text, fixed('discount rate', ignore_case = TRUE))
    
    # Keep only the nodes with matches
    # Notice that I remove the first match because rvest adds a
    # 'p' node to the whole file, since it is a text file
    match_p_nodes <- p_nodes[match_indices][-1]
    
    # If you want to see the results, you can print them like this
    # (or you could send them to a file).
    # seq_along() skips the loop entirely when there are no matches,
    # whereas 1:length(x) would iterate over c(1, 0).
    for (i in seq_along(match_p_nodes)) {
      cat(paste0('Node #', i, ': ', as.character(match_p_nodes[i]), '\n\n'))
    }
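
The comment above mentions sending the results to a file instead of printing them. A minimal sketch of that step follows, assuming rvest is installed. The inline HTML string and the temporary output path are made up for illustration and are not taken from the SEC file; in practice you would pass your own nodeset and file name.

```r
library(rvest)

# Small stand-in document (hypothetical, not the SEC filing)
doc <- read_html('<html><body><p>The discount rate was 10%.</p><p>Other text.</p></body></html>')

# Select the paragraphs and flag the ones that mention the keyword
p_nodes <- html_nodes(doc, 'p')
keep <- grepl('discount rate', html_text(p_nodes), ignore.case = TRUE)

# as.character() serializes each matching node back to HTML;
# writeLines() then writes the serialized nodes to the file
out_file <- tempfile(fileext = '.html')  # hypothetical output path
writeLines(as.character(p_nodes[keep]), out_file)
```

Reading the file back with readLines(out_file) should show the serialized HTML of the matching paragraph only.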
    

    For the <table> tags, you would not remove the first match:

    table_nodes <- html %>% html_nodes('table')
    table_nodes_text <- table_nodes %>% html_text()
    match_indices_table <- str_detect(table_nodes_text, fixed('discount rate', ignore_case = TRUE))
    match_table_nodes <- table_nodes[match_indices_table]
    
    # seq_along() is safe even when no table matches
    for (i in seq_along(match_table_nodes)) {
      cat(paste0('Node #', i, ': ', as.character(match_table_nodes[i]), '\n\n'))
    }
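
Since the 'p' and 'table' cases repeat the same steps, they could be folded into one helper. The sketch below is only an illustration: nodes_containing() is a hypothetical name, not an rvest function, and the inline sample HTML is made up to mirror the snippets in the question. It also squeezes whitespace with str_squish() before matching, on the assumption that a keyword may be split across a line break in the source file. It does not drop a wrapper node, so with the real .txt file the 'p' result may still need the first match removed as described above.

```r
library(rvest)
library(stringr)

# Hypothetical helper (not part of rvest): return the serialized HTML of
# every <tag> node whose text contains `keyword`, ignoring case.
nodes_containing <- function(doc, tag, keyword) {
  nodes <- html_nodes(doc, tag)
  # Collapse runs of whitespace so a keyword broken across lines still matches
  text <- str_squish(html_text(nodes))
  hits <- str_detect(text, fixed(keyword, ignore_case = TRUE))
  as.character(nodes[hits])
}

# Made-up sample that mirrors the snippets in the question
sample_html <- '<html><body>
<P>This is a paragraph containing the words discount rate including other things.</P>
<P>An unrelated paragraph.</P>
<TABLE width="400"><tr><td>Discount Rate</td><td>10.0%</td></tr></TABLE>
<TABLE><tr><td>February</td><td>$80</td></tr></TABLE>
</body></html>'

doc <- read_html(sample_html)
p_hits <- nodes_containing(doc, 'p', 'discount rate')
table_hits <- nodes_containing(doc, 'table', 'discount rate')
```

Here p_hits and table_hits are character vectors holding the matching paragraph and the matching table, so they can be printed with cat() or written out with writeLines().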