rselenium-webdriver, web-scraping, reddit, rselenium

Multiple tags getting captured while web scraping Reddit in R using RSelenium


I was writing code to web scrape the post title, comments, and author names from a Reddit post for a project. I am able to scrape the post title and author names, but the comments are not being extracted correctly.

If there are 31 comments on the post, each comment gets extracted 31 times. Here is the code for reference:

# load packages
library(RSelenium)
library(netstat)

# start the server
rs_driver_object <- rsDriver(browser = 'firefox', verbose = FALSE, port = free_port(), chromever = NULL)

# create a client object
remDr <- rs_driver_object$client

# open a browser
remDr$open()
# maximize window
remDr$maxWindowSize()

remDr$navigate("https://www.reddit.com/r/AnimeReviews/comments/essf1u/assassination_classroom_is_a_1010_the_charm_the/")

Sys.sleep(2)

# scroll to the end of the webpage
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(2)
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")

load_more_comments <- remDr$findElement(using = 'xpath', '//*[@id="comment-tree"]/faceplate-partial/div[1]/button')
load_more_comments$clickElement()
#load_more_comments$refresh()

#pickup title
title <- remDr$findElement(using = 'xpath', '//*[@id="main-content"]/shreddit-title')$getElementAttribute('title')

#comments
comment_list <- remDr$findElements(using = 'tag name', 'shreddit-comment')
#print(typeof(comment_list))

for (each_comment in comment_list) {
  print(paste("Author --->", each_comment$getElementAttribute('author')))
  
  p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")

  # Extract and print the text from each <p> tag
  for (p_tag in p_tags) {
    print(p_tag$getElementText())
  }

}

I'm not sure why each comment isn't being extracted only once. There seems to be some issue in how

p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")

is working: the comments come back multiple times instead of once.
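
To narrow this down, you can compare how many <p> tags the per-comment search returns with how many exist on the whole page. This is only a minimal check, assuming the session and comment_list from the code above are still open:

# number of <p> tags matched when searching "within" the first comment
first_comment <- comment_list[[1]]
length(first_comment$findElements(using = "xpath", value = ".//div[3]/div/p"))

# number of <p> tags matched when searching the whole document
length(remDr$findElements(using = "xpath", value = "//div[3]/div/p"))

# if the two counts match, the search is not being scoped to the
# individual comment at all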


Solution

  • findElements searches the entire HTML document even when it is called on a web element; to search only inside that element you need to use findChildElements. This should work (replacing your last loop):

    # loop over the comment elements collected earlier
    lapply(comment_list, \(c) {
      author <- unlist(c$getElementAttribute('author'))
      # findChildElements searches only inside this comment element
      comment <- unlist(lapply(c$findChildElements(using = "xpath", value = ".//div[3]/div/p"), \(p) {
        p$getElementText()
      }))
      
      list(author = author, comment = comment)
    })
    
    #> [[1]]$author
    #> [1] "dotti1999"
    #> 
    #> [[1]]$comment
    #> [1] "this shit was fucking insane"                                                                  
    #> [2] "honestly I adored this anime when I first watched it..."
    #> ...
    #> [[3]]$author
    #> [1] "[deleted]"
    #> 
    #> [[3]]$comment
    #> [1] "Give me a be If premise and I will give it a watch"
    #>  ...
    

    Note that this still doesn't seem to get you replies to comments.
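
    If you do want to keep whatever replies are present in the DOM, one option is to record each comment's nesting level alongside its author and text. This is only a sketch, not part of the answer above: it assumes that the shreddit-comment elements expose a depth attribute (0 for top-level comments, larger values for replies) and that any "more replies" buttons have already been clicked so the replies actually exist on the page.

    # re-collect every comment element currently on the page
    comment_list <- remDr$findElements(using = 'tag name', 'shreddit-comment')
    
    comments <- lapply(comment_list, \(c) {
      list(
        author = unlist(c$getElementAttribute('author')),
        # 'depth' is assumed to mark the nesting level (see note above)
        depth  = unlist(c$getElementAttribute('depth')),
        text   = unlist(lapply(
          c$findChildElements(using = "xpath", value = ".//div[3]/div/p"),
          \(p) p$getElementText()
        ))
      )
    })
    
    # flatten into a data frame with one row per comment
    do.call(rbind, lapply(comments, \(x) data.frame(
      author = x$author,
      depth  = x$depth,
      text   = paste(x$text, collapse = "\n")
    )))

    Whether depth is the right attribute to rely on depends on Reddit's current markup, so it is worth inspecting a shreddit-comment element in the browser's dev tools first.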