I was writing code to web scrape the post title, comments, and author names from a Reddit post for a project. I am able to scrape the post title and author names, but the comments are not being extracted correctly: if there are 31 comments on the post, each comment is extracted 31 times. Here is the code below for reference:
# load packages
library(RSelenium)
library(netstat)
# start the server
rs_driver_object <- rsDriver(browser = 'firefox', verbose = FALSE, port = free_port(), chromever = NULL)
# create a client object
remDr <- rs_driver_object$client
# open a browser
remDr$open()
# maximize window
remDr$maxWindowSize()
remDr$navigate("https://www.reddit.com/r/AnimeReviews/comments/essf1u/assassination_classroom_is_a_1010_the_charm_the/")
Sys.sleep(2)
# scroll to the end of the webpage
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(2)
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
load_more_comments <- remDr$findElement(using = 'xpath', '//*[@id="comment-tree"]/faceplate-partial/div[1]/button')
load_more_comments$clickElement()
#load_more_comments$refresh()
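# sketch (optional): a single click may not reveal every comment; one way to
# exhaust the button is to keep clicking until it can no longer be found.
# findElement() throws an error when nothing matches, so tryCatch() turns
# "button gone" into NULL and ends the loop (same XPath as above)
repeat {
  btn <- tryCatch(
    remDr$findElement(using = 'xpath', '//*[@id="comment-tree"]/faceplate-partial/div[1]/button'),
    error = function(e) NULL
  )
  if (is.null(btn)) break
  btn$clickElement()
  Sys.sleep(2)
}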
#pickup title
title <- remDr$findElement(using = 'xpath', '//*[@id="main-content"]/shreddit-title')$getElementAttribute('title')
#comments
comment_list <- remDr$findElements(using = 'tag name', 'shreddit-comment')
#print(typeof(comment_list))
for (each_comment in comment_list) {
  print(paste("Author --->", each_comment$getElementAttribute('author')))
  p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")
  # Extract and print the text from each <p> tag
  for (p_tag in p_tags) {
    print(p_tag$getElementText())
  }
}
I'm not sure why each comment isn't extracted just once. There seems to be some issue in how
p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")
is working: I tried scraping the Reddit comments with RSelenium as in the code above, but each comment comes back multiple times instead of once.
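A quick sanity check that points at the scoping problem (the XPath here is the same one from the loop above): if the search were evaluated relative to each comment, the number of matched <p> nodes would vary from comment to comment, but every iteration returns the same page-wide count.

for (each_comment in comment_list) {
  # a constant count across all iterations means the whole document is being
  # searched each time, not just the subtree under each_comment
  p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")
  print(length(p_tags))
}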
findElements searches the entire HTML document, even when called on an element; you need to use findChildElements instead. This should work (replacing your last loop):
lapply(comment_list, \(c) {
  author <- unlist(c$getElementAttribute('author'))
  comment <- unlist(lapply(c$findChildElements(using = "xpath", value = ".//div[3]/div/p"), \(p) {
    p$getElementText()
  }))
  list(author = author, comment = comment)
})
#> [[1]]$author
#> [1] "dotti1999"
#>
#> [[1]]$comment
#> [1] "this shit was fucking insane"
#> [2] "honestly I adored this anime when I first watched it..."
#> ...
#> [[3]]$author
#> [1] "[deleted]"
#>
#> [[3]]$comment
#> [1] "Give me a be If premise and I will give it a watch"
#> ...
Note that this still doesn't seem to get you replies to comments.
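If you do want the replies, one possible direction (an untested sketch; it assumes new Reddit's shreddit-comment elements expose a depth attribute marking nesting level, which I haven't verified) is to keep every shreddit-comment element once the replies have actually been loaded into the DOM and record how deep each one sits:

all_comments <- remDr$findElements(using = 'tag name', 'shreddit-comment')
threaded <- lapply(all_comments, \(c) {
  list(
    author  = unlist(c$getElementAttribute('author')),
    # assumed markup: depth "0" for top-level comments, "1" for replies, etc.
    depth   = as.integer(unlist(c$getElementAttribute('depth'))),
    # same XPath as above; with replies present it may also pick up
    # paragraphs nested inside child comments
    comment = unlist(lapply(
      c$findChildElements(using = "xpath", value = ".//div[3]/div/p"),
      \(p) p$getElementText()
    ))
  )
})

Replies hidden behind a "more replies" button would still need to be clicked open first, just like the top-level "load more comments" button.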