rjsonrvest

Combining JSON and Regex in R


I am learning how to use the Reddit API - I am trying to learn how to extract all comments from a specific post.

For example - consider this post:https://www.reddit.com/r/Homebrewing/comments/11dd5r3/worst_mistake_youve_made_as_a_homebrewer/

Using this R code, I think I was able to access the comments:

library(httr)
library(jsonlite)

# Set authentication parameters
auth <- authenticate("some-key1", "some_key2")

# Set user agent
user_agent <- "my_app/0.1"

# Get access token
response <- POST("https://www.reddit.com/api/v1/access_token",
                 auth = auth,
                 user_agent = user_agent,
                 body = list(grant_type = "password",
                             username = "abc123",
                             password = "123abc"))

# Extract access token from response
access_token <- content(response)$access_token

# Use access token to make API request
url <- "https://oauth.reddit.com/LISTING" # Replace "LISTING" with the subreddit or endpoint you want to access

headers <- c("Authorization" = paste("Bearer", access_token))
result <- GET(url, user_agent(user_agent), add_headers(headers))

post_id <- "11dd5r3"
url <- paste0("https://oauth.reddit.com/r/Homebrewing/comments/", post_id)

# Set the user agent string 
user_agent_string <- "MyApp/1.0"

# Set the authorization header 
authorization_header <- paste("Bearer ", access_token, sep = "")

# Make the API request 
response <- GET(url, add_headers(Authorization = authorization_header, `User-Agent` = user_agent_string))

# Extract the response content and parse 
response_json <- rawToChar(response$content)

From here, it looks like all comments are stored between a set of <p> and </p>:

Using this logic, I tried to only keep text between these symbols via Regex:

final = response_json[1]
matches <- gregexpr("<p>(.*?)</p>", final)
matches_text <- regmatches(final, matches)[[1]]

I think this code partly worked - but many entries were returned that were not comments:

[212] "<p>Worst mistake was buying malt hops and yeast and letting it go stale.</p>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
[213] "<p>Posts&#32;are&#32;automatically&#32;archived&#32;after&#32;6&#32;months.</p>"

Can someone please show me a better way of doing this? How can I only extract the comment text and nothing else?

Thanks!


Solution

  • If you want to use regex anyway, probably you should try a pattern like (?<=<p>).*?(?=</p>), e.g.,

    > s <- "<p>xxxxx</p> <p>xyyyyyyyyy</p> <p>zzzzzzzzzzzz</p>"
    
    > regmatches(s, gregexpr("(?<=<p>).*?(?=</p>)", s, perl = TRUE))[[1]]
    [1] "xxxxx"        "xyyyyyyyyy"   "zzzzzzzzzzzz"