r, web-scraping, rvest

How do I find html_node on search form?


I have a list of people (first name, last name, and date of birth) that I need to look up on the Fulton County, Georgia (USA) Jail website to determine whether each person is in jail or has been released.

The website is http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400

The site requires you to enter a last name and first name, then it gives you a list of results.

I have found some Stack Overflow posts that have given me some direction, but I'm still struggling to figure this out. I'm using this post as an example to follow. I am using SelectorGadget to help figure out the CSS selectors.

Here is the code I have so far. Right now I can't figure out what html_node to use.

library(rvest)

# Specify URL
fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"

# start session
jail <- html_session(fc.url)

# Grab initial form
form.unfilled <- jail %>% html_node("form")

form.unfilled

The result I get from form.unfilled is {xml_missing} <NA>, which I know isn't right.

I think if I can figure out the html_node value, I can proceed to using set_values and submit_form.


Solution

  • It appears that the first request to the site lands on "http://justice.fultoncountyga.gov/PAJailManager/default.aspx" rather than on the search page, so the page you get back contains no form. Once the session has been started there, you should be able to jump to the search page:

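    library(rvest)
    
    # URLs for the landing page and the search page
    home.url <- "http://justice.fultoncountyga.gov/PAJailManager/default.aspx"
    fc.url <- "http://justice.fultoncountyga.gov/PAJailManager/JailingSearch.aspx?ID=400"
    
    # Start the session on the landing page
    # (in rvest >= 1.0, html_session() is named session())
    jail <- html_session(home.url)
    
    # Jump to the search page within the same session
    # (in rvest >= 1.0, jump_to() is named session_jump_to())
    jail2 <- jail %>% jump_to(fc.url)
    
    # List the form's fields
    html_form(jail2)[[1]]
    
    # Grab the initial form node
    # (set_values() expects the object returned by html_form(), not the raw node)
    form.unfilled <- jail2 %>% html_node("form")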
    

    Note: Verify that your actions comply with the website's terms of service. Many sites have policies against scraping.
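
The remaining steps (checking that a form was actually found, then filling it in) can be tried offline first. The following is only a sketch under stated assumptions: it uses two small stand-in pages instead of the live site, the field names LastName, FirstName, and SearchSubmit are hypothetical (read the real ones from the html_form(jail2)[[1]] output), and it assumes rvest >= 1.0, where html_form_set() takes the place of set_values().

```r
library(rvest)

# Stand-in for the search page. The field names below are hypothetical;
# the real ones come from html_form(jail2)[[1]].
with.form <- read_html('
  <html><body>
    <form action="JailingSearch.aspx?ID=400" method="post">
      <input type="text" name="LastName" value="">
      <input type="text" name="FirstName" value="">
      <input type="submit" name="SearchSubmit" value="Search">
    </form>
  </body></html>')

# Stand-in for a landing page that contains no form at all.
no.form <- read_html("<html><body><p>Welcome</p></body></html>")

# html_node() returns an xml_missing object when nothing matches,
# which is the {xml_missing} <NA> result seen in the question.
missing.node <- html_node(no.form, "form")
form.is.missing <- inherits(missing.node, "xml_missing")

# html_form() returns one entry per <form> element on the page.
forms <- html_form(with.form)
form.unfilled <- forms[[1]]

# Fill in the search fields by name.
form.filled <- html_form_set(form.unfilled, LastName = "Doe", FirstName = "John")

# Against the live site, the filled form would then be submitted
# through the session, for example:
# results <- session_submit(jail2, form.filled)   # rvest >= 1.0
# results <- submit_form(jail2, form.filled)      # older rvest
```

If form.is.missing is TRUE for the page the session returned, the session is probably not on the search page yet; printing the session object shows which URL it actually landed on.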