r pdftools

Can `pdftools` detect which radio button is selected on a fillable form?


Edit April 22, 2024 - I'm going to close this question after about three weeks. The short answer to the original question about whether the R package pdftools can parse pdf forms appears to be "no". Both @Felix_Jassler and @K-J provided useful "work-arounds". Personally, I found the combination of reticulate and the python module pypdf the best fit for my needs but that's a very different question.

The pdftools package has worked extremely well for processing "traditional" non-fillable forms.

In my first attempts at using it with fillable forms, I can't find any way to distinguish selected radio buttons from unselected ones (and no elegant way to do so for checkboxes). I'm not sure whether I'm missing some nuance, making a complete mistake, or whether the functions simply don't support it.

An example "blank" original form is available at https://www.cdc.gov/infectioncontrol/pdf/icar/IPC-demo-LTC-508.pdf (the same URL the reprex downloads). For the reprex below I am focusing on a small segment of the form on page 1, which I captured in a screenshot after filling out and saving a few entries.

I would like to know if there is a way to distinguish the fact that "Long Term Care" is selected in the filled out form versus not selected in the original?

Thank you in advance. Below is what I hope is a helpful reprex. Since I could not find an easy, safe place to post the filled-out example form, I used dput to embed the resulting data in the reprex. Users can grab the original and save changes to their local filesystem if desired.

suppressPackageStartupMessages(library(dplyr))
library(stringr)
## Not sure if poppler version matters?
library(pdftools)
#> Using poppler version 23.04.0

## Download and save the original form as original.pdf
download.file("https://www.cdc.gov/infectioncontrol/pdf/icar/IPC-demo-LTC-508.pdf", 
              "original.pdf")
## Download and save a second copy of the original form as filled.pdf
# download.file("https://www.cdc.gov/infectioncontrol/pdf/icar/IPC-demo-LTC-508.pdf", 
#              "filled.pdf")


## Let's use just the first page for the reprex
## Improve human readability with the str_split
original_pageone_text <- str_split(pdf_text("original.pdf")[[1]], "\n")[[1]]
original_pageone_text
#>  [1] "      Infection Control Assessment and Response (ICAR) Tool for General Infection"                                         
#>  [2] "                      Prevention and Control (IPC) Across Settings"                                                        
#>  [3] ""                                                                                                                          
#>  [4] " Section 1: Facility Demographics and Infection Prevention and Control (IPC) Infrastructure"                               
#>  [5] "                                     Long–Term Care"                                                                       
#>  [6] "General Facility Demographics and IPC Infrastructure"                                                                      
#>  [7] ""                                                                                                                          
#>  [8] "Date of Assessment:"                                                                                                       
#>  [9] ""                                                                                                                          
#> [10] "Facility Name:"                                                                                                            
#> [11] ""                                                                                                                          
#> [12] "State/Territory:                                                                County:"                                   
#> [13] ""                                                                                                                          
#> [14] "Zip Code:                 State/Territory-assigned Unique ID (if applicable):"                                             
#> [15] ""                                                                                                                          
#> [16] "Facility type (Complete the demographic form that             NHSN Facility Organization ID (if applicable):"              
#> [17] "corresponds to the type of facility):"                                                                                     
#> [18] "                                                              CMS Facility ID (if applicable):"                            
#> [19] " ● Acute Care Hospital / Critical Access Hospital"                                                                         
#> [20] " ● Long-term Care"                                                                                                         
#> [21] " ● Outpatient/Ambulatory Care"                                                                                             
#> [22] " ● Other (specify):"                                                                                                       
#> [23] ""                                                                                                                          
#> [24] ""                                                                                                                          
#> [25] ""                                                                                                                          
#> [26] "Facility Respondent Name(s) and Job Title(s):"                                                                             
#> [27] ""                                                                                                                          
#> [28] ""                                                                                                                          
#> [29] ""                                                                                                                          
#> [30] "Rationale for assessment:"                                                                                                 
#> [31] " ■ Requested by facility"                                                                                                  
#> [32] " ■ Requested by accrediting agency/ licensing organization"                                                                
#> [33] " ■ Requested by state or local health department"                                                                          
#> [34] " ■ HAI prevention focused:"                                                                                                
#> [35] "     ■ CAUTI"                                                                                                              
#> [36] "     ■ CLABSI"                                                                                                             
#> [37] "     ■ SSI"                                                                                                                
#> [38] "     ■ CDI"                                                                                                                
#> [39] "     ■ Other (specify):"                                                                                                   
#> [40] " ■ Prevention collaborative (specify partners):"                                                                           
#> [41] " ■ Outbreak (specify):"                                                                                                    
#> [42] " ■ Other (specify):"                                                                                                       
#> [43] ""                                                                                                                          
#> [44] "                         Obtain a list of products used for cleaning and disinfection of environmental surfaces and"       
#> [45] "                                          non-critical patient/resident care equipment in the facility"                    
#> [46] ""                                                                                                                          
#> [47] "EPA registration number(s) for products used in patient/resident rooms:"                                                   
#> [48] ""                                                                                                                          
#> [49] "EPA registration number(s) for products used in common areas:"                                                             
#> [50] ""                                                                                                                          
#> [51] "EPA registration number(s) for products used on non-critical patient/resident care equipment (e.g., blood glucose meters):"
#> [52] ""                                                                                                                          
#> [53] ""                                                                                                                          
#> [54] ""                                                                                                                          
#> [55] ""                                                                                                                          
#> [56] "CS334433-M 12/14/2022"                                                                                                     
#> [57] ""
## Let's focus on the radio buttons at lines 19 through 22. I plan on
## selecting line 20. In case the difference is only in how the character
## is "displayed" on-screen, we'll get the value as integer code points
utf8ToInt(original_pageone_text[20]) #9679 is the radio button
#>  [1]   32 9679   32   76  111  110  103   45  116  101  114  109   32   67   97
#> [16]  114  101
original_radio_button <- utf8ToInt(original_pageone_text[20])


## Open filled.pdf and manually enter a few text boxes
## Select long term care radio button and save
## On screen it "appears" different
## To keep it reproducible I am putting the results
## of dput(str_split(pdf_text("filled.pdf")[[1]], "\n")[[1]]) here

filled_pageone_text <-
  structure(c("      Infection Control Assessment and Response (ICAR) Tool for General Infection", 
              "                      Prevention and Control (IPC) Across Settings", 
              "", " Section 1: Facility Demographics and Infection Prevention and Control (IPC) Infrastructure", 
              "                                     Long–Term Care", "General Facility Demographics and IPC Infrastructure", 
              "", "Date of Assessment:", "", "Facility Name: Imaginary", "", 
              "State/Territory:                                                                 County:", 
              "", "Zip Code:                  State/Territory-assigned Unique ID (if applicable):", 
              "", "Facility type (Complete the demographic form that              NHSN Facility Organization ID (if applicable):", 
              "corresponds to the type of facility):", "                                                               CMS Facility ID (if applicable):", 
              " ● Acute Care Hospital / Critical Access Hospital", " ● Long-term Care", 
              " ● Outpatient/Ambulatory Care", " ● Other (specify):", "", 
              "", "", "Facility Respondent Name(s) and Job Title(s):", "", 
              "", "", "Rationale for assessment:", " ✓", " ■ Requested by facility", 
              " ✓", " ■ Requested by accrediting agency/ licensing organization", 
              " ■ Requested by state or local health department", " ✓", 
              " ■ HAI prevention focused:", " ✓", "     ✓ CAUTI", "     ■", 
              "     ✓ CLABSI", "     ■", "     ■ SSI", "     ■ CDI", 
              "     ■ Other (specify):", " ■ Prevention collaborative (specify partners):", 
              " ✓", " ■ Outbreak (specify):", " ■ Other (specify):", 
              "", "                         Obtain a list of products used for cleaning and disinfection of environmental surfaces and", 
              "                                          non-critical patient/resident care equipment in the facility", 
              "", "EPA registration number(s) for products used in patient/resident rooms:", 
              "", "EPA registration number(s) for products used in common areas:", 
              "", "EPA registration number(s) for products used on non-critical patient/resident care equipment (e.g., blood glucose meters):", 
              "", "", "", "", "CS334433-M 12/14/2022", ""))

filled_pageone_text
#>  [1] "      Infection Control Assessment and Response (ICAR) Tool for General Infection"                                         
#>  [2] "                      Prevention and Control (IPC) Across Settings"                                                        
#>  [3] ""                                                                                                                          
#>  [4] " Section 1: Facility Demographics and Infection Prevention and Control (IPC) Infrastructure"                               
#>  [5] "                                     Long–Term Care"                                                                       
#>  [6] "General Facility Demographics and IPC Infrastructure"                                                                      
#>  [7] ""                                                                                                                          
#>  [8] "Date of Assessment:"                                                                                                       
#>  [9] ""                                                                                                                          
#> [10] "Facility Name: Imaginary"                                                                                                  
#> [11] ""                                                                                                                          
#> [12] "State/Territory:                                                                 County:"                                  
#> [13] ""                                                                                                                          
#> [14] "Zip Code:                  State/Territory-assigned Unique ID (if applicable):"                                            
#> [15] ""                                                                                                                          
#> [16] "Facility type (Complete the demographic form that              NHSN Facility Organization ID (if applicable):"             
#> [17] "corresponds to the type of facility):"                                                                                     
#> [18] "                                                               CMS Facility ID (if applicable):"                           
#> [19] " ● Acute Care Hospital / Critical Access Hospital"                                                                         
#> [20] " ● Long-term Care"                                                                                                         
#> [21] " ● Outpatient/Ambulatory Care"                                                                                             
#> [22] " ● Other (specify):"                                                                                                       
#> [23] ""                                                                                                                          
#> [24] ""                                                                                                                          
#> [25] ""                                                                                                                          
#> [26] "Facility Respondent Name(s) and Job Title(s):"                                                                             
#> [27] ""                                                                                                                          
#> [28] ""                                                                                                                          
#> [29] ""                                                                                                                          
#> [30] "Rationale for assessment:"                                                                                                 
#> [31] " ✓"                                                                                                                        
#> [32] " ■ Requested by facility"                                                                                                  
#> [33] " ✓"                                                                                                                        
#> [34] " ■ Requested by accrediting agency/ licensing organization"                                                                
#> [35] " ■ Requested by state or local health department"                                                                          
#> [36] " ✓"                                                                                                                        
#> [37] " ■ HAI prevention focused:"                                                                                                
#> [38] " ✓"                                                                                                                        
#> [39] "     ✓ CAUTI"                                                                                                              
#> [40] "     ■"                                                                                                                    
#> [41] "     ✓ CLABSI"                                                                                                             
#> [42] "     ■"                                                                                                                    
#> [43] "     ■ SSI"                                                                                                                
#> [44] "     ■ CDI"                                                                                                                
#> [45] "     ■ Other (specify):"                                                                                                   
#> [46] " ■ Prevention collaborative (specify partners):"                                                                           
#> [47] " ✓"                                                                                                                        
#> [48] " ■ Outbreak (specify):"                                                                                                    
#> [49] " ■ Other (specify):"                                                                                                       
#> [50] ""                                                                                                                          
#> [51] "                         Obtain a list of products used for cleaning and disinfection of environmental surfaces and"       
#> [52] "                                          non-critical patient/resident care equipment in the facility"                    
#> [53] ""                                                                                                                          
#> [54] "EPA registration number(s) for products used in patient/resident rooms:"                                                   
#> [55] ""                                                                                                                          
#> [56] "EPA registration number(s) for products used in common areas:"                                                             
#> [57] ""                                                                                                                          
#> [58] "EPA registration number(s) for products used on non-critical patient/resident care equipment (e.g., blood glucose meters):"
#> [59] ""                                                                                                                          
#> [60] ""                                                                                                                          
#> [61] ""                                                                                                                          
#> [62] ""                                                                                                                          
#> [63] "CS334433-M 12/14/2022"                                                                                                     
#> [64] ""
## Same methodology as above
utf8ToInt(filled_pageone_text[20]) #9679 is the radio button
#>  [1]   32 9679   32   76  111  110  103   45  116  101  114  109   32   67   97
#> [16]  114  101
filled_radio_button <- utf8ToInt(filled_pageone_text[20])

## Clearly different
identical(original_pageone_text, filled_pageone_text)
#> [1] FALSE
setdiff(filled_pageone_text, original_pageone_text)
#> [1] "Facility Name: Imaginary"                                                                                     
#> [2] "State/Territory:                                                                 County:"                     
#> [3] "Zip Code:                  State/Territory-assigned Unique ID (if applicable):"                               
#> [4] "Facility type (Complete the demographic form that              NHSN Facility Organization ID (if applicable):"
#> [5] "                                                               CMS Facility ID (if applicable):"              
#> [6] " ✓"                                                                                                           
#> [7] "     ✓ CAUTI"                                                                                                 
#> [8] "     ■"                                                                                                       
#> [9] "     ✓ CLABSI"

identical(original_radio_button, filled_radio_button)
#> [1] TRUE


sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pdftools_3.4.0 stringr_1.5.1  dplyr_1.1.4   
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.2         knitr_1.45        rlang_1.1.3      
#>  [5] xfun_0.42         stringi_1.8.3     purrr_1.0.2       styler_1.10.2    
#>  [9] generics_0.1.3    glue_1.7.0        askpass_1.2.0     qpdf_1.3.2       
#> [13] htmltools_0.5.7   fansi_1.0.6       rmarkdown_2.25    R.cache_0.16.0   
#> [17] tibble_3.2.1      evaluate_0.23     fastmap_1.1.1     yaml_2.3.8       
#> [21] lifecycle_1.0.4   compiler_4.3.2    fs_1.6.3          Rcpp_1.0.12      
#> [25] pkgconfig_2.0.3   rstudioapi_0.15.0 R.oo_1.26.0       R.utils_2.12.3   
#> [29] digest_0.6.34     R6_2.5.1          tidyselect_1.2.0  utf8_1.2.4       
#> [33] reprex_2.1.0      pillar_1.9.0      magrittr_2.0.3    R.methodsS3_1.8.2
#> [37] tools_4.3.2       withr_3.0.0

Created on 2024-04-01 with reprex v2.1.0
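
To make the symptom above concrete, here is a minimal sketch (not part of the original reprex) that counts the marker glyphs in a handful of the extracted lines. The helper name count_glyphs and the two short excerpt vectors are assumptions made for illustration; the excerpt lines themselves are copied from the outputs above.

```r
# Count how often each marker glyph occurs in a vector of extracted lines.
# Glyphs: U+25CF (radio button), U+25A0 (checkbox), U+2713 (check mark).
count_glyphs <- function(lines,
                         glyphs = c(radio = "\u25CF", box = "\u25A0", check = "\u2713")) {
  vapply(
    glyphs,
    function(g) sum(lengths(regmatches(lines, gregexpr(g, lines, fixed = TRUE)))),
    integer(1)
  )
}

# A few lines taken from original_pageone_text (illustrative excerpt)
original_excerpt <- c(" \u25CF Long-term Care",
                      " \u25A0 Requested by facility",
                      "     \u25A0 CAUTI")

# The corresponding region of filled_pageone_text (illustrative excerpt)
filled_excerpt <- c(" \u25CF Long-term Care",
                    " \u2713",
                    " \u25A0 Requested by facility",
                    "     \u2713 CAUTI",
                    "     \u25A0")

# Difference in glyph counts between the filled and the blank excerpt
glyph_diff <- count_glyphs(filled_excerpt) - count_glyphs(original_excerpt)
glyph_diff
```

In this excerpt the check-mark count changes while the radio-button count does not, which matches the identical() result above: the extracted text carries some trace of ticked checkboxes but none of the selected radio button.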


Solution

  • Inspired by @KJ's comment, here's a somewhat laborious approach. We start by reading the binary of the PDF file.

    filename <- "original.pdf"
    
    # from the binary, remove all non-printable ascii characters
    # then, convert to string
    pdf_content <- (
        readBin(filename, "raw", file.info(filename)$size)
        |> Filter(f = function(x) x == 10 || (x >= 32 && x < 127))
        |> rawToChar()
    )
    
    pdf_lines <- strsplit(pdf_content, "\n")[[1]]
    

    From here, we could try to find some of the selected values by searching for tags starting with the /V attribute.

    > lines <- which(grepl('<< /V', pdf_lines, fixed = TRUE))
    > pdf_lines[lines]                  
     [ ] ... 
    [20] "<< /V /Long-term#20Care /Kids [ 25 0 R 26 0 R 27 0 R 28 0 R ] /T (S1 GF 7) /FT /Btn >>"
    

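    Notice that the document contains 20 radio button groups. In this context, the attributes mean the following:

      • /V: the field's current value; for a radio button group, the name of the selected option
      • /Kids: references to the child objects, here the individual radio buttons in the group
      • /T: the field's (partial) name, used below as the title of the group
      • /FT: the field type; /Btn covers push buttons, checkboxes, and radio buttons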

    With this approach, a simple solution is to find all /V tags and then work out which titled tag corresponds to which radio button group in the PDF. Here, I'll use a regular expression to extract all key-value pairs for the radio buttons (/T being the key, /V being the value).

    # note we're parsing the entire pdf string, not the lines individually
    findRadioButtonValues <- function(pdf_content) {
        # collapsing runs of whitespace ('\\s+') into a single space is not really necessary in your document
        # I'll just add it as a safeguard in case it becomes a problem in other documents
    
        # Note also that I'm assuming that all radio buttons start with '<< /V'
        values <- (
            pdf_content
            |> stringr::str_replace_all('\n', ' ')
            |> stringr::str_replace_all('\\s+', ' ')
            |> stringr::str_match_all('<< /V /(.*?) [^>]*?/T (\\([^)]*\\))[^>]*>>')
        )
        values <- values[[1]]
        
        # the match matrix has three columns: the full match plus two capture groups
        # column 1: entire tags, "<< ... >>" (can be omitted)
        # column 2: value of tags
        # column 3: title of tags
        values[,c(3,2)]
        
        # if you prefer a named list with appropriate key-value pairs:
        # structure(as.list(values[,2]), names = values[,3])
    }
    

    Then, we can call the function and get back all 20 radio button values.

    > findRadioButtonValues(pdf_content)
          [,1]         [,2]                         
     [1,] "(LTC 12f)"  ""                           
     [2,] "(LTC 12d )" ""                           
     [3,] "(LTC 12c)"  ""                           
     [4,] "(LTC 12)"   ""                           
     [5,] "(LTC 9a 1)" ""                           
     [6,] "(LTC 9)"    ""                           
     [7,] "(LTC 8)"    ""                           
     [8,] "(LTC 4)"    ""                           
     [9,] "(LTC 3)"    "Government#20#28not#20VA#29"
    [10,] "(LTC 2)"    "Dual#20Medicare#2fMedicaid" 
    [11,] "(S1 11)"    ""                           
    [12,] "(S1 10)"    "Not#20Assessed"             
    [13,] "(S1 9)"     "Yes"                        
    [14,] "(S1 8)"     "No"                         
    [15,] "(S1 7)"     "Unknown"                    
    [16,] "(S1 6)"     "Not#20Assessed"             
    [17,] "(S1 3a)"    "No"                         
    [18,] "(S1 2a)"    "Unknown"                    
    [19,] "(S1 1a)"    "No"                         
    [20,] "(S1 GF 7)"  "Long-term#20Care"
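
The values above are PDF name objects, in which a "#" followed by two hexadecimal digits stands for a single byte (for example, #20 is a space). Below is a minimal sketch for turning them back into readable text. It assumes every escape encodes a single-byte ASCII character, as in this output, and decode_pdf_name is a hypothetical helper, not part of pdftools or stringr.

```r
# Decode "#xx" escapes in PDF names, e.g. "Long-term#20Care" -> "Long-term Care".
# Assumption: every escape encodes a single-byte (ASCII) character.
decode_pdf_name <- function(x) {
  decode_one <- function(s) {
    m <- gregexpr("#[0-9A-Fa-f]{2}", s)
    codes <- regmatches(s, m)[[1]]
    if (length(codes) == 0) {
      return(s)  # nothing to decode (also covers empty strings)
    }
    # Turn each "#xx" into the character with that hexadecimal byte value
    chars <- vapply(
      codes,
      function(code) rawToChar(as.raw(strtoi(substr(code, 2, 3), 16L))),
      character(1),
      USE.NAMES = FALSE
    )
    regmatches(s, m) <- list(chars)
    s
  }
  vapply(x, decode_one, character(1), USE.NAMES = FALSE)
}

# Values copied from the output above
decoded <- decode_pdf_name(c("Long-term#20Care",
                             "Dual#20Medicare#2fMedicaid",
                             "Government#20#28not#20VA#29",
                             "Yes",
                             ""))
decoded
```

Applied to the second column of the matrix returned by findRadioButtonValues, this would give labels that can be compared directly with the option text printed on the form.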