Tags: r, pdf, text, nlp, ocr

Split Long, Poorly OCR-ed Two-Column PDF Texts in Half to Generate Speaker Turns in the Correct Order (R)


I am trying to process text for quantitative text analysis. I need to read in PDFs of transcripts from WHO plenary meetings and process the text into a speaker-turn dataframe, identifying each speaker and everything they say afterward. I am attempting this in R, since it is the software with which I am most familiar. My current code is the following:


pacman::p_load(tidyverse,
               stringr,
               fs,
               tibble,
               pdftools,
               dplyr,
               purrr)


speaker_list_unique <- structure(list(country = c("Afghanistan", "Afghanistan", "Albania", 
"Albania", "Albania", "Albania", "Argentina", "Argentina", "Argentina", 
"Australia", "Australia", "Australia", "Australia", "Australia", 
"Australia", "Australia", "Austria", "Austria", "Austria", "Austria", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium"), speaker = c("Dr. G. FAROUK, Deputy Minister for Public Health (Chief Delegate)", 
"Dr. A. ZAHIR, Director-General of the Kabul Municipal Hospitals", 
"Mr. B. SHTYLLA, Minister Plenipotentiary, Ministry of Foreign Affairs (Chief Delegate)", 
"Dr. S. KLosi, Ministry of Public Health", "Mr. V. NATHANAIL, Ministry of Foreign Affairs", 
"Mr. F. KOTA, Assistant Chief, Department for International Organizations, Ministry of Foreign Affairs", 
"Dr. A. ZWANCK, Professor of Hygiene, University of Buenos Aires (Chief Delegate)", 
"Dr. G. GALVEZ BUNGE, Director-General, Department of Sanitary Legislation, Ministry of Public Health", 
"Dr. A. A. Pozzo, Director of Technical Education and Scientific Research, Ministry of Public Health", 
"Dr. G. M. REDSHAW, Chief Medical Officer, Australia House, London (Chief Delegate)", 
"Mr. B. C. BALLARD, Counsellor, Australian Embassy, Paris", "Mr. W. G. A. LANDALE, Second Secretary, Australian Legation, The Hague", 
"Dr. H. E. DOWNES, Assistant Director-General of Health (Chief Delegate)", 
"Dr. D. A. DOWLING, Chief Medical Officer, Australia House, London", 
"Mr. J. PLIMSOLL, Department of External Affairs", "Mr. J. R. ROWLAND, Department of External Affairs", 
"Dr. F. REUTER, Professor, University of Vienna ; Chief, Bureau of Public Health, Ministry of Social Welfare (Chief Delegate)", 
"Dr. F. PUNTIGAM, Counsellor, Ministry of Social Welfare", "Mr. K. STROBL, Counsellor, Ministry of Social Welfare", 
"Dr. A. KHAUM, Director of Public Health (Chief Delegate)", "M. A. VERBIST, Ministre de la Santé publique et de la Famille (Chief Delegate)", 
"M. L. A. D. GEERAERTS, Directeur de Chancellerie de première claise au Ministère des Affaires étrangères et du Commerce extérieur", 
"Professor M. DE LAËT, Secrétaire général du Ministère de la Santé publique et de la Famille", 
"Dr. A. N. DUREN, Conseiller medical au Ministère des Colonies", 
"Baron C. VAN DER BRUGGEN, Attaché de Cabinet au Ministère de la Santé publique et de la Famille"
), speaker_condensed = c("FAROUK", "ZAHIR", "SHTYLLA", "KLOSI", 
"NATHANAIL", "KOTA", "ZWANCK", "GÁLVEZ BUNGE", "POZZO", "REDSHAW", 
"BALLARD", "LANDALE", "DOWNES", "DOWLING", "PLIMSOLL", "ROWLAND", 
"REUTER", "PUNTIGAM", "STROBL", "KHAUM", "VERBIST", "GEERAERTS", 
"DE LAËT", "DUREN", "VAN DER BRUGGEN"), organization = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_)), row.names = c(NA, 
-25L), class = c("tbl_df", "tbl", "data.frame"))

pdf_path <- "webscrape/who/plenary/manual/WHA_1948.pdf"

text_boundaries = tribble(~year,~start_page,~end_page,
                         1948,23,106,
                         1949,79,147,
                         1950,97,187)

# read target pages

pdf_all <- pdf_text(pdf_path)  # read the pdf once, reuse below

start_page <- text_boundaries$start_page[1]
end_page   <- min(text_boundaries$end_page[1], length(pdf_all))

pdf_pages <- pdf_all[start_page:end_page]

# two columns function

read_two_columns <- function(page_text) {
  # Split lines
  lines <- str_split(page_text, "\n")[[1]]
  # split lines in half based on character width
  max_width <- max(nchar(lines))
  mid <- ceiling(max_width / 2)
  
  left_lines <- str_sub(lines, 1, mid)
  right_lines <- str_sub(lines, mid+1, nchar(lines))
  
  # Collapse left first then right
  paste(c(left_lines, right_lines), collapse = " ")
}

# Apply to all pages
full_text <- map_chr(pdf_pages, read_two_columns) %>%
  paste(collapse = " ")

# Clean spacing
full_text <- str_squish(full_text)

# locate speaker turns

speaker_patterns <- speaker_list_unique$speaker_condensed

# Prepare regex 
pattern <- paste0("\\b(", paste(speaker_patterns, collapse="|"), ")\\b")

# Find all matches
matches <- str_locate_all(full_text, regex(pattern, ignore_case = TRUE))[[1]]

# build speaker turn df

speaker_turns <- map_dfr(seq_len(nrow(matches)), function(i) {
  start_pos <- matches[i, "start"]
  end_pos   <- if (i < nrow(matches)) matches[i + 1, "start"] - 1 else nchar(full_text)
  
  speaker_match <- str_sub(full_text, start_pos, matches[i, "end"])
  
  tibble(
    speaker_condensed = toupper(str_squish(speaker_match)),
    text = str_squish(str_sub(full_text, start_pos, end_pos))
  )
})

# join back

speaker_turns <- speaker_turns %>%
  left_join(
    speaker_list_unique %>% select(speaker_condensed, country, speaker),
    by = "speaker_condensed"
  )

speaker_turns

Here I use manually identified speaker names in the speaker_list_unique dataframe to search the text. My issue is that the pdf documents are laid out in two columns, but the OCR layer reads text across the columns. For example, as shown below, when I highlight text going down the left column, the selection crosses the divider and picks up sentences out of order, rather than covering the left column first and then the right column. The same occurs when reading the pdf in R: even if I set it up to read in two-column form, the reading runs across the page rather than returning the text in order.

[pdf_highlight: screenshot showing that highlighting down the left column also selects text across the divider from the right column]

My question is how to read the pdfs so that I can create one comprehensive text string that runs across all the target pdf pages. My thinking would be to split each page down the divider so that the text from the right side of the page can't be read as it works through the left side, and vice versa. However, there are hundreds of pages and pdfs spanning the years 1948-2009, so I need a way to automate this. Is there a way in R to physically separate the pages and reassemble them in order so that they can be read and converted to text strings to assemble the speaker-turn dataframe?
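Something like the following is what I have in mind: render each page to an image, crop it at the divider, and OCR each half separately. This is only a rough, untested sketch; it assumes the magick and tesseract packages are installed, assumes the divider sits at the horizontal midpoint, and ocr_two_columns is a hypothetical helper name, not an existing function.

library(pdftools)
library(magick)   # image_ocr() additionally requires the tesseract package

# hypothetical helper: render one page, crop it into halves at the midpoint,
# OCR each half, and return left-column text followed by right-column text
ocr_two_columns <- function(pdf_path, page, dpi = 300) {
  png_file <- pdf_convert(pdf_path, format = "png", pages = page,
                          dpi = dpi, filenames = tempfile(fileext = ".png"))
  img  <- image_read(png_file)
  info <- image_info(img)
  half <- floor(info$width / 2)   # assumed 50/50 split point
  left  <- image_crop(img, geometry_area(half, info$height, 0, 0))
  right <- image_crop(img, geometry_area(info$width - half, info$height, half, 0))
  paste(image_ocr(left), image_ocr(right), sep = "\n")
}

# e.g. full_text <- paste(map_chr(start_page:end_page,
#                                 \(p) ocr_two_columns(pdf_path, p)),
#                         collapse = " ")

I am not sure whether re-OCR-ing every page this way is the right approach, though, given the volume of documents.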


Solution

  • By default pdftools::pdf_text() attempts to keep the physical layout; for text in raw stream order, call it with raw = TRUE. In the following example the output is split into lines just for easier subsetting and to show that you'd probably want to deal with page headers ([1]) and footers ([52:53]; positions can vary).

    library(pdftools)
    #> Using poppler version 25.05.0
    library(stringr)
    # assuming https://iris.who.int/server/api/core/bitstreams/015a25c1-aae5-4527-ba34-8f1fccf8f6a7/content 
    txt_raw <- pdf_text("Official_record13_eng.pdf",  raw = TRUE)
    txt_raw[[26]] |> 
      # naive dehyphenation
      str_remove_all("-\n") |> 
      str_split_1("\n") |> 
      str_view()
    #>   [1] │ 24 JUNE /948 - 26 --- FIRST PLENARY MEETING
    #>   [2] │ Rules of Procedure for the World Health Assembly, with the amendments as stated in document
    #>   [3] │ S.58.* The Rules of Procedure are printed in
    #>   [4] │ Part H of the Report of the Interim Commission
    #>   [5] │ to the World Health Assembly.5 Are there any
    #>   ...
    #>  [50] │ members from the floor of the Assembly.
    #>  [51] │ The General Committee will consist of the
    #>  [52] │ 4 og. Rec. WHO, 12, 72
    #>  [53] │ 6 Ibid. 19, 97
    #>  [54] │ President, three vice-presidents, the chairmen of
    #>  [55] │ committees, and six members from the floor.
    #>   ...
    #> [100] │ the questions on their agenda. The Nominations
    #> [101] │ Committee should propose the candidature for
    #> [102] │ the President of the Assembly by 4.30 p.m.
    #> [103] │ today.
    #> [104] │ The next meeting of the Assembly will take
    #> [105] │ place at that time. The agenda will be as
    #> [106] │ follows : the report of the Committee on Credentials, if any, and the report of the Nominations Committee with regard to the election of
    #> [107] │ the President.
    #> [108] │ The meeting 70S6 at 12.5 p.m.
    #> [109] │
    

    Single flat string per pdf:

    txt_raw |> 
      str_remove_all("-\n") |>
      str_flatten()
    #> [1] "OFFICIAL RECORDS\nOF THE\nWORLD HEALTH ORGANIZATION\nNo. 13\nFIRST\nWORLD HEALTH ASSEMBLY\nGENEVA, 24 JUNE TO 24 JULY 1948\nPLENARY MEETINGS\nVerbatim Records\nMAIN COMMITTEES\nMinutes and Reports\nSUMMARY OF RESOLUTIONS AND DECISIONS\nWORLD HEALTH ORGANIZATION\nPalais des Nations, Geneva\nDecember 1948\nFOREWORD\nArticle 2 (a) of the Arrangement concluded by the Governments represented at\n / ... / in list of observers\n(UNESCO), 19\nZozaya, J., 253, 256\nin list of delegations (Mexico), 15\nChairman of Committee on\nHeadquarters and Regional Organization, 20, 38,\n77, 8o, 330, 343\nnominated member of the\nExecutive Board, 99\nZwanck, A., 90\nin list of delegations (Argentina), 13\n"
    

    This still includes (most) newlines, page headers and footnotes. Also be prepared for some OCR errors: for example, in the last line of p. 26, "rose" is detected as "70S6".
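
    A rough pass at stripping those before flattening (a sketch only; the "PLENARY MEETING" and footnote patterns are guesses based on page 26 above and would need adjusting across volumes):

    txt_clean <- txt_raw |>
      str_remove_all("-\n") |>                 # naive dehyphenation, as above
      str_split("\n") |>
      lapply(function(lines) {
        # drop running heads like "24 JUNE /948 - 26 --- FIRST PLENARY MEETING"
        lines <- lines[!str_detect(lines, "PLENARY MEETING")]
        # drop footnote lines like "4 og. Rec. WHO, 12, 72" or "6 Ibid. 19, 97"
        lines[!str_detect(lines, "^\\s*\\d+\\s+(og\\.|Off\\.|Ibid)")]
      }) |>
      vapply(paste, character(1), collapse = "\n")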

    In many cases pdf_data() can be more useful than pdf_text(), as it lets you identify areas (columns, headers/footers, paragraphs) by x/y coordinates; in this case, though, the OCR has introduced some variation in the y-values, which makes reconstructing lines a bit tricky.
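
    For reference, a minimal sketch of that pdf_data() route, assuming the column divider sits near the midpoint of the printed area and bucketing the y-values into coarse bins (the divisor of 10 is a guess) to absorb the jitter:

    library(dplyr)
    
    words <- pdf_data("Official_record13_eng.pdf")[[26]]
    mid   <- max(words$x + words$width) / 2    # assumed column divider
    
    words |>
      mutate(column = if_else(x < mid, 1L, 2L),
             line   = round(y / 10)) |>        # coarse line bins vs. OCR jitter
      arrange(column, line, x) |>
      group_by(column, line) |>
      summarise(text = paste(text, collapse = " "), .groups = "drop") |>
      pull(text) |>
      paste(collapse = "\n")

    Page headers that span both columns would still end up split between them, so they need separate handling.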