I am reading a PDF in R using library(pdftools):
library(tidyverse)
library(pdftools)
library(lubridate)
pdf_rowwise <- strsplit(pdf_text("V://path//sample.pdf"), split = "\n")
class(pdf_rowwise[[1]][8:18])
output: [1] "character"
Now taking a sample from this PDF:
pdf_rowwise[[1]][8:18]
[1] "Test Name Result Biological Ref. Int. Unit"
[2] ""
[3] " 100 TEST AAROGYA 2.0"
[4] " THYROID PROFILE,Serum"
[5] "TOTAL TRI IODOTHYRONINE - T3 0.89 0.80-2.0 ng/ml"
[6] " (Method : CLIA)"
[7] ""
[8] "TOTAL THYROXINE - T4 8.64 6.09 - 12.23 ug/dL"
[9] " (Method : CLIA)"
[10] ""
[11] "THYROID STIMULATING HORMONE - TSH 5.660H 0.35 - 5.50 uIU/mL"
I have also saved the above output as a text file at https://raw.githubusercontent.com/johnsnow09/stackover_doubts/main/sample_pdf_text.txt
Either the text above or the text file can be used as the data source. From it I am trying to extract the data on lines 5, 8 and 11 into a data frame with 3 or 4 columns.
Desired Output:
I have tried the code below, but none of it works for me:
strsplit(pdf_rowwise[[1]][8:18], split = "\t")
pdf_rowwise[[1]][8:18] %>% as.tibble()
# this combines everything into 1 column dataframe
# the code below also doesn't work
strsplit(pdf_rowwise[[1]][8:18], split = "\t") %>% as.tibble()
strsplit(pdf_rowwise[[1]][8:18], split = "\t") %>% list2DF()
str_split_fixed(pdf_rowwise[[1]][8:18]," ",2)
# not giving what I expected
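One quick diagnostic, offered as a sketch rather than a fix: the pdf_text() output shown further down contains no "\t" characters, so splitting on tabs leaves every line as a single piece. The example below uses two made-up lines (`lines_demo`, with assumed spacing, since the real widths depend on the PDF) to compare a tab split with a split on runs of 4 or more spaces.

```r
# Hypothetical sample lines; the exact number of spaces in a real PDF will differ.
lines_demo <- c(
  "TOTAL THYROXINE - T4        8.64      6.09 - 12.23      ug/dL",
  "  (Method : CLIA)"
)

# Does any line contain a tab? If not, split = "\t" cannot separate columns.
has_tab <- any(grepl("\t", lines_demo, fixed = TRUE))

# Number of pieces per line under each splitting rule.
n_tab    <- lengths(strsplit(lines_demo, "\t", fixed = TRUE))
n_spaces <- lengths(strsplit(lines_demo, " {4,}"))

print(has_tab)
print(n_tab)
print(n_spaces)
```

If `n_spaces` reports 4 pieces for the result rows and fewer for the other lines, the field count can be used to pick out the rows of interest.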
I am new to this sort of parsing and extraction, so I am not sure which library and functions are best suited for this work.
UPDATE:
I am also trying to use tabulapdf and have noticed \r in its output. Could this be of any use for column separation?
library(tabulapdf)
strsplit(tabulapdf::extract_text("V:path//sample.pdf"),'\n')
[[1]]
[1] "100 TEST AAROGYA 2.0\r"
[2] "THYROID PROFILE,Serum\r"
[3] "TOTAL TRI IODOTHYRONINE - T3\r"
[4] "(Method : CLIA)\r"
[5] "0.89 0.80-2.0 ng/ml\r"
[6] "TOTAL THYROXINE - T4\r"
[7] "(Method : CLIA)\r"
[8] "8.64 6.09 - 12.23 ug/dL\r"
[9] "THYROID STIMULATING HORMONE - TSH\r"
[10] "(Method : CLIA)\r"
[11] "5.660H 0.35 - 5.50 uIU/mL\r"
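On the \r question: in the output above every \r sits directly before the \n that was split on, which is the usual Windows "\r\n" line ending rather than a column marker. A minimal sketch, using a shortened made-up string (`raw_demo`) in place of the real extract_text() result, of splitting on both characters at once:

```r
# Shortened stand-in for the extract_text() result (assumed for illustration).
raw_demo <- "TOTAL THYROXINE - T4\r\n(Method : CLIA)\r\n8.64 6.09 - 12.23 ug/dL\r\n"

# "\r?\n" matches either "\r\n" or a bare "\n", so no stray "\r" is left behind.
lines_clean <- strsplit(raw_demo, "\r?\n")[[1]]

print(lines_clean)
```

Note that in this layout the value line comes after the name and method lines, so the columns would still have to be paired up by position.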
Sample text as a single string:
tabulapdf::extract_text("V:path//sample.pdf")
[1] "100 TEST AAROGYA 2.0\r\nTHYROID PROFILE,Serum\r\nTOTAL TRI IODOTHYRONINE - T3\r\n(Method : CLIA)\r\n0.89 0.80-2.0 ng/ml\r\nTOTAL THYROXINE - T4\r\n(Method : CLIA)\r\n8.64 6.09 - 12.23 ug/dL\r\nTHYROID STIMULATING HORMONE - TSH\r\n(Method : CLIA)\r\n5.660H 0.35 - 5.50 uIU/mL\r\nPregnancy reference ranges for TSH\r\n1st Trimester : 0.10 - 2.50\r\n2nd Trimester : 0.20 - 3.00\r\n3rd Trimester : 0.30 - 3.00\r\nReference: Guidelines of American Thyroid Association for the Diagnosis and Management of Thyroid Disease During Pregnancy\r\nand Postpartum, Thyroid, 2011, 21; 1-46\r\nCOMMENTS:\r\nThe levels of Thyroid hormones (T3, T4 & FT3, FT4) are low in case of Primary, Secondary and Tertiary hypothyroidism and\r\nsometimes in nonthyroidal illness also.
# pdf text read results
pdf_text("V://path//sample.pdf")
output:
Test Name Result Biological Ref. Int. Unit\n\n 100 TEST AAROGYA 2.0\n THYROID PROFILE,Serum\nTOTAL TRI IODOTHYRONINE - T3 0.89 0.80-2.0 ng/ml\n (Method : CLIA)\n\nTOTAL THYROXINE - T4 8.64 6.09 - 12.23 ug/dL\n (Method : CLIA)\n\nTHYROID STIMULATING HORMONE - TSH 5.660H 0.35 - 5.50 uIU/mL\n (Method : CLIA)\n\nPregnancy reference ranges for TSH\n1st Trimester : 0.10 - 2.50\n2nd Trimester : 0.20 - 3.00\n3rd Trimester : 0.30 - 3.00\nReference: Guidelines of American Thyroid Association for the Diagnosis and Management of Thyroid Disease During Pregnancy\nand Postpartum, Thyroid, 2011, 21; 1-46\n\nCOMMENTS:\nThe levels of Thyroid hormones (T3, T4 & FT3, FT4) are low in case of Primary, Secondary and Tertiary hypothyroidism and\nsometimes in nonthyroidal illness also. Increase levels are found in Grave’s disease, Hyperthyroidism and Thyroid Hormone\nresistance. TSH levels are raised in Primary Hypothyroidism and are low in Hyperthyroidism and secondary hypothyroidism.\n\nNOTE:\nTSH levels are subject to circadian variation, reaction peak levels between 2-4 am and at a minimum between 6-10 pm. The\nvariation is of the day has influence on the measured serum TSH concentrations.\nTSH values <0.03 uIU/ml need to be clinically correlated due to presence of a rare TSH variant in some individuals.\n\n\n\n\n Page 1 of 18\n"
Using the input shown at the end: split each line on runs of 4 or more spaces to get a list, keep the list elements that have exactly 4 fields, paste the fields together with comma separators (no comma appears in the rows that are kept), convert the list to a character vector, and read it in with read.csv. No packages are used.
txt |>
  strsplit(" {4,}") |>
  Filter(f = \(x) length(x) == 4) |>
  lapply(paste, collapse = ",") |>
  do.call(what = "c") |>
  read.csv(text = _, check.names = FALSE)
giving
                          Test Name Result Biological Ref. Int.   Unit
1      TOTAL TRI IODOTHYRONINE - T3   0.89             0.80-2.0  ng/ml
2              TOTAL THYROXINE - T4   8.64         6.09 - 12.23  ug/dL
3 THYROID STIMULATING HORMONE - TSH 5.660H          0.35 - 5.50 uIU/mL
Input used
txt <- c("Test Name    Result    Biological Ref. Int.    Unit",
"", " 100 TEST AAROGYA 2.0",
" THYROID PROFILE,Serum",
"TOTAL TRI IODOTHYRONINE - T3    0.89    0.80-2.0    ng/ml",
" (Method : CLIA)", "", "TOTAL THYROXINE - T4    8.64    6.09 - 12.23    ug/dL",
" (Method : CLIA)", "", "THYROID STIMULATING HORMONE - TSH    5.660H    0.35 - 5.50    uIU/mL"
)
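To apply the same idea directly to one page returned by pdf_text(), the steps could be wrapped in a small helper. This is only a sketch: `parse_result_table` and `page_demo` are hypothetical names, the demo string uses assumed spacing, the 4-space threshold may need adjusting for other PDFs, and the helper assumes the page contains at least a header row with the expected number of fields.

```r
# Hypothetical helper: turn one page of pdf_text() output into a data frame.
parse_result_table <- function(page, min_spaces = 4, n_fields = 4) {
  lines <- strsplit(page, "\r?\n")[[1]]                      # one element per line
  parts <- strsplit(lines, sprintf(" {%d,}", min_spaces))    # split on wide gaps
  rows  <- Filter(function(x) length(x) == n_fields, parts)  # keep full rows only
  mat   <- do.call(rbind, rows)                              # first row is the header
  out   <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(out) <- mat[1, ]
  out
}

# Made-up page text with assumed spacing between columns.
page_demo <- paste(
  "Test Name                    Result    Biological Ref. Int.    Unit",
  "",
  "      THYROID PROFILE,Serum",
  "TOTAL THYROXINE - T4         8.64      6.09 - 12.23            ug/dL",
  "  (Method : CLIA)",
  sep = "\n"
)

result <- parse_result_table(page_demo)
print(result)
```

Unlike the read.csv route, this keeps every column as character; type.convert() could be applied afterwards if numeric columns are wanted.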