
Split string to columns based on paragraph ending from ocr'd image

I'm working on a project to convert typewritten War Diary notes from PDF scans into text. I can successfully extract the main text, which I crop first (with maybe 90% accuracy on the original, non-resized file).

Reprex data: You could try this from the beginning with the image, or with the text I provide below.

My challenge is to maintain the "daily" structure of the text, which has 7 paragraphs or sections, one per day, and splitting by "\n" or "\n\n" isn't working exactly right.


I'm using a combination of pdftools/stringr/tesseract/magick for the project:



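library(magick)     # image_read(), image_crop(), ...; also provides %>%
library(tesseract)  # ocr()

image <- image_read("./test-data/page_1.png") # change to your path

text <- image %>%
  image_crop(geometry_area(width = 1220, height = 900,
                           y_off = 260, x_off = 355)) %>%
  image_resize("2000x") %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  ocr()  # final step was cut off in the post; ocr() is assumed from the string output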

gives a string:

[1] "Weather clear all day. A smaii arms inspection hela at i400 hrs. A recce party went\nlout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nlfor B Coy personnel by our YMCA Supervisor. o\nWeather clear and cold all day. Personnel packed equipment early in the morning and |-~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,.\nPW brought in by A Coy at 1900 hrs. PW was a deserter from 304 Regt 2 Pz division. | .\n© other activity during the day. Patrols were sent out during the night by all coys,.\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chatea\nVieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys. 7\nHeather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\natrols sent out byall coys. 7\neather fair all day. No enemy was seen during the day. A Coy sent out patrols during\ntthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\ning but the CO cancelled it. Two Polish deserters from the German army walked into\nCoy lines. 4\neather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\new location at 0830 hrs. Unit started to move to new location at 1200 hrs. Unit a\nrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets. | «\neather clear all day. 
Observation Post set up by the Intelligence Sec at MR 253813.) ,\nQuiet all day. No enemy activity during the day.\neather overcast and snowing. Intelligence Section set up another OP at MR 268814,\no enemy activity during thle day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan, 1945. ;\nWeather clear and cold. Bn started to move at 0830 hrs. Bn reached Champlon ‘\nFamenine, MR qibe at 1230 hrs. Bn relieved the HLI. Coys immediately took up\npositions or all around defence.\n"

Using stringr, this can be split at the line breaks, which only approximates the paragraph ends:

stringr::str_split(text, pattern = "\n")

 [1] "Weather clear all day. A smaii arms inspection hela at i400 hrs. A recce party went"    
 [2] "lout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group"   
 [3] "held in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown"    
 [4] "lfor B Coy personnel by our YMCA Supervisor. o"                                         
 [5] "Weather clear and cold all day. Personnel packed equipment early in the morning and |-~"
 [6] "were ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR"   
 [7] "2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,." 
 [8] "PW brought in by A Coy at 1900 hrs. PW was a deserter from 304 Regt 2 Pz division. | ." 
 [9] "© other activity during the day. Patrols were sent out during the night by all coys,."  
[10] "Cold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chatea"     
[11] "Vieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at"       
[12] "\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP"    
[13] "set up at 1900 hrs at MR 207753. Patrols sent out by all Coys. 7"                       
[14] "Heather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR"        
[15] "204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce"         
[16] "atrols sent out byall coys. 7"                                                          
[17] "eather fair all day. No enemy was seen during the day. A Coy sent out patrols during"   
[18] "tthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders"    
[19] "Group held at 2000 hrs and orders were given to have all personnel ready to move to"    
[20] "new location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp"   
[21] "ing but the CO cancelled it. Two Polish deserters from the German army walked into"     
[22] "Coy lines. 4"                                                                           
[23] "eather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce"     
[24] "ew location at 0830 hrs. Unit started to move to new location at 1200 hrs. Unit a"      
[25] "rived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to"       
[26] "their areas and billets. | «"                                                           
[27] "eather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.) ," 
[28] "Quiet all day. No enemy activity during the day."                                       
[29] "eather overcast and snowing. Intelligence Section set up another OP at MR 268814,"      
[30] "o enemy activity during thle day. At 2300 hrs orders were received that all personnel"  
[31] "ere to be ready to move to new area on the morning of the 9th Jan, 1945. ;"             
[32] "Weather clear and cold. Bn started to move at 0830 hrs. Bn reached Champlon ‘"          
[33] "Famenine, MR qibe at 1230 hrs. Bn relieved the HLI. Coys immediately took up"           
[34] "positions or all around defence."                                                       
[35] ""          

Any ideas how I can improve this, for example by finding a specific pattern at the end of each paragraph?

Otherwise I may just export the split as it is and copy/paste manually from there in Word.

Thank you very much!!


  • One option is to set preserve_interword_spaces so that the large runs of whitespace between paragraphs are kept in the output. You can then use stringr to split on runs of five or more whitespace characters:

    text <- image %>%
      image_crop(geometry_area(width = 1220, height = 900,
                               y_off = 260, x_off = 355)) %>% 
      image_resize("2000x") %>%
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
    stringr::str_split(text, pattern = "\\s{5,}")

    This will roughly split out the majority of the paragraphs in the example:

    [1] "Weather clear all day. A smail arms inspection heia at 1400 hrs. A recce party went\nlout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nlfor B Coy personnel by our YMCA Supervisor."                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
    [2] "“\nWeather clear and cold all day. Personnel packed equipment early in the morning and |-~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,\nPw brought in by A Coy at 1800 hrs. PW was a deserter from 304 Regt 2 Pz division. | .\n© other activity during the day. Patrols were sent out during the night by all coys,.\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chatea\nVieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n(Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys."
    [3] "i\nHeather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge IR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\npatrols sent out byall coys."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
    [4] "‘\nWeather fair all day. No enemy was seen during the day. A Coy sent out patrols duri\nthe day and night but no opposition meat. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\ning but the CO cancelled it. Two Polish deserters from the German army walked into\nCoy lines."                                                                                                                                                                                                                                                                                                                           
    [5] "4\neather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\new location at 0830 hrs. Unit started to move to new location at 1200 hrs. Unit    4\nrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
    [6] ":\neather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.\nQuiet all day. No enemy activity during the day.\nleather overcast and snowing. Intelligence Section set up another OP at MR 268814.\n© enemy activity during thle day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan, 1945.\nather clear and cold. Bn started to move at 0830 hrs. Bn reached Champlon"                                                                                                                                                                                                                                                                                                        
    [7] "‘\nFPamenine, MR qis2 at 1230 hrs. Bn relieved the HLI. Coys immediately took up\npositions  or all around defence.\n"
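
    To get from these chunks to one row per day, a small clean-up step may help. The following is only a sketch under assumptions: the `text` value is a short made-up stand-in for the OCR output, the rule "drop a first line of at most three characters" is a guess at how the stray margin marks behave, and base R is used instead of stringr only so the snippet runs without extra packages.

    ```r
    # Hypothetical stand-in for the OCR output: three short "days" separated by
    # the wide whitespace runs that preserve_interword_spaces keeps.
    text <- paste0(
      "Weather clear all day. A recce party went\nlout consisting of Coy Comds.",
      "          o\neather clear and cold. Unit embussed\nat 0900 hrs.",
      "        | .\nCold all day. Very quiet all morning.\n"
    )

    # Split wherever five or more whitespace characters occur in a row
    # (same pattern as the str_split() call above).
    chunks <- strsplit(text, "\\s{5,}", perl = TRUE)[[1]]

    # Assumption: a first line of at most three characters is a stray margin
    # mark from the scan, so drop it together with its line break.
    chunks <- sub("^[^\n]{0,3}\n", "", chunks, perl = TRUE)

    # Join the wrapped lines of each paragraph into one string.
    chunks <- trimws(gsub("\\s*\n\\s*", " ", chunks, perl = TRUE))

    # One row per paragraph; `day` is just a running index, not a date.
    days <- data.frame(day = seq_along(chunks), text = chunks,
                       stringsAsFactors = FALSE)
    print(days)
    ```

    On the real page this will not be perfect: as the output above shows, some days are not separated by enough whitespace and end up in the same chunk, so those rows would still need to be split by hand or with a second rule.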