Tags: r, regex, gsub

R: Confused over \r character replacement using gsub


I have lots of line fragments like this:

...lorem ipsumMYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01lorem ipsum...

It comes from processing a large HTML file with rvest::html_text2(). Long story short: processing the file node by node with the xml2 parser is unwieldy and takes too long. If I strip the HTML instead, the remaining text has certain regularities that can be exploited; for example, I have already inserted the placeholders MYBREAK01 and MYLINEBREAK01. I get a bit over my head, though, when trying to get rid of the unneeded \r and \n (carriage returns and line feeds that may be interspersed with spaces, or at least with what appear to be spaces).

I tried to add a %>% gsub() step to the processing chain to get rid of these characters, but it does not match and I do not quite know what I am doing wrong:

gsub("(MYLINEBREAK01)(\r|\r\n| |\n)+([a-zA-Z ()]+)(\r|\r\n| \n)+(MYBREAK01)","\\1\\3\\5",.)

but it does not appear to match what I want: the string fragment stays unchanged. Also, the (LT) part does not always appear in the field. My aim is, of course, to end up with the string MYLINEBREAK01SURNAME, Name (LT)MYBREAK01, without the (LT) when it is not there.

I understand that, as SamR and margusl have suggested, I should use PCRE, and that IS very helpful. However, I also have the problem that the string is interspersed with other text, and I need to restrict these changes to the area between the markers MYLINEBREAK01 and MYBREAK01. That is why I had those capture groups. Am I going in the right direction with them, or am I missing something obvious here?

Many thanks!

P.S. The reason I am using rvest::html_text2 is that I am preparing data to be loaded into an Access database I made; it has some neat search and filter features that help in text analysis. For one of the fields I am trying to preserve line breaks. In prepping the text I put in three BREAK and one LINEBREAK placeholders, then replace the actual remaining line breaks with yet another placeholder, 0mylinebreak0, read this into a data frame, replace 0mylinebreak0 with \n using stringr, and then save to a CSV, which Access reads happily. I end up with a database of some 5k records that I can use further. Sounds cumbersome, but it beats going through the data manually.
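For what it's worth, the round trip described in the P.S. can be sketched like this (the sample text is illustrative, not taken from the actual pipeline):

```r
library(stringr)

raw <- "first line\nsecond line\nthird line"

# before building the data frame: protect the real line breaks
protected <- str_replace_all(raw, fixed("\n"), "0mylinebreak0")

# ... read into a data frame, process, etc. ...

# before writing the CSV for Access: restore the line breaks
restored <- str_replace_all(protected, fixed("0mylinebreak0"), "\n")
identical(raw, restored)
#> [1] TRUE
```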


Solution

  • update to address question update:

    str_replace_all() with lookbehind/lookahead in the regex and a function for the replacement.

    library(stringr)
    
    s_ <- "...lorem \n ipsumMYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01lorem \n ipsum..."
    s_ <- str_c(s_, s_)
    str_view(s_)
    #> [1] │ ...lorem 
    #>     │  ipsumMYLINEBREAK01{\r} {\r} SURNAME, Name (LT){\r} {\r}
    #>     │ MYBREAK01lorem 
    #>     │  ipsum......lorem 
    #>     │  ipsumMYLINEBREAK01{\r} {\r} SURNAME, Name (LT){\r} {\r}
    #>     │ MYBREAK01lorem 
    #>     │  ipsum...
    
    # lazily match any run of characters (whitespace or not) between the
    # positive lookbehind and lookahead patterns; each match is first
    # passed through str_squish() and the result is used as the replacement
    s_2 <- str_replace_all(s_, "(?<=MYLINEBREAK01)[\\s\\S]*?(?=MYBREAK01)", str_squish)
    str_view(s_2)
    #> [1] │ ...lorem 
    #>     │  ipsumMYLINEBREAK01SURNAME, Name (LT)MYBREAK01lorem 
    #>     │  ipsum......lorem 
    #>     │  ipsumMYLINEBREAK01SURNAME, Name (LT)MYBREAK01lorem 
    #>     │  ipsum...
    
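To answer the capture-group question directly: yes, that direction can also work in base R. A minimal gsub() sketch with perl = TRUE (it assumes the text between the markers contains no literal \n, since . does not match a newline by default, and unlike str_squish() it leaves whitespace runs inside the name untouched):

```r
s_ <- "...lorem ipsumMYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01lorem ipsum..."

# the greedy \s* on each side swallows the \r/space runs at the edges,
# while the lazy .*? keeps the name (with or without "(LT)") intact
gsub("(MYLINEBREAK01)\\s*(.*?)\\s*(MYBREAK01)", "\\1\\2\\3", s_, perl = TRUE)
#> [1] "...lorem ipsumMYLINEBREAK01SURNAME, Name (LT)MYBREAK01lorem ipsum..."
```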

    initial answer:

    Sounds like a job for stringr::str_squish(), which is basically a fancy name for gsub("\\s+", " ", s_) |> trimws().

    s_ <- "MYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01"
    stringr::str_squish(s_)
    #> [1] "MYLINEBREAK01 SURNAME, Name (LT) MYBREAK01"
    
    # which is basically 
    gsub("\\s+", " ", s_) |> trimws()
    #> [1] "MYLINEBREAK01 SURNAME, Name (LT) MYBREAK01"
    

    rvest note: you might be after html_text() (sans -2) here, and approaching this problem through rvest with a combination of CSS selectors and/or XPath (plus perhaps some xml2 tricks for corner cases) would probably simplify your task.
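As a sketch of what that could look like (the markup and the CSS selector are assumptions, not taken from the actual page):

```r
library(rvest)

# hypothetical markup standing in for the real page
page <- minimal_html("<div class='entry'><p>SURNAME, Name (LT)</p></div>")

# html_text() does not emit the \r/\n layout characters that
# html_text2() adds, so there is nothing left to squish
page |>
  html_elements("div.entry p") |>
  html_text()
#> [1] "SURNAME, Name (LT)"
```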