I have lots of line fragments like this:
...lorem ipsumMYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01lorem ipsum...
It comes from processing a large HTML file with rvest::html_text2(). Long story short: processing the file node by node with the xml2 parser is too slow to be practical. If I strip the HTML down to text, the text has certain regularities that can be exploited; for example, I have already inserted the placeholders MYBREAK01 and MYLINEBREAK01. I get a bit over my head, though, when trying to get rid of the unneeded \r and \n characters (carriage returns and line feeds that may be interspersed with spaces, or at least they appear to be spaces). I tried to add a %>% gsub() step to the processing chain to get rid of these characters, but I have problems matching and I do not quite know what I am doing wrong:
gsub("(MYLINEBREAK01)(\r|\r\n| |\n)+([a-zA-Z ()]+)(\r|\r\n| \n)+(MYBREAK01)","\\1\\3\\5",.)
but it does not appear to match what I want; the string fragment stays unchanged. Also, the (LT) part does not always appear in the field. My aim is to get the string MYLINEBREAK01SURNAME, Name (LT)MYBREAK01 (without (LT) when it is not there, of course).
I understand from SamR's and margusl's suggestions that I should use PCRE, and that IS very helpful. However, I also have the problem that the string is interspersed within other text, and I need to restrict these changes to the area between the markers MYLINEBREAK01 and MYBREAK01. That is why I had those capture groups. Am I going in the right direction with them, or am I missing something obvious here?
Many thanks!
P.S. The reason I am using rvest::html_text2() is that I am preparing data to be loaded into an Access database I made; it has some neat search and filter features that help with text analysis. So for one of the fields I am trying to preserve line breaks. In prepping the text I put in three BREAK placeholders and one LINEBREAK placeholder, then replace the actual remaining line breaks with yet another placeholder, 0mylinebreak0, read this into a data frame, replace 0mylinebreak0 with \n using stringr, and then save to a CSV, which Access reads happily. The result is a database of some 5k records that I can use further. Sounds cumbersome, but it beats going through the data manually.
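Roughly, the last steps of that prep chain look like this (the column and file names here are just placeholders, not my real ones):

```r
library(stringr)

# toy data frame standing in for the real one; placeholder names as described above
df <- data.frame(field = "line oneMYLINEBREAK01line two0mylinebreak0line three")

# restore real line breaks from the 0mylinebreak0 placeholder
df$field <- str_replace_all(df$field, fixed("0mylinebreak0"), "\n")

# save as CSV for Access to read
write.csv(df, "records.csv", row.names = FALSE)
```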
str_replace_all() with lookbehind/lookahead in the regex and a function as the replacement.
library(stringr)
s_ <- "...lorem \n ipsumMYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01lorem \n ipsum..."
s_ <- str_c(s_, s_)
str_view(s_)
#> [1] │ ...lorem
#> │ ipsumMYLINEBREAK01{\r} {\r} SURNAME, Name (LT){\r} {\r}
#> │ MYBREAK01lorem
#> │ ipsum......lorem
#> │ ipsumMYLINEBREAK01{\r} {\r} SURNAME, Name (LT){\r} {\r}
#> │ MYBREAK01lorem
#> │ ipsum...
# lazily match the minimum run of whitespace and non-whitespace characters between
# the positive lookbehind and lookahead patterns;
# each match is first passed through str_squish() and the result is used as the replacement
s_2 <- str_replace_all(s_, "(?<=MYLINEBREAK01)[\\s\\S]*?(?=MYBREAK01)", str_squish)
str_view(s_2)
#> [1] │ ...lorem
#> │ ipsumMYLINEBREAK01SURNAME, Name (LT)MYBREAK01lorem
#> │ ipsum......lorem
#> │ ipsumMYLINEBREAK01SURNAME, Name (LT)MYBREAK01lorem
#> │ ipsum...
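If you'd rather stay in base R (no stringr), the same lookaround idea works with gregexpr()/regmatches<- and perl = TRUE; this is just a sketch of the equivalent:

```r
s_ <- "...lorem \n ipsumMYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01lorem \n ipsum..."

# find the lazy runs between the markers, then squish each match in place
m <- gregexpr("(?<=MYLINEBREAK01)[\\s\\S]*?(?=MYBREAK01)", s_, perl = TRUE)
regmatches(s_, m) <- lapply(regmatches(s_, m),
                            function(x) trimws(gsub("\\s+", " ", x)))
s_
#> [1] "...lorem \n ipsumMYLINEBREAK01SURNAME, Name (LT)MYBREAK01lorem \n ipsum..."
```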
Sounds like a job for stringr::str_squish(), which is basically a fancy name for gsub("\\s+", " ", s_) |> trimws().
s_ <- "MYLINEBREAK01\r \r SURNAME, Name (LT)\r \r\nMYBREAK01"
stringr::str_squish(s_)
#> [1] "MYLINEBREAK01 SURNAME, Name (LT) MYBREAK01"
# which is basically
gsub("\\s+", " ", s_) |> trimws()
#> [1] "MYLINEBREAK01 SURNAME, Name (LT) MYBREAK01"
rvest note: you might be after html_text() (sans-2) here, and approaching this problem through rvest with a combination of CSS selectors and/or XPath (and perhaps some xml2 tricks for corner cases) would probably simplify your task.
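For illustration, if the names sit in a dedicated element, a selector could pull them out directly; the span.person class below is made up, and yours will depend on the actual markup:

```r
library(rvest)

# minimal stand-in document; the class name "person" is hypothetical
html <- minimal_html(
  '<p>lorem ipsum <span class="person">SURNAME, Name (LT)</span> lorem ipsum</p>'
)

html |>
  html_elements("span.person") |>
  html_text()
#> [1] "SURNAME, Name (LT)"
```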