I have a string that looks like this:
t2 <- "============================================
                       Model 1    Model 2
--------------------------------------------
education                3.66 ***   2.80 ***
                        (0.65)     (0.59)
income                   1.04 ***   0.85 ***
                        (0.26)     (0.23)
type: blue collar       -5.91     -27.55 ***
                        (3.94)     (5.41)
type: white collar      -8.82 **  -24.12 ***
                        (2.79)     (5.35)
income x blue collar                3.01 ***
                                   (0.58)
income x white collar               1.91 *
                                   (0.81)
prop. female             0.01       0.08 *
                        (0.03)     (0.03)
--------------------------------------------
R^2                      0.83       0.87
Adj. R^2                 0.83       0.86
Num. obs.               98         98
============================================
*** p < 0.001, ** p < 0.01, * p < 0.05"
and I'm trying to extract the left hand column so that I get a vector that looks like this:
education
income
type: blue collar
type: white collar
income x blue collar
income x white collar
prop. female
I'm new to regex and stringr, and I'm trying to extract the words that follow a line break:
library(stringr)
covariates <- str_extract_all(t2, "\n\\w+")
covariates
which is getting me a bit closer:
[1] "\neducation" "\nincome" "\ntype" "\ntype" "\nincome" "\nincome" "\nprop" "\nR"
[9] "\nAdj" "\nNum"
but I can't work out how to capture the entire column of text, e.g. getting the full "type: blue collar" instead of "\ntype".
You may use
covariates <- str_extract_all(
  str_match(t2, "(?ms)^-{3,}\n(.*?)\n-{3,}$")[,2],
  "(?m)^\\S.*?(?=\\h{2})"
)
Or, to make matching more efficient, use these unrolled patterns, which avoid the lazy .*? quantifier:
covariates <- str_extract_all(
  str_match(t2, "(?m)^-{3,}\n(.*(?:\n(?!-{3,}$).*)*)\n-{3,}$")[,2],
  "(?m)^\\S\\H*(?:\\h(?!\\h)\\H*)*"
)
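As a sanity check, the two variants can be compared on a small input. The sketch below uses a hypothetical miniature table (demo, assembled with sprintf() so the column padding is explicit) instead of t2, and assumes stringr is installed. Note that str_extract_all() returns a list with one element per input string, so [[1]] is used here to get a plain character vector:

```r
library(stringr)

# Hypothetical miniature table (not the original t2); sprintf("%-14s", x)
# left-aligns x and pads it with spaces to a width of 14 characters.
rows <- c(
  "======================",
  sprintf("%-14s%s", "", "Model 1"),
  "----------------------",
  sprintf("%-14s%s", "education", "3.66 ***"),
  sprintf("%-14s%s", "", "(0.65)"),
  sprintf("%-14s%s", "type: blue", "-5.91"),
  sprintf("%-14s%s", "", "(3.94)"),
  "----------------------",
  sprintf("%-14s%s", "R^2", "0.83"),
  "======================"
)
demo <- paste(rows, collapse = "\n")

# Step 1: text between the two hyphen-only lines, lazy vs. unrolled pattern
body_lazy     <- str_match(demo, "(?ms)^-{3,}\n(.*?)\n-{3,}$")[, 2]
body_unrolled <- str_match(demo, "(?m)^-{3,}\n(.*(?:\n(?!-{3,}$).*)*)\n-{3,}$")[, 2]

# Step 2: row labels, lazy vs. unrolled pattern; [[1]] unwraps the list
names_lazy     <- str_extract_all(body_lazy, "(?m)^\\S.*?(?=\\h{2})")[[1]]
names_unrolled <- str_extract_all(body_unrolled, "(?m)^\\S\\H*(?:\\h(?!\\h)\\H*)*")[[1]]

identical(body_lazy, body_unrolled)
identical(names_lazy, names_unrolled)
```

Both comparisons should come back TRUE for this input. Any speed difference is only likely to matter on much longer strings.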
With str_match(t2, "(?ms)^-{3,}\n(.*?)\n-{3,}$")[,2], you extract all the text between two lines that consist of 3 or more hyphens. Pattern details:
(?ms) - multiline (making ^ match the start of a line and $ match the end of a line) and singleline/dotall (making . match line breaks, too) modes on
^ - start of a line
-{3,} - three or more hyphens
\n - a newline
(.*?) - Group 1: any 0+ chars, as few as possible
\n - a newline
-{3,} - three or more hyphens
$ - end of a line.
The (?m)^\\S.*?(?=\\h{2}) pattern is then applied to that part of the string and matches:
(?m) - multiline mode on
^ - start of a line
\\S - a non-whitespace char
.*? - any 0+ chars other than line break chars, as few as possible
(?=\\h{2}) - immediately to the right of the current location, there must be 2 horizontal whitespace chars.
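To see what each step contributes, here is a minimal sketch that runs the two patterns one after the other and prints the intermediate result. It uses a hypothetical miniature table (demo) rather than t2, and assumes stringr is installed:

```r
library(stringr)

# Hypothetical miniature table (not the original t2); sprintf("%-14s", x)
# left-aligns x and pads it with spaces to a width of 14 characters.
rows <- c(
  "======================",
  sprintf("%-14s%s", "", "Model 1"),
  "----------------------",
  sprintf("%-14s%s", "education", "3.66 ***"),
  sprintf("%-14s%s", "", "(0.65)"),
  sprintf("%-14s%s", "type: blue", "-5.91"),
  sprintf("%-14s%s", "", "(3.94)"),
  "----------------------",
  sprintf("%-14s%s", "R^2", "0.83"),
  "======================"
)
demo <- paste(rows, collapse = "\n")

# Step 1: capture group 1 holds only the coefficient block, i.e. the text
# between the first pair of hyphen-only lines.
body <- str_match(demo, "(?ms)^-{3,}\n(.*?)\n-{3,}$")[, 2]
writeLines(body)

# Step 2: on each line that starts with a non-whitespace char, match up to
# (but not including) the first run of two horizontal whitespace chars.
# Indented standard-error lines are skipped because they start with a space.
covariates <- str_extract_all(body, "(?m)^\\S.*?(?=\\h{2})")[[1]]
covariates
# "education"  "type: blue"
```

The lookahead keeps labels with single internal spaces, such as "type: blue", intact, because only a run of two or more spaces ends the match. This relies on every label being followed by at least two spaces of padding.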