I am trying to match a pattern that spans multiple lines, two lines to be exact. I am using this approach because the pattern on either line alone is not unique.
So far I have tried the function "grep", but I think I am missing the correct regular expression.
grep("^Item\\s{0,}2[^A]", f.text, ignore.case = TRUE)
This part is a modified version of the edgar package function "getFilings" and tries to extract only the Management's Comment/Item 2 for quarterly results. If possible, I would add something after ... 2[^A] in the pattern that reacts to the line break and then to the string "Management...".
The pattern in my plain-text files looks like this:
Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations
I would appreciate any comment on how to capture this best in a regular expression with R.
Example Input looks like this:
21
Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations
This section and other parts of this Quarterly Report on Form 10
Item 3.
Quantitative and Qualitative Disclosures About Market Risk
There have been no material changes to the Company market risk
and the desired output would be
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10
I need to match "Item 2. ... Management Discussion" because "Item 2" on its own is not unique. How can I formulate a regular expression that spans two lines?
This is not very sophisticated, since I'm no expert in string manipulation, but the tidyverse package (specifically stringr) provides some powerful tools to get your desired output.
library(tidyverse)  # attaches stringr and the %>% pipe
text <- "21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk Item 4.
Fluffy Text example Item 5.
Lorem ipsum dolor sit amet, consectetur adipisici elit"
Now
text %>%
str_extract_all("(?<=Item\\s\\d[[:punct:]]\\n).*", simplify = TRUE) %>%
str_remove("\\s+Item\\s\\d[[:punct:]]")
gives you
[1] "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10"
[2] "Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"
[3] "Fluffy Text example"
[4] "Lorem ipsum dolor sit amet, consectetur adipisici elit"
If you just want to extract Item 2, replace the \\d inside str_extract_all with 2.
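If the text is held as a character vector with one element per line (as readLines would return, and as f.text in the question seems to be), the two-line condition can also be checked without a multi-line regex. Below is a minimal base R sketch under that assumption. The names f_lines and extract_item2 are made up for illustration and are not part of the edgar package, and the rule that a section ends at the next "Item <digit>" line is also an assumption.

```r
# Hypothetical stand-in for f.text: one element per line of the filing.
f_lines <- c(
  "21",
  "Item 2.",
  "Management Discussion and Analysis of Financial Condition and Results of Operations",
  "This section and other parts of this Quarterly Report on Form 10",
  "Item 3.",
  "Quantitative and Qualitative Disclosures About Market Risk",
  "There have been no material changes to the Company market risk"
)

# Illustrative helper (not from any package): returns the Item 2 section
# as a single string, or NA if no matching two-line header is found.
extract_item2 <- function(lines) {
  n <- length(lines)
  is_item2 <- grepl("^Item\\s*2[^A]", lines, ignore.case = TRUE)
  is_mgmt  <- grepl("^\\s*Management", lines, ignore.case = TRUE)

  # An "Item 2" line counts only if the line right after it starts
  # with "Management"; shifting is_mgmt by one aligns the two tests.
  start <- which(is_item2 & c(is_mgmt[-1], FALSE))
  if (length(start) == 0) return(NA_character_)
  start <- start[1]

  # Assumed section end: the line before the next "Item <digit>" header,
  # or the last line if no further header exists.
  is_header <- grepl("^Item\\s*\\d", lines, ignore.case = TRUE)
  later <- which(is_header & seq_len(n) > start)
  end <- if (length(later) > 0) later[1] - 1 else n

  paste(lines[(start + 1):end], collapse = " ")
}

result <- extract_item2(f_lines)
print(result)
```

On the example input from the question, this should return the "Management Discussion ... Form 10" string given there as the desired output. Testing the two lines separately keeps each regular expression simple, at the cost of assuming the header and the "Management" line are always directly adjacent.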