
Matching *consecutive* lines that begin with an arbitrary amount of whitespace followed by a character


I am trying to match consecutive lines that start with an arbitrary amount of whitespace followed by the character |. I am using the s flag so that . matches newlines.

What I have so far works only for a fixed maximum amount of whitespace before |.

I am having trouble with the part that detects the first line that does not meet the requirements. For some reason \n\s*[^\|] does not do the trick. What I am doing right now is the following:

(?P<terminating>
    \n(             # when newline is encountered...
        [^\|\s]         #   check if next character is not: (| or space)
        |
        [\s][^\|\s]     #   check if next characters are not: space + (| or space)
        |
        [\s][\s][^\|\s] #   check if next characters are not: space + space + (| or space)... And so on....
    )
    |
    $
)

This obviously only works for up to two spaces. I would like it to work for an arbitrary number of spaces. I looked into recursion, but that seems like quite a heavy gun to wield in this case. My question: why does \n\s*[^\|] not work, and is there a way of solving this without recursion?


Below is an example of input and the resulting match I would like to get:

Input string:

Lorem ipsum dolor sit amet, 
consectetur adipisicing 
elit, 
|sed do 
        |eiusmod tempor incididunt 
     |ut labore et dolore magna aliqua.
Ut enim ad minim veniam, 
quis nostrud exercitation 
ullamco laboris nisi ut 
aliquip ex ea commodo consequat.

Output is one string with content:

|sed do\n        |eiusmod tempor incididunt\n     |ut labore et dolore magna aliqua.

I don't want three separate matches, one for each line that has | in it.


Solution

  • I solved it myself. It seems I have to add whitespace to the negated character class, so that it excludes both | and whitespace:

    \n\s*[^\|\s]
    

    I am not quite sure why this works, though; I stumbled upon it by sheer accident. I would be grateful if someone could explain the reasoning behind it.

    The full expression now is as follows:

    '/
        (?:
            (^|\n)\s*\|
        )
        (?P<main>
            .*?
        )
        (?=
            \n\s*[^\|\s]
            |
            $
        )
    /sx'
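
The extra \s matters because of backtracking. In \n\s*[^\|], the greedy \s* first consumes all the leading whitespace, and [^\|] then fails on the |. The engine backs off by one character, and [^\|] happily matches the last space, since a space is "not a pipe". So the pattern also matches in front of indented | lines. With [^\|\s] that escape route is closed: the character after the whitespace run must be neither | nor whitespace. Below is a minimal sketch that checks this and runs the full expression against the sample input. It uses Python's re module rather than the PCRE delimiters above, which is an assumption made for illustration; the names TEXT, LOOSE, STRICT, FULL and probe are hypothetical, and the trailing spaces of the sample lines are omitted.

```python
import re

# Sample input from the question (trailing spaces omitted).
TEXT = (
    "Lorem ipsum dolor sit amet,\n"
    "consectetur adipisicing\n"
    "elit,\n"
    "|sed do\n"
    "        |eiusmod tempor incididunt\n"
    "     |ut labore et dolore magna aliqua.\n"
    "Ut enim ad minim veniam,\n"
    "quis nostrud exercitation\n"
    "ullamco laboris nisi ut\n"
    "aliquip ex ea commodo consequat."
)

# LOOSE: \s* may give back one space, which [^|] then accepts.
LOOSE = re.compile(r"\n\s*[^|]")
# STRICT: the character after the whitespace run may be neither | nor whitespace.
STRICT = re.compile(r"\n\s*[^|\s]")

probe = "\n   |indented pipe line"
loose_hit = LOOSE.match(probe)    # matches: newline plus the three spaces
strict_hit = STRICT.match(probe)  # None: no way to backtrack into a match

# The full expression from the solution, with the s and x flags.
FULL = re.compile(
    r"""
    (?:
        (^|\n)\s*\|          # start of string or newline, indentation, first |
    )
    (?P<main>
        .*?                  # lazily take everything ...
    )
    (?=
        \n\s*[^\|\s]         # ... up to a line that does not start with \s*|
        |
        $                    # ... or up to the end of the input
    )
    """,
    re.S | re.X,
)

m = FULL.search(TEXT)
# The named group starts after the first |, so it is prepended here.
block = "|" + m.group("main")
print(repr(block))
```

Note that the main group begins after the first |, so the leading pipe has to be prepended (or the \| moved inside the group) to get exactly the output asked for in the question.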