rust markdown nom

Rust nom: take_until with a parser instead of a pattern


I'm using the latest (v7) nom crate.

I'm trying to build a parser capable of extracting code blocks from markdown. In the flavor of markdown I need to support, a code block only ends when there are three grave/backtick characters on a line by themselves, optionally followed by whitespace.

Here is an example, where I replace backticks with single quotes (') to make editing with the StackOverflow markdown sane:

'''python
print("""
'''")
// this is all still a code block
'''

The obvious solution is to just use take_until("'''"). However, that ends the take early, because it simply searches for the first occurrence of ''', which is not necessarily the real end of the block. I need the termination condition to be tuple((tag(code_end), space0, newline)).

The next obvious solution is to use regular expressions as the pattern in take_until, but I would prefer to avoid that. Is there any prebuilt parser (in nom or in another crate) that will take everything until a given parser returns Ok?

use nom::IResult;
use nom::combinator::opt;
use nom::sequence::{terminated, tuple};
use nom::bytes::complete::{tag, take_until};
use nom::character::complete::{newline, space0, alpha1};

fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let (input, contents) = terminated(take_until("'''"), terminator)(input)?;
    Ok((input, contents))
}

fn main() {
    let test = &b"'''python
print(\"\"\"
'''\"\"\"
// this is all still a code block
'''
";
    assert!(code(&test[..]).is_ok());
}

The above assertion will fail. However, if you remove the line inside the block that starts with three single quotes ('''), it will pass. This is because take_until("'''") stops at the first ''' it finds, while terminator only matches ''' followed by optional whitespace and a newline. What is my best pattern for solving this problem?

Thanks for any help. I have a feeling I'm missing something obvious or just doing something wrong. Let me know if anything isn't clear.

Here is a link to the above example in the Rust Playground for convenience: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=d5459edded1e4258ba3e034658ea4acf


Solution

  • I think the proper combinator would be many_till:

    Applies the parser f until the parser g produces a result.

    Combined with anychar, it will give you a Vec<char> holding your code block (alongside the output of the terminating parser).

    I think there is no anybyte in nom, but you can easily write it yourself if you prefer to get Vec<u8>.

    Or, if you want to avoid allocating and want a slice referencing the original input, and don't mind a bit of unsafe, you can ignore the consumed characters and compute the slice from the start and end pointers:

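    use nom::IResult;
    use nom::bytes::complete::tag;
    use nom::character::complete::{alpha1, anychar, newline, space0};
    use nom::combinator::{map, opt};
    use nom::multi::many_till;
    use nom::sequence::tuple;

    fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
        let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
        let terminator = tuple((tag("'''"), space0, newline));
        let start = input;
        let (input, (_, (end, _, _))) = many_till(map(anychar, drop), terminator)(input)?;
        // SAFETY: `end` is the part of `start` matched by tag, so both pointers lie in the same slice.
        let len = unsafe { end.as_ptr().offset_from(start.as_ptr()) as usize };
        Ok((input, &start[..len]))
    }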