
Rust Nom take_until with parser and not pattern

Using latest (v7) nom crate.

I'm trying to build a parser capable of extracting code blocks from markdown. In the flavor of markdown I need to support, a code block only ends when three grave/backtick characters appear on a line by themselves, optionally followed by whitespace.

Here is an example, where I replace backticks with single quotes (') to make editing with the StackOverflow markdown sane:

// this is all still a code block

The obvious solution is to use take_until("'''"). However, that ends the take early, because it simply searches for the first occurrence of ''', which is not the condition I want. I need the termination condition to be tuple((tag(code_end), space0, newline)).
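To make the early stop concrete, here is a minimal sketch using only the standard library. It mimics a plain first-occurrence search, which is how take_until behaves as described above. The function name first_fence_offset is made up for illustration and is not part of nom.

```rust
/// Offset of the first ''' anywhere in `input`, regardless of what
/// precedes or follows it (a plain substring search).
fn first_fence_offset(input: &[u8]) -> Option<usize> {
    input.windows(3).position(|w| w == b"'''")
}
```

For the input `print("'''")` followed by a newline and a real closing ''' line, this reports the quotes inside the string literal (offset 7) rather than the closing fence on the next line (offset 13).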

The next obvious solution is to use regular expressions as the pattern in take_until... but I would prefer to avoid that. Is there any prebuilt parser (or available in another crate) that will take all until a parser returns Ok?

use nom::IResult;
use nom::combinator::opt;
use nom::sequence::{terminated, tuple};
use nom::bytes::complete::{tag, take_until};
use nom::character::complete::{newline, space0, alpha1};

fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let (input, contents) = terminated(take_until("'''"), terminator)(input)?;
    Ok((input, contents))
}

fn main() {
    let test = &b"'''python
// this is all still a code block

The above assertion will fail. However, if you remove the line with the three (''') single quotes, it will pass. This is because of the difference between terminator and take_until("'''"). What is the best pattern for solving this problem?

Thanks for any help. I have a feeling I'm missing something obvious or just doing something wrong. Let me know if anything isn't clear.

Here is a link to the above example in the Rust Playground for convenience:


  • I think the proper combinator would be many_till:

    Applies the parser f until the parser g produces a result.

    That combined with anychar will return a Vec<char> for your code block.

    I think there is no anybyte in nom, but you can easily write it yourself if you prefer to get Vec<u8>.

    Or, if you want to avoid allocating and want a slice referencing the original input, and you don't mind a bit of unsafe, you can ignore the consumed characters and compute the slice from the start and end pointers (playground):

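    use nom::character::complete::anychar;
    use nom::combinator::map;
    use nom::multi::many_till;

    fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
        let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
        let terminator = tuple((tag("'''"), space0, newline));
        let start = input;
        // Discard each consumed char; keep only the terminator's output.
        let (input, (_, (end, _, _))) = many_till(map(anychar, drop), terminator)(input)?;
        // SAFETY: `end` is a subslice of `start`, so both pointers lie in the same allocation.
        let len = unsafe { end.as_ptr().offset_from(start.as_ptr()) as usize };
        Ok((input, &start[..len]))
    }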
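
If you would rather avoid the unsafe block as well, the end condition can also be checked by hand, one line at a time. This is only a minimal sketch using the standard library, not a nom combinator. The names is_closing_fence and take_until_fence_line are made up for illustration, and it returns an Option of (rest, contents) instead of nom's IResult. It also assumes the fence must start at the beginning of a line, which is stricter than the many_till version.

```rust
/// True if `line` (without its trailing '\n') is a closing fence:
/// ''' followed only by spaces or tabs.
fn is_closing_fence(line: &[u8]) -> bool {
    match line.strip_prefix(b"'''") {
        Some(rest) => rest.iter().all(|&b| b == b' ' || b == b'\t'),
        None => false,
    }
}

/// Splits `input` (everything after the opening fence line) into
/// (rest, contents). `contents` ends just before the closing fence line,
/// and `rest` starts just after that line's '\n'.
fn take_until_fence_line(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let mut offset = 0;
    // split_inclusive keeps the '\n' on each line, so the lengths add up
    // to exact byte offsets into `input`.
    for line in input.split_inclusive(|&b| b == b'\n') {
        // Like the nom terminator, require a newline after the fence.
        if let Some(body) = line.strip_suffix(b"\n") {
            if is_closing_fence(body) {
                let rest = &input[offset + line.len()..];
                return Some((rest, &input[..offset]));
            }
        }
        offset += line.len();
    }
    None
}
```

A line such as `''' not a fence` is treated as ordinary content, because something other than whitespace follows the quotes. If this needs to compose with other nom parsers, it could be wrapped in a function with the IResult signature.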