I'm using the latest (v7) nom crate.
I'm trying to build a parser capable of extracting code blocks from markdown. In the flavor of markdown I need to support, a code block ends only when three grave/backtick characters appear on a line by themselves, optionally followed by whitespace.
Here is an example, where I replace backticks with single quotes (') to make editing with the StackOverflow markdown sane:
'''python
print("""
'''")
// this is all still a code block
'''
The obvious solution is to just use take_until("'''"). However, that ends the take early, because it simply searches for the first occurrence of ''', which is not accurate. I need the termination condition to be tuple((tag(code_end), space0, newline)).
The next obvious solution is to use regular expressions as the pattern in take_until, but I would prefer to avoid that. Is there any prebuilt parser (in nom or available in another crate) that will take everything until a given parser returns Ok?
use nom::IResult;
use nom::combinator::opt;
use nom::sequence::{terminated, tuple};
use nom::bytes::complete::{tag, take_until};
use nom::character::complete::{newline, space0, alpha1};

fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    // Opening fence: ''' plus an optional language name, then a newline.
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let (input, contents) = terminated(take_until("'''"), terminator)(input)?;
    Ok((input, contents))
}

fn main() {
    let test = &b"'''python
print(\"\"\"
'''\"\"\"
// this is all still a code block
'''
";
    assert!(code(&test[..]).is_ok());
}
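The above assertion fails. However, if you remove the line inside the block that starts with three single quotes ('''), it passes. This is because of the difference between terminator and take_until("'''"): take_until stops at the first ''', and terminator then fails there because those quotes are not followed by optional whitespace and a newline. What is the best pattern for solving this problem?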
Thanks for any help. I have a feeling I'm missing something obvious or just doing something wrong. Let me know if anything isn't clear.
Here is a link to the above example in the Rust Playground for convenience: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=d5459edded1e4258ba3e034658ea4acf
I think the proper combinator would be many_till:
Applies the parser f until the parser g produces a result.
That, combined with anychar, will return a Vec<char> for your code block.
I think there is no anybyte in nom, but you can easily write it yourself if you prefer to get a Vec<u8>.
Or, if you want to avoid allocating and want a slice referencing the original input, and don't mind a bit of unsafe, you can ignore the consumed characters and compute the slice from the start and end pointers:
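use nom::IResult;
use nom::bytes::complete::tag;
use nom::character::complete::{alpha1, anychar, newline, space0};
use nom::combinator::{map, opt};
use nom::multi::many_till;
use nom::sequence::tuple;
fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let start = input;
    // `drop` discards each char, so the resulting Vec<()> never allocates.
    let (input, (_, (end, _, _))) = many_till(map(anychar, drop), terminator)(input)?;
    // SAFETY: `end` is a subslice of `start`, so both pointers share one allocation.
    let len = unsafe { end.as_ptr().offset_from(start.as_ptr()) as usize };
    Ok((input, &start[..len]))
}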