parsing, rust, nom

What's the proper way to parse text with tags using Rust nom?


I want to parse text that contains tags. For example, for the string aa<haha>test 1 2 3</haha> string 2, the result should be the text aa, the tag haha with content test 1 2 3, and the text string 2. My code is below. It works for now, but I'm sure it's not the best way to solve the problem. Could anyone help me find a better approach, or at least simplify the code? Thanks.

use nom::branch::alt;
use nom::bytes::complete::{tag, take_till, take_until};
use nom::IResult;
use nom::sequence::delimited;


#[derive(Debug)]
pub enum ElementType {
    Text(String),
    Tag(String, String),
}


fn parse_element_type_text(input: &str) -> IResult<&str, ElementType> {
    let (remaining, text) = take_till(|c: char| c == '<')(input)?;
    if text.is_empty() {
        Err(nom::Err::Error(nom::error::Error::new(input, nom::error::ErrorKind::Eof)))
    } else {
        Ok((remaining, ElementType::Text(text.to_string())))
    }
}


fn parse_element_type_tag(input: &str) -> IResult<&str, ElementType> {
    let (left, tag_name) = delimited(
        tag("<"),
        take_until(">"),
        tag(">"),
    )(input)?;
    let (left, content) = take_until("</")(left)?;
    let (left, _) = delimited(
        tag("</"),
        tag(tag_name),
        tag(">"),
    )(left)?;

    Ok((left, ElementType::Tag(tag_name.to_string(), content.to_string())))
}


fn parse_element(input: &str) -> IResult<&str, Vec<ElementType>> {
    let mut elements = vec![];
    let mut input = input;

    loop {
        let original_input = input;
        match alt((parse_element_type_tag, parse_element_type_text))(input) {
            Ok((remaining_input, element)) => {
                elements.push(element);
                input = remaining_input;
            }
            Err(_) => break,
        }
        if original_input == input {
            break;
        }
    }

    Ok((input, elements))
}


fn main() {
    let text = r#"<foo>some more text</foo> even more text!<tag2>test haha</tag2>"#;
    let result = parse_element(text);
    println!("{:?}", result);
}

Solution

  • As a side note, you should always include a link to the playground with questions like this. In case you don't know it, the Rust playground can be found here; it lets you run code in the browser and share it with others. You can share a specific snippet via the share button in the top right and then copying the permalink.

    The main problem with your approach is that it doesn't deal with nested tag structures. Tags may contain other sub-tags, correct? If so, you run into the problem with take_until("</") that is mentioned by cafce25: it treats any nested opening tags as plain content and stops at the first closing tag it finds:

    fn main() {
        let text = r#"<foo><bar>hello</bar></foo>"#;
        let result = parse_element(text);
        println!("{:?}", result);
    }
    

    Playground

    The above snippet returns Ok(("<foo><bar>hello</bar></foo>", [])) because the first closing tag found, "</bar>", does not match the opening tag "<foo>".
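
    To see why, it can help to mimic the search step with plain std code. The sketch below assumes that take_until("</") behaves roughly like str::find, i.e. it stops at the first "</" it encounters. split_at_first_close is a hypothetical helper written for illustration and is not part of nom:

```rust
// Hypothetical helper (not part of nom): splits `input` at the first
// occurrence of "</", which is roughly what `take_until("</")` does.
// Returns None when no "</" is present.
fn split_at_first_close(input: &str) -> Option<(&str, &str)> {
    let idx = input.find("</")?;
    Some((&input[..idx], &input[idx..]))
}

fn main() {
    // Input that is left once the opening "<foo>" has been consumed.
    let after_open = "<bar>hello</bar></foo>";

    match split_at_first_close(after_open) {
        Some((content, rest)) => {
            // The nested opening tag ends up inside the "content".
            println!("content = {:?}", content);
            // The parser now expects "</foo>" at the start of `rest`.
            println!("rest    = {:?}", rest);
            println!("rest starts with </foo>: {}", rest.starts_with("</foo>"));
        }
        None => println!("no closing tag found"),
    }
}
```

    Here the content comes out as "<bar>hello" and the remaining input starts with "</bar>", so the subsequent tag("foo") check cannot succeed.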

    The problem runs deeper than just the use of take_until, however. In order to support nested tags, your enum must actually be recursively defined:

    #[derive(Debug)]
    pub enum ElementType {
        Text(String),
        Tag(String, Vec<ElementType>),
    }
    

    (Note that the compiler allows this because, although ElementType contains itself recursively, it does so behind a Vec<ElementType>, which acts the same way a Box<ElementType> would: the inner ElementType values are stored on the heap, and the outer value merely holds a pointer to its child ElementTypes.)

    This is because tags may contain any number of sub-tags. Thus, parsing must be done recursively as well:

    fn parse_element_type_tag(input: &str) -> IResult<&str, ElementType> {
        let (left, tag_name) = delimited(
            tag("<"),
            take_until(">"),
            tag(">"),
        )(input)?;
        let (left, content) = parse_element(left)?;
        let (left, _) = delimited(
            tag("</"),
            tag(tag_name),
            tag(">"),
        )(left)?;
    
        Ok((left, ElementType::Tag(tag_name.to_string(), content)))
    }
    

    Playground

    The downside to using Vec is that the resulting tree becomes more complex to traverse, as the printed value of the example with nested tags shows:

    Ok(("", [Tag("foo", [Tag("bar", [Text("hello")])])]))

    This could be remedied by adding enum variants for tags that contain nothing and for tags that contain only a single element. However, whether that helps depends heavily on your use case, and it may make things even more complex.
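
    To give a rough idea of what traversing the Vec-based tree involves, here is a minimal sketch that uses only std. The helpers text_content, depth and sample_tree are hypothetical names made up for this illustration. The tree is built by hand rather than parsed, so nom is not needed to run it:

```rust
#[derive(Debug)]
pub enum ElementType {
    Text(String),
    Tag(String, Vec<ElementType>),
}

// Hypothetical helper: concatenates all text in `element` and its
// descendants, visiting children depth-first in document order.
fn text_content(element: &ElementType) -> String {
    match element {
        ElementType::Text(text) => text.clone(),
        ElementType::Tag(_name, children) => children.iter().map(text_content).collect(),
    }
}

// Hypothetical helper: how deeply tags are nested. A Text node counts
// as 0, a tag counts as 1 plus the deepest of its children.
fn depth(element: &ElementType) -> usize {
    match element {
        ElementType::Text(_) => 0,
        ElementType::Tag(_name, children) => {
            1 + children.iter().map(depth).max().unwrap_or(0)
        }
    }
}

// Hand-built equivalent of parsing "<foo><bar>hello</bar></foo>".
fn sample_tree() -> ElementType {
    ElementType::Tag(
        "foo".to_string(),
        vec![ElementType::Tag(
            "bar".to_string(),
            vec![ElementType::Text("hello".to_string())],
        )],
    )
}

fn main() {
    let tree = sample_tree();
    println!("{:?}", tree);
    println!("text  = {:?}", text_content(&tree));
    println!("depth = {}", depth(&tree));
}
```

    Every traversal ends up as a match on the two variants plus a loop or iterator over the children, which is the extra complexity mentioned above.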