Tags: parsing, rust, parser-combinators, nom

How can I match an exact tag using the nom library in Rust?


I'm working on a tiny duration parsing library written in Rust, using the nom library. In this library, I define a second parser combinator function (shown below), whose responsibility is to parse the various acceptable textual representations of seconds.

    pub fn duration(input: &str) -> IResult<&str, std::time::Duration> {
       // Some code combining the various time format combinators
       // to match the format "10 days, 8 hours, 7 minutes and 6 seconds"  
    }

    pub fn seconds(input: &str) -> IResult<&str, u64> {
        terminated(unsigned_integer_64, preceded(multispace0, second))(input)
    }

    fn second(input: &str) -> IResult<&str, &str> {
        alt((
            tag("seconds"),
            tag("second"),
            tag("secs"),
            tag("sec"),
            tag("s"),
        ))(input)
    }

So far, the tag combinator has behaved as I expected. However, I recently discovered that the following assertion fails, and that this is the documented behavior:

    assert!(second("se").is_err());

Indeed, the documentation states that "The input data will be compared to the tag combinator’s argument and will return the part of the input that matches the argument".

However, as my example hopefully illustrates, what I would like is some flavor of tag that fails if the whole input cannot be parsed. I looked into explicitly checking whether there is a rest after parsing the input, and found that it would work. I also explored, unsuccessfully, using some flavors of the complete and take combinators to achieve that.

What would be an idiomatic way to parse an "exact match" of a word, and fail on a partial result (that would return a rest)?


Solution

  • You can use the all_consuming combinator, which succeeds only if the whole input has been consumed by its child parser:

    // nom 6.1.2
    use nom::branch::alt;
    use nom::bytes::complete::tag;
    use nom::combinator::all_consuming;
    use nom::IResult;
    
    fn main() {
        assert!(second("se").is_err());
    }
    
    fn second(input: &str) -> IResult<&str, &str> {
        all_consuming(alt((
            tag("seconds"),
            tag("second"),
            tag("secs"),
            tag("sec"),
            tag("s"),
        )))(input)
    }
    

    Update

    I think I misunderstood your original question. Maybe this is closer to what you need. The key is that you should write smaller parsers, and then combine them:

    use nom::branch::alt;
    use nom::bytes::complete::tag;
    use nom::character::complete::digit1;
    use nom::combinator::all_consuming;
    use nom::sequence::{terminated, tuple};
    use nom::IResult;
    
    #[derive(Debug)]
    struct Time {
        min: u32,
        sec: u32,
    }
    
    fn main() {
        //OK
        let parsed = time("10 minutes, 5 seconds");
        println!("{:?}", parsed);
    
        //OK
        let parsed = time("10 mins, 5 s");
        println!("{:?}", parsed);
    
        //Error -> although `min` is a valid tag, it would expect `, ` afterwards, instead of `ts`
        let parsed = time("10 mints, 5 s");
        println!("{:?}", parsed);
    
        //Error -> there must not be anything left after "5 s"
        let parsed = time("10 mins, 5 s, ");
        println!("{:?}", parsed);
    
        // Error -> although it starts with `sec` which is a valid tag, it will fail, because it would expect EOF
        let parsed = time("10 min, 5 sections");
        println!("{:?}", parsed);
    }
    
    fn time(input: &str) -> IResult<&str, Time> {
        // parse the minutes section and **expect** a delimiter, because there **must** be another section afterwards
        let (rem, min) = terminated(minutes_section, delimiter)(input)?;
    
        // parse the seconds section and **expect** EOF - i.e. there should not be any input left to parse
        let (rem, sec) = all_consuming(seconds_section)(rem)?;
    
        // rem should be empty slice
        IResult::Ok((rem, Time { min, sec }))
    }
    
    // This function combines several parsers to parse the minutes section:
    // NUMBER[sep]TAG-MINUTES
    fn minutes_section(input: &str) -> IResult<&str, u32> {
        let (rem, (min, _sep, _tag)) = tuple((number, separator, minutes))(input)?;
    
        IResult::Ok((rem, min))
    }
    
    // This function combines several parsers to parse the seconds section:
    // NUMBER[sep]TAG-SECONDS
    fn seconds_section(input: &str) -> IResult<&str, u32> {
        let (rem, (sec, _sep, _tag)) = tuple((number, separator, seconds))(input)?;
    
        IResult::Ok((rem, sec))
    }
    
    fn number(input: &str) -> IResult<&str, u32> {
        // `map_res` turns a failed conversion (e.g. a value that does not
        // fit into u32) into a parse error instead of panicking
        nom::combinator::map_res(digit1, |digits: &str| digits.parse::<u32>())(input)
    }
    
    fn minutes(input: &str) -> IResult<&str, &str> {
        alt((
            tag("minutes"),
            tag("minute"),
            tag("mins"),
            tag("min"),
            tag("m"),
        ))(input)
    }
    
    fn seconds(input: &str) -> IResult<&str, &str> {
        alt((
            tag("seconds"),
            tag("second"),
            tag("secs"),
            tag("sec"),
            tag("s"),
        ))(input)
    }
    
    // This function parses the separator between the number and the tag:
    //N<separator>tag -> 5[sep]minutes
    fn separator(input: &str) -> IResult<&str, &str> {
        tag(" ")(input)
    }
    
    // This function parses the delimiter between the sections:
    // X minutes<delimiter>Y seconds -> 1 min[delimiter]2 sec
    fn delimiter(input: &str) -> IResult<&str, &str> {
        tag(", ")(input)
    }
    

    Here I have created a set of basic parsers for the building blocks, such as "number", "separator", "delimiter", and the various markers (min, sec, etc.). None of those is expected to match a "whole word". Instead, you should use combinators such as terminated, tuple, and all_consuming to mark where the "exact word" ends.