c++ stdstring char-traits

Leading/trailing whitespace insensitive traits for basic_string


I do a lot of parsing and processing where leading/trailing whitespace should be ignored and comparisons should be case-insensitive. So I wrote a basic char traits class for std::basic_string (see below) to save myself some work.

The trait does not work. The problem is that basic_string's compare calls the traits' compare and, if that evaluates to 0, returns the difference in sizes. The comment in basic_string.h says: "...If the result of the comparison is nonzero returns it, otherwise the shorter one is ordered first." Looks like they explicitly don't want me to do this...

What is the reason for having this additional "shorter one" ordering if trait's compare returns 0? And, is there any workaround or do I have to roll my own string?

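#include <algorithm>  // std::min
#include <cctype>     // std::isspace
#include <cstring>
#include <iostream>
#include <string>     // std::basic_string
#include <strings.h>  // strncasecmp (POSIX)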

namespace csi{
template<typename T>
struct char_traits : std::char_traits<T>
{
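    // Note: basic_string passes n = min(lhs.size(), rhs.size()), so both
    // inputs are seen truncated to the same length.
    static int compare(T const*s1, T const*s2, size_t n){
        // Skip leading and trailing whitespace of s1.
        size_t n1(n);
        while(n1>0&&std::isspace(static_cast<unsigned char>(*s1)))
            ++s1, --n1;
        while(n1>0&&std::isspace(static_cast<unsigned char>(s1[n1-1])))
            --n1;
        // Same for s2.
        size_t n2(n);
        while(n2>0&&std::isspace(static_cast<unsigned char>(*s2)))
            ++s2, --n2;
        while(n2>0&&std::isspace(static_cast<unsigned char>(s2[n2-1])))
            --n2;
        // Case-insensitive compare of the trimmed common prefix.
        return strncasecmp(static_cast<char const*>(s1),
                           static_cast<char const*>(s2),
                           std::min(n1,n2));
    }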
};
using string = std::basic_string<char,char_traits<char>>;
}

int main()
{
    using namespace csi;
    string s1 = "hello";
    string s2 = " HElLo ";
    std::cout << std::boolalpha
              << "s1==s2" << " " << (s1==s2) << std::endl;
}

Solution

  • Converting data that has more than one possible representation into a "standard" or "normal" form is called canonicalization. With text it usually means removing accents, folding case, and trimming whitespace and/or format characters.

    If canonicalization is done under the hood during each compare, it is fragile. For example, how would you test that it was applied correctly to both s1 and s2? It is also inflexible: you cannot display its result or cache it for the next compare. So it is both more robust and more efficient to canonicalize as an explicit step.

    What is the reason for having this additional "shorter one" ordering if trait's compare returns 0?

    The traits' compare is required to compare only the first n characters, and basic_string passes the length of the shorter string as n. So what should it return when you compare "hellow" and "hello"? It should return 0, because the first five characters are equal. You cannot work around this by ignoring n, because the traits must also work with std::string_view, which is not zero-terminated. If the size comparison were dropped, "hellow" and "hello" would compare equal, which you likely don't want.
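Both points above can be illustrated with a minimal sketch. It is not code from the question or the answer: the names canonicalize and common_prefix_equal are hypothetical, and it assumes ASCII-style case folding via std::tolower is good enough for the data at hand. The first function performs the explicit canonicalization step, and the second shows what a traits-level compare gets to see when the two strings differ in length.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns a copy of `in` with leading/trailing
// whitespace removed and letters lower-cased (per the current C locale).
std::string canonicalize(const std::string& in)
{
    // Taking unsigned char avoids undefined behaviour in <cctype>
    // functions for negative char values.
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };

    // Narrow [b, e) until it no longer starts or ends with whitespace.
    std::size_t b = 0, e = in.size();
    while (b < e && is_space(in[b])) ++b;
    while (e > b && is_space(in[e - 1])) --e;

    std::string out = in.substr(b, e - b);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}

// Hypothetical helper: mimics the first half of basic_string::compare by
// handing the traits only the common prefix of both strings.
bool common_prefix_equal(const std::string& a, const std::string& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    return std::char_traits<char>::compare(a.data(), b.data(), n) == 0;
}
```

With these, canonicalize("hello") == canonicalize(" HElLo ") holds using the ordinary std::string comparison, and the canonical form can be stored, printed, or tested on its own. common_prefix_equal("hellow", "hello") is true, which is why basic_string needs the size tie-break to keep those two strings distinct.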