c++ stringstream stringtokenizer string-view

Inconsistent output from gcount()


I have written the following simple MRE that regenerates a bug in my program:

#include <iostream>
#include <utility>
#include <sstream>
#include <string>
#include <string_view>
#include <array>
#include <vector>
#include <iterator>

// this function is working fine only if string_view contains all the user provided chars and nothing extra like null bytes
std::pair< bool, std::vector< std::string > > tokenize( const std::string_view inputStr, const std::size_t expectedTokenCount )
{
    // unnecessary implementation details

    std::stringstream ss;
    ss << inputStr.data( ); // works for null-terminated strings, but not for the non-null terminated strings

    // unnecessary implementation details
}

int main( )
{
    constexpr std::size_t REQUIRED_TOKENS_COUNT { 3 };
    std::array<char, 50> input_buffer { };

    std::cin.getline( input_buffer.data( ), input_buffer.size( ) ); // stores at most 49 characters plus a terminating '\0'

    const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), input_buffer.size( ) }, REQUIRED_TOKENS_COUNT ) };

    for ( const auto& token : foundTokens ) // print the tokens
    {
        std::cout << '\'' << token << "' ";
    }

    std::cout << '\n';
}

This is a program for tokenization (for full code see Compiler Explorer at the link below). Also, I use GCC v11.2.

First of all, I want to avoid using data() since it's a bit less efficient.

I looked at the assembly in Compiler Explorer and apparently, ss << inputStr.data() calls strlen(), so it stops when it reaches the first null byte. But what if the string_view object is not null-terminated? That's a bit concerning. So I switched to ss << inputStr;.

Secondly, when I do ss << inputStr;, the whole 50-character buffer is inserted into ss with all of its null bytes. Below are some sample outputs that are wrong:

sample #1:

1                  2    3
'1' '2' '3                                     ' // '1' and '2' are correct, '3' has lots of null bytes

sample #2 (in this one I typed a space character after 3):

1                  2    3
'1' '2' '3' '                                     ' // an extra token consisting of 1 space char and lots of null bytes has been created!

Is there a way to fix this? What should I do to also support non-null-terminated strings? I came up with the idea of using gcount() as below:

    const std::streamsize charCount { std::cin.gcount( ) };
                                                                                        // here I pass charCount instead of the size of buffer
    const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), charCount },
                                                                    REQUIRED_TOKENS_COUNT ) };

But the problem is that when the user enters fewer characters than the buffer size, gcount() returns a value that is 1 more than the actual number of entered chars (e.g. the user enters 5 characters but gcount() returns 6, apparently also taking '\0' into account).

This causes the last token to also have a null byte at its end:

1   2     3
'1' '2' '3 ' // see the null byte in '3 ', it's NOT a space char

How should I fix gcount's inconsistent output?

Or maybe I should change the function tokenize so that it gets rid of any '\0' at the end of the string_view and then starts to tokenize it.

It might sound like an XY problem though. But I really need help to decide what to do.


Solution

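  • The basic problem you have is with the operator<< functions. You've tried two of them:

    ss << inputStr.data( ) uses the overload for const char*, which reads characters up to the first NUL. It only works for null-terminated strings; for a string_view that is not null-terminated, it reads past the end of the view.

    ss << inputStr uses the overload for string_view, which inserts all size() characters of the view, including any NULs.

    It seems that what you want to do is take characters from the string_view up to (and not including) the first NUL or the end of the string_view, whichever comes first. You can do that by using find to locate the NUL and substr to take everything before it. If there is no NUL, find returns npos and substr keeps the whole view: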

    ss << inputStr.substr(0, inputStr.find('\0'));
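
    To show how that line fits into the function, here is a minimal sketch of tokenize. The question elides the real body, so the whitespace-splitting loop below is an assumption standing in for it; only the substr/find line comes from the answer above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Sketch only: the splitting loop is a hypothetical stand-in for the
// implementation details omitted in the question.
std::pair< bool, std::vector< std::string > > tokenize( const std::string_view inputStr, const std::size_t expectedTokenCount )
{
    std::stringstream ss;

    // Insert the characters before the first NUL. If the view contains no NUL,
    // find() returns npos and substr() keeps the whole view, so views that are
    // not null-terminated are handled as well.
    ss << inputStr.substr( 0, inputStr.find( '\0' ) );

    std::vector< std::string > foundTokens;
    for ( std::string token; ss >> token; ) // operator>> skips leading whitespace
    {
        foundTokens.push_back( token );
    }

    return { foundTokens.size( ) == expectedTokenCount, foundTokens };
}
```

    With this, main can keep passing the whole buffer, { input_buffer.data( ), input_buffer.size( ) }, and gcount() is not needed. As for the off-by-one: getline extracts the '\n' delimiter and counts it towards gcount(), but does not store it, which is why gcount() is one more than the number of characters stored when a full line was read.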