I am trying to tokenize a string such that
"word1 word2 word3 word4"
is split into 4 strings: "word1", "word2", "word3" and "word4", while
"word1 \"word2 word3\" word4"
is split into 3 strings: "word1", "word2 word3" and "word4".
I have written a function tokenizeQuoted() that does the job. After reading each token, it checks the stream's failbit to detect errors.
#include <cstring>
#include <cwchar>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Tokenize a string using std::quoted
template <typename CharType>
std::vector<std::basic_string<CharType>> tokenizeQuoted(const std::basic_string<CharType> &input)
{
    std::basic_istringstream<CharType> iss(input);
    std::vector<std::basic_string<CharType>> tokens;
    std::basic_string<CharType> token;
    while (!iss.eof())
    {
        iss >> std::quoted(token);
        if (iss.fail())
        {
            throw std::runtime_error("failed to tokenize string: '" + input + "'; bad bit = " + (iss.bad() ? "true" : "false"));
        }
        tokens.push_back(token);
    }
    return tokens;
}

int main() {
    const std::string inputMars = "\"hello mars\"!"; // note the '!' at the end
    const std::string inputEarth = "\"hello earth\"";
    const auto mars = tokenizeQuoted(inputMars); // OK
    const auto earth = tokenizeQuoted(inputEarth); // failbit is set
    return mars.size() + earth.size();
}
In general the function works. But when the input string ends with a quoted string (like "say \"good day\""), the failbit is set, which I did not expect. How can I reliably detect errors and still extract a quoted string at the end of the input?
This doesn't give a definite answer as to why it is happening, but I did find a workaround. I observed that although std::quoted moves the position indicator to the correct place, it apparently does not check for the end of the input after the closing quote, so the eof bit is not set. The workaround is to check whether the eof bit is set and, if it is not, to call peek(). Peeking at the next character changes nothing if we are not at the end of the input, but if we are, it sets the eof bit. The program below demonstrates this by printing each extracted token along with the stream bits before and after the call to peek().
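The effect can also be isolated in a few lines. This is only a minimal sketch, assuming a standard library that behaves as described in the question; eofAroundPeek is a hypothetical helper name, not part of the original code. It extracts a single token with std::quoted and reports the eof bit before and after peek().

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper (not from the original post): extracts one token with
// std::quoted and returns eof() as observed {before, after} a call to peek().
std::pair<bool, bool> eofAroundPeek(const std::string& input)
{
    std::istringstream iss(input);
    std::string token;
    iss >> std::quoted(token);      // stops right after the closing quote
    const bool before = iss.eof();  // no read past the end has been attempted yet
    iss.peek();                     // looks at the next character without consuming it
    const bool after = iss.eof();   // set only if there was nothing left to look at
    return {before, after};
}
```

Calling eofAroundPeek with the "hello earth" input should show the eof bit appearing only after peek(), while the "hello mars" input with the trailing '!' should leave it unset both times, because a character is still available.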
#include <cstring>
#include <cwchar>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Tokenize a string using std::quoted
template <typename CharType>
std::vector<std::basic_string<CharType>> tokenizeQuoted(const std::basic_string<CharType> &input)
{
    std::basic_istringstream<CharType> iss(input);
    std::vector<std::basic_string<CharType>> tokens;
    std::basic_string<CharType> token;
    while (iss >> std::ws && !iss.eof())
    {
        token.clear();
        iss >> std::quoted(token);
        if (iss.fail())
        {
            throw std::runtime_error("failed to tokenize string: '" + input + "'; bad bit = " + (iss.bad() ? "true" : "false"));
        }
        tokens.push_back(token);
        std::cerr << "Token : " << token << std::endl;
        std::cerr << "Before: " << iss.good() << iss.eof() << iss.fail() << iss.bad() << std::endl;
        if (!iss.eof()) {
            iss.peek();
        }
        std::cerr << "After : " << iss.good() << iss.eof() << iss.fail() << iss.bad() << std::endl << std::endl;
    }
    return tokens;
}

int main()
{
    std::vector<std::string> inputs {
        R"("hello mars"!)",
        R"("hello earth")",
        R"(no quotes)",
        R"("unfinished quotes)",
        R"(")",
        R"("")",
        R"(""")",
        R"("""")",
        R"( "leading whitespace")",
        R"("trailing whitespace" )",
    };
    for (const auto& input : inputs)
    {
        std::cout << "Tokenizing '" << input << "'\n";
        try {
            auto tokens = tokenizeQuoted(input);
            for (const auto & token : tokens)
            {
                std::cout << " - '" << token << "'\n";
            }
        } catch (std::runtime_error& e) {
            std::cout << e.what() << "\n";
        }
    }
}
Note how the eof bit updates to the correct state after calling peek() in the "hello earth" extraction.
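For reference, the same idea without the diagnostic output might look like the sketch below. It is an illustration under assumptions rather than a definitive implementation: tokenizeQuotedSketch is a hypothetical name, and the function is restricted to char for brevity. Instead of testing eof() after the fact, it skips whitespace and peeks before each extraction, so the loop only attempts a read when a character is actually available.

```cpp
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical name; a char-only variant of the tokenizeQuoted() idea.
std::vector<std::string> tokenizeQuotedSketch(const std::string& input)
{
    std::istringstream iss(input);
    std::vector<std::string> tokens;
    std::string token;

    // Skip whitespace, then peek: peek() returns traits_type::eof() when
    // nothing is left, so the loop ends without a failed extraction.
    while (iss >> std::ws &&
           iss.peek() != std::istringstream::traits_type::eof())
    {
        // A failure here is a real error, e.g. an unterminated quoted string.
        if (!(iss >> std::quoted(token)))
        {
            throw std::runtime_error("failed to tokenize string: '" + input + "'");
        }
        tokens.push_back(token);
    }
    return tokens;
}
```

With this loop shape, an input that ends in a quoted string, ends in trailing whitespace, or is empty simply terminates the loop, and the failbit check is left to report genuine extraction failures such as a missing closing quote.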