c++ regex string c++11

std::regex escape special characters for use in regex


I'm trying to create a std::regex(__FILE__) as part of a unit test that checks some exception output containing the file name.

On Windows it fails with:

regex_error(error_escape): The expression contained an invalid escaped character, or a trailing escape.

because the __FILE__ macro expansion contains un-escaped backslashes.

Is there a more elegant way to escape the backslashes than to loop through the resulting string (i.e. with a std algorithm or some std::string function)?


Solution

  • File paths can contain many characters that have special meaning in regular expression patterns. Escaping just the backslashes is not enough for robust checking in the general case.

    Even a simple path, like C:\Program Files (x86)\Vendor\Product\app.exe, contains several special characters. If you want to turn that into a regular expression (or part of a regular expression), you would need to escape not only the backslashes but also the parentheses and the period (dot).

    Fortunately, we can solve our regular expression problem with more regular expressions:

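    #include <regex>
    #include <string>

    std::string EscapeForRegularExpression(const std::string &s) {
      // The character class must include the backslash itself (written \\ here).
      static const std::regex metacharacters(R"([\\\.\^\$\+\(\)\[\]\{\}\|\?\*])");
      return std::regex_replace(s, metacharacters, "\\$&");
    }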
    

    (Windows file paths can't contain * or ?, but I've included them to keep the function general.)

    If you don't abide by the "no raw loops" guideline, a probably faster implementation would avoid regular expressions:

    #include <cstring>
    #include <string>

    std::string EscapeForRegularExpression(const std::string &s) {
      static const char metacharacters[] = R"(\.^$+()[]{}|?*)";
      std::string out;
      out.reserve(s.size());
      for (auto ch : s) {
        // strchr also matches the terminating '\0', so exclude embedded nulls.
        if (ch != '\0' && std::strchr(metacharacters, ch))
          out.push_back('\\');
        out.push_back(ch);
      }
      return out;
    }
    

    Although the loop adds some clutter, this approach allows us to drop a level of escaping on the definition of metacharacters, which is a readability win over the regex version.
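
As a rough usage sketch, the escaped path can be spliced into a larger pattern, which is what the unit test in the question needs. The helper name MentionsFileAndLine and the "file(line)" message layout are assumptions made up for illustration; the escaping function is the loop-based version, repeated so the example compiles on its own.

```cpp
#include <cstring>
#include <regex>
#include <string>

// Loop-based escaper: prefixes every regex metacharacter with a backslash.
std::string EscapeForRegularExpression(const std::string &s) {
  static const char metacharacters[] = R"(\.^$+()[]{}|?*)";
  std::string out;
  out.reserve(s.size());
  for (auto ch : s) {
    // strchr also matches the terminating '\0', so exclude embedded nulls.
    if (ch != '\0' && std::strchr(metacharacters, ch))
      out.push_back('\\');
    out.push_back(ch);
  }
  return out;
}

// Hypothetical helper (not from the original answer): returns true if
// `message` contains `file` verbatim, followed by a line number in
// parentheses, e.g. "C:\src\main.cpp(42)". The layout is an assumption.
bool MentionsFileAndLine(const std::string &message, const std::string &file) {
  const std::regex pattern(EscapeForRegularExpression(file) + R"(\(\d+\))");
  return std::regex_search(message, pattern);
}
```

In a test, this could be called as MentionsFileAndLine(e.what(), __FILE__), assuming the exception text really uses that layout. Because the path is escaped first, the dot in a name like my.test matches only a literal dot, not any character.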