regexperlsecuritytaint

Perl tainting via regular expression


Short version

In the code below, $1 is tainted and I don't understand why.

Long version

I'm running Foswiki on a system with perl v5.14.2 with -T taint check mode enabled. Debugging a problem with that setup, I managed to construct the following SSCCE. (Note that I edited this post, the first version was longer and more complicated, and comments still refer to that.)

#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");

Although the input string $var is untainted and the regular expression is fixed, the resulting capture group $1 is tainted. Which I find really strange.

The perlsec manual has this to say about taint and regular expressions:

Values may be untainted by using them as keys in a hash; otherwise the only way to bypass the tainting mechanism is by referencing subpatterns from a regular expression match. Perl presumes that if you reference a substring using $1, $2, etc., that you knew what you were doing when you wrote the pattern.

I would imagine that even if the input were tainted, the output would still be untainted. To observe the reverse, tainted output from untainted input, feels like a strange bug in perl. But if one reads more of perlsec, it also points users at the SECURITY section of perllocale. There we read:

when use locale is in effect, Perl uses the tainting mechanism (see perlsec) to mark string results that become locale-dependent, and which may be untrustworthy in consequence. Here is a summary of the tainting behavior of operators and functions that may be affected by the locale:

  • Comparison operators (lt, le , ge, gt and cmp) […]

  • Case-mapping interpolation (with \l, \L, \u or \U) […]

  • Matching operator (m//):

    Scalar true/false result never tainted.

    Subpatterns, either delivered as a list-context result or as $1 etc. are tainted if use locale (but not use locale ':not_characters') is in effect, and the subpattern regular expression contains \w (to match an alphanumeric character), \W (non-alphanumeric character), \s (whitespace character), or \S (non whitespace character). The matched-pattern variable, $&, $` (pre-match), $' (post-match), and $+ (last match) are also tainted if use locale is in effect and the regular expression contains \w, \W, \s, or \S.

  • Substitution operator (s///) […]

        [⋮]

This looks like it should be an exhaustive list. And I don't see how it could apply: My regex is not using any of \w, \W, \s or \S, so it should not depend on locale.

Can someone explain why this code taints the varibale $1?


Solution

  • There currently is a discrepancy between the documentation as quoted in the question, and the actual implementation as of perl 5.18.1. The problem are character classes. The documentation mentions \w,\s,\W,\S in what sounds like an exhaustive list, while the implementation taints on pretty much every use of […].

    The right solution would probably be somewhere in between: character classes like [[:word:]] should taint, since it depends on locale. My fixed list should not. Character ranges like [a-z] depend on collation, so in my personal opinion they should taint as well. \d depends on what a locale considers a digit, so it, too, should taint even if it is neither one of the escape sequences mentioned so far nor a bracketed class.

    So in my opinion, both the documentation and the implementation need fixing. Perl devs are working on this. For progress information, please look at the perl bug report I filed.

    For a fixed list of characters, one viable workaround appears to be a formulation as a disjunction, i.e. (?:\.|_) instead of [._]. It is more verbose, but should work even with the current (in my opinion buggy) perl versions.