I'm currently addressing the Pandas DataFrame.query() Code Injection vulnerability, which allows arbitrary code execution if unsafe user input is processed by the .query() method. I understand this issue arises because the query() method can execute expressions within the context of the DataFrame, potentially leading to security risks.
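For concreteness, the pattern I'm worried about looks roughly like this (the hard-coded string here stands in for an attacker-controlled value):

import pandas as pd

df = pd.DataFrame({'num': [1, -2, 3]})
user_input = 'num > 0'            # imagine this string arrives from an untrusted client
result = df.query(user_input)     # the worry: the string is evaluated as an expression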
My questions are as follows:
For example, I could override the query() method in the Pandas source code to validate expressions using Python's ast module and block unsafe constructs. Is this a recommended approach, or does it pose potential risks (e.g., breaking functionality, or the burden of maintaining the patch long-term)?
Should I wait for an official update to address this issue? Are there any best practices for monitoring when a fix becomes available?
Is it better to apply temporary mitigations (like validating input in my application code) and wait for the library maintainers, or should I fork/patch the library for immediate resolution?
Any insights, especially from those with experience in maintaining secure Python applications, would be greatly appreciated!
Should I wait for an official update to address this issue? Are there any best practices for monitoring when a fix becomes available?
The 'vulnerability' in question strikes me as basically unfixable, so I would not expect a fix to become available.
The method DataFrame.query() is designed to allow a user to run essentially arbitrary Python code to filter a DataFrame. Passing untrusted code to DataFrame.query() is exactly as dangerous as passing untrusted code to eval().
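As a minimal illustration of how much power a query expression has, note that it can reach variables in the calling scope via the @ prefix:

import pandas as pd

df = pd.DataFrame({'num': [1, -2, 3]})
threshold = 0
df.query('num > @threshold')  # the expression reads a local variable of the caller

An attacker who controls the expression therefore controls code that runs with access to your program's state, just as with eval().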
I asked about this on the Pandas issue tracker, and this was the response from one of Pandas's contributors:
Q: My question is about Pandas's security model. What security guarantees does Pandas make about DataFrame.query() with an attacker-controlled expr? My intuition about this is "none, don't do that," but I'm wondering what the Pandas project thinks.
A: This is indeed my take, both query and eval should be used with string literals and not with strings provided by or derived from untrusted user input.
(Source.)
For example, I could override the query() method in the Pandas source code to validate expressions using Python's ast module and block unsafe constructs.
That strikes me as very difficult to do in the general case. If you look at this thread, you'll see example after example of people proposing a way to sandbox Python execution, and it turns out not to be a perfect sandbox because of some feature of Python the answer doesn't take into account. For that reason, I think that disallowing unsafe constructs is a doomed approach: it requires anticipating every unsafe construct.
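To make that concrete, here is a hypothetical sketch of a blocklist-style validator (naive_validate and BLOCKED_NAMES are names I made up), together with a well-known kind of expression that slips past it without mentioning any blocked name:

import ast

BLOCKED_NAMES = {'eval', 'exec', 'open', '__import__'}

def naive_validate(expr: str) -> bool:
    """Reject expressions that mention a blocked name directly."""
    tree = ast.parse(expr, mode='eval')
    return not any(
        isinstance(node, ast.Name) and node.id in BLOCKED_NAMES
        for node in ast.walk(tree)
    )

# No blocked name appears anywhere, yet the expression starts walking the
# object graph toward every loaded class, a standard first step in
# Python sandbox escapes.
payload = '().__class__.__base__.__subclasses__()'
assert naive_validate(payload)  # the blocklist is silently bypassed

Every construct you block just pushes an attacker toward the next one you didn't anticipate.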
Rather, I think you should come up with a list of safe constructs, and only allow those.
For example, you could compare the expression against a known-good list of expressions, and run only those that match:
# Only expressions that appear verbatim in this allowlist may run.
allowlist = [
    'num > 0',
    'num == 0',
    'num < 0',
]
if expr in allowlist:
    result = df.query(expr)
else:
    raise ValueError('Illegal expr value')
This restricts the strings that can be passed to DataFrame.query() to one of these pre-approved values.
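If fixed strings are too rigid, you can extend the idea slightly: allowlist expression templates and pass user-supplied values in through query()'s @-variable mechanism, so untrusted data never becomes part of the expression text. A sketch, with made-up names (TEMPLATES, safe_query):

import pandas as pd

TEMPLATES = {
    'gt': 'num > @value',
    'eq': 'num == @value',
    'lt': 'num < @value',
}

def safe_query(df: pd.DataFrame, op: str, value) -> pd.DataFrame:
    """The user chooses only which template runs and the numeric value;
    the expression text itself is never attacker-controlled."""
    if op not in TEMPLATES:
        raise ValueError('unsupported operation')
    value = float(value)  # coerce, so no attacker-controlled string is evaluated
    return df.query(TEMPLATES[op])  # @value resolves to this function's local

df = pd.DataFrame({'num': [1, -2, 3]})
print(safe_query(df, 'gt', 0))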
Is it better to apply temporary mitigations (like validating input in my application code) and wait for the library maintainers, or should I fork/patch the library for immediate resolution?
That's hard to answer in general. I would weigh three factors:
In this specific case, I would suggest validating input within application code, unless #1 is really difficult.