pythonscripting-languageembedded-language

Embedding a Low Performance Scripting Language in Python


I have a web-application. As part of this, I need users of the app to be able to write (or copy and paste) very simple scripts to run against their data.

The scripts really can be very simple, and performance is only the most minor issue. And example of the sophistication of script I mean are something like:

ratio = 1.2345678
minimum = 10

def convert(money)
    return money * ratio
end

if price < minimum
    cost = convert(minimum)
else
    cost = convert(price)
end

where price and cost are a global variables (something I can feed into the environment and access after the computation).

I do, however, need to guarantee some stuff.

  1. Any scripts run cannot get access to the environment of Python. They cannot import stuff, call methods I don't explicitly expose for them, read or write files, spawn threads, etc. I need total lockdown.

  2. I need to be able to put a hard-limit on the number of 'cycles' that a script runs for. Cycles is a general term here. could be VM instructions if the language byte-compiled. Apply-calls for an Eval/Apply loop. Or just iterations through some central processing loop that runs the script. The details aren't as important as my ability to stop something running after a short time and send an email to the owner and say "your scripts seems to be doing more than adding a few numbers together - sort them out."

  3. It must run on Vanilla unpatched CPython.

So far I've been writing my own DSL for this task. I can do that. But I wondered if I could build on the shoulders of giants. Is there a mini-language available for Python that would do this?

There are plenty of hacky Lisp-variants (Even one I wrote on Github), but I'd prefer something with more non-specialist syntax (more C or Pascal, say), and as I'm considering this as an alternative to coding one myself I'd like something a bit more mature.

Any ideas?


Solution

  • Here is my take on this problem. Requiring that the user scripts run inside vanilla CPython means you either need to write an interpreter for your mini language, or compile it to Python bytecode (or use Python as your source language) and then "sanitize" the bytecode before executing it.

    I've gone for a quick example based on the assumption that users can write their scripts in Python, and that the source and bytecode can be sufficiently sanitized through some combination of filtering unsafe syntax from the parse tree and/or removing unsafe opcodes from the bytecode.

    The second part of the solution requires that the user script bytecode be periodically interrupted by a watchdog task which will ensure that the user script does not exceed some opcode limit, and for all of this to run on vanilla CPython.

    Summary of my attempt, which mostly focuses on the 2nd part of the problem.

    Hopefully this at least goes in the right direction. I'm interested to hear more about your solution when you arrive at it.

    Source code for lowperf.py:

    # std
    import ast
    import dis
    import sys
    from pprint import pprint
    
    # vendor
    import byteplay
    import greenlet
    
    # bytecode snippet to increment our global opcode counter
    INCREMENT = [
        (byteplay.LOAD_GLOBAL, '__op_counter'),
        (byteplay.LOAD_CONST, 1),
        (byteplay.INPLACE_ADD, None),
        (byteplay.STORE_GLOBAL, '__op_counter')
        ]
    
    # bytecode snippet to perform a yield to our watchdog tasklet.
    YIELD = [
        (byteplay.LOAD_GLOBAL, '__yield'),
        (byteplay.LOAD_GLOBAL, '__op_counter'),
        (byteplay.CALL_FUNCTION, 1),
        (byteplay.POP_TOP, None)
        ]
    
    def instrument(orig):
        """
        Instrument bytecode.  We place a call to our yield function before
        jumps and returns.  You could choose alternate places depending on 
        your use case.
        """
        line_count = 0
        res = []
        for op, arg in orig.code:
            line_count += 1
    
            # NOTE: you could put an advanced bytecode filter here.
    
            # whenever a code block is loaded we must instrument it
            if op == byteplay.LOAD_CONST and isinstance(arg, byteplay.Code):
                code = instrument(arg)
                res.append((op, code))
                continue
    
            # 'setlineno' opcode is a safe place to increment our global 
            # opcode counter.
            if op == byteplay.SetLineno:
                res += INCREMENT
                line_count += 1
    
            # append the opcode and its argument
            res.append((op, arg))
    
            # if we're at a jump or return, or we've processed 10 lines of
            # source code, insert a call to our yield function.  you could 
            # choose other places to yield more appropriate for your app.
            if op in (byteplay.JUMP_ABSOLUTE, byteplay.RETURN_VALUE) \
                    or line_count > 10:
                res += YIELD
                line_count = 0
    
        # finally, build and return new code object
        return byteplay.Code(res, orig.freevars, orig.args, orig.varargs,
            orig.varkwargs, orig.newlocals, orig.name, orig.filename,
            orig.firstlineno, orig.docstring)
    
    def transform(path):
        """
        Transform the Python source into a form safe to execute and return
        the bytecode.
        """
        # NOTE: you could call ast.parse(data, path) here to get an
        # abstract syntax tree, then filter that tree down before compiling
        # it into bytecode.  i've skipped that step as it is pretty verbose.
        data = open(path, 'rb').read()
        suite = compile(data, path, 'exec')
        orig = byteplay.Code.from_code(suite)
        return instrument(orig)
    
    def execute(path, limit = 40):
        """
        This transforms the user's source code into bytecode, instrumenting
        it, then kicks off the watchdog and user script tasklets.
        """
        code = transform(path)
        target = greenlet.greenlet(run_task)
    
        def watcher_task(op_count):
            """
            Task which is yielded to by the user script, making sure it doesn't
            use too many resources.
            """
            while 1:
                if op_count > limit:
                    raise RuntimeError("script used too many resources")
                op_count = target.switch()
    
        watcher = greenlet.greenlet(watcher_task)
        target.switch(code, watcher.switch)
    
    def run_task(code, yield_func):
        "This is the greenlet task which runs our user's script."
        globals_ = {'__yield': yield_func, '__op_counter': 0}
        eval(code.to_code(), globals_, globals_)
    
    execute(sys.argv[1])
    

    Here is a sample user script user.py:

    def otherfunc(b):
        return b * 7
    
    def myfunc(a):
        for i in range(0, 20):
            print i, otherfunc(i + a + 3)
    
    myfunc(2)
    

    Here is a sample run:

    % python lowperf.py user.py
    
    0 35
    1 42
    2 49
    3 56
    4 63
    5 70
    6 77
    7 84
    8 91
    9 98
    10 105
    11 112
    Traceback (most recent call last):
      File "lowperf.py", line 114, in <module>
        execute(sys.argv[1])
      File "lowperf.py", line 105, in execute
        target.switch(code, watcher.switch)
      File "lowperf.py", line 101, in watcher_task
        raise RuntimeError("script used too many resources")
    RuntimeError: script used too many resources