Embedding a Low Performance Scripting Language in Python [closed]

8

24

I have a web-application. As part of this, I need users of the app to be able to write (or copy and paste) very simple scripts to run against their data.

The scripts really can be very simple, and performance is only the most minor issue. An example of the level of sophistication I mean is something like:

ratio = 1.2345678
minimum = 10

def convert(money)
    return money * ratio
end

if price < minimum
    cost = convert(minimum)
else
    cost = convert(price)
end

where price and cost are global variables (something I can feed into the environment and access after the computation).

I do, however, need to guarantee some stuff.

  1. Any scripts run cannot get access to the environment of Python. They cannot import stuff, call methods I don't explicitly expose for them, read or write files, spawn threads, etc. I need total lockdown.

  2. I need to be able to put a hard limit on the number of 'cycles' that a script runs for. 'Cycles' is a general term here: it could be VM instructions if the language is byte-compiled, apply-calls for an eval/apply loop, or just iterations through some central processing loop that runs the script. The details aren't as important as my ability to stop something running after a short time, send an email to the owner, and say "your script seems to be doing more than adding a few numbers together - sort it out."

  3. It must run on Vanilla unpatched CPython.
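
To make requirement 2 concrete, here is a minimal sketch of an eval loop that charges one 'cycle' per syntax-tree node visited. It borrows Python's own ast module as the parser purely for illustration, so the sample script is written with Python syntax rather than the `end`-terminated syntax above. The names `Interpreter`, `StepLimitExceeded` and `run` are hypothetical, and the supported subset (assignment, if, while, arithmetic, comparison) is an assumption chosen to keep the example short; it has no function definitions.

```python
import ast
import operator

class StepLimitExceeded(Exception):
    """Raised when a script uses more evaluation steps than allowed."""

BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
CMP_OPS = {ast.Lt: operator.lt, ast.Gt: operator.gt, ast.Eq: operator.eq}

class Interpreter:
    """Hypothetical tree-walking interpreter with a hard step budget."""

    def __init__(self, env, max_steps):
        self.env = env              # the only variables the script can see
        self.steps = 0
        self.max_steps = max_steps

    def tick(self):
        # One "cycle" per node visited; this is the hard limit.
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded("more than %d steps" % self.max_steps)

    def run_block(self, statements):
        for stmt in statements:
            self.execute(stmt)

    def execute(self, node):
        self.tick()
        if isinstance(node, ast.Assign):
            target = node.targets[0]
            if len(node.targets) != 1 or not isinstance(target, ast.Name):
                raise SyntaxError("only simple assignment is supported")
            self.env[target.id] = self.evaluate(node.value)
        elif isinstance(node, ast.If):
            branch = node.body if self.evaluate(node.test) else node.orelse
            self.run_block(branch)
        elif isinstance(node, ast.While):
            while self.evaluate(node.test):
                self.run_block(node.body)
        else:
            # Imports, defs, classes, calls etc. are simply not implemented.
            raise SyntaxError("statement not allowed: " + type(node).__name__)

    def evaluate(self, node):
        self.tick()
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return self.env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            left, right = self.evaluate(node.left), self.evaluate(node.right)
            return BIN_OPS[type(node.op)](left, right)
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in CMP_OPS):
            left = self.evaluate(node.left)
            right = self.evaluate(node.comparators[0])
            return CMP_OPS[type(node.ops[0])](left, right)
        raise SyntaxError("expression not allowed: " + type(node).__name__)

def run(source, env, max_steps=1000):
    """Run source against env and return the number of steps used."""
    interp = Interpreter(env, max_steps)
    interp.run_block(ast.parse(source).body)
    return interp.steps

SCRIPT = """
ratio = 1.2345678
minimum = 10
if price < minimum:
    cost = minimum * ratio
else:
    cost = price * ratio
"""

env = {"price": 5}      # fed in by the host application
run(SCRIPT, env)        # afterwards env["cost"] holds the result
```

Because the script is never handed to Python's own `exec`, anything the interpreter does not implement is unreachable by construction, and the step counter is independent of server load.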

So far I've been writing my own DSL for this task. I can do that. But I wondered if I could build on the shoulders of giants. Is there a mini-language available for Python that would do this?

There are plenty of hacky Lisp variants (even one I wrote, on GitHub), but I'd prefer something with a more non-specialist syntax (more like C or Pascal, say), and as I'm considering this as an alternative to coding one myself, I'd like something a bit more mature.

Any ideas?

Uteutensil answered 24/2, 2011 at 0:27 Comment(2)
"yo dawg, I heard you like scripting languages"Mirabella
Thanks to the answers so far. But I'm leaving this open a bit longer and adding a bounty to see if there's something more to the point. I want to stress that with all the tools around for parser generation, building a parser is not the hard bit of building a language. So I want to encourage answers that address the meat of the issue. Thanks!Uteutensil
A
18

Here is my take on this problem. Requiring that the user scripts run inside vanilla CPython means you either need to write an interpreter for your mini language, or compile it to Python bytecode (or use Python as your source language) and then "sanitize" the bytecode before executing it.

I've gone for a quick example based on the assumption that users can write their scripts in Python, and that the source and bytecode can be sufficiently sanitized through some combination of filtering unsafe syntax from the parse tree and/or removing unsafe opcodes from the bytecode.

The second part of the solution requires that the user script bytecode be periodically interrupted by a watchdog task which will ensure that the user script does not exceed some opcode limit, and for all of this to run on vanilla CPython.

Here is a summary of my attempt, which mostly focuses on the second part of the problem:

  • User scripts are written in Python.
  • Use byteplay to filter and modify the bytecode.
  • Instrument the user's bytecode to insert an opcode counter and calls to a function which context switches to the watchdog task.
  • Use greenlet to execute the user's bytecode, with yields switching between the user's script and the watchdog coroutine.
  • The watchdog enforces a preset limit on the number of opcodes which can be executed before raising an error.

Hopefully this at least goes in the right direction. I'm interested to hear more about your solution when you arrive at it.

Source code for lowperf.py:

# std
import ast
import dis
import sys
from pprint import pprint

# vendor
import byteplay
import greenlet

# bytecode snippet to increment our global opcode counter
INCREMENT = [
    (byteplay.LOAD_GLOBAL, '__op_counter'),
    (byteplay.LOAD_CONST, 1),
    (byteplay.INPLACE_ADD, None),
    (byteplay.STORE_GLOBAL, '__op_counter')
    ]

# bytecode snippet to perform a yield to our watchdog tasklet.
YIELD = [
    (byteplay.LOAD_GLOBAL, '__yield'),
    (byteplay.LOAD_GLOBAL, '__op_counter'),
    (byteplay.CALL_FUNCTION, 1),
    (byteplay.POP_TOP, None)
    ]

def instrument(orig):
    """
    Instrument bytecode.  We place a call to our yield function before
    jumps and returns.  You could choose alternate places depending on 
    your use case.
    """
    line_count = 0
    res = []
    for op, arg in orig.code:
        line_count += 1

        # NOTE: you could put an advanced bytecode filter here.

        # whenever a code block is loaded we must instrument it
        if op == byteplay.LOAD_CONST and isinstance(arg, byteplay.Code):
            code = instrument(arg)
            res.append((op, code))
            continue

        # 'setlineno' opcode is a safe place to increment our global 
        # opcode counter.
        if op == byteplay.SetLineno:
            res += INCREMENT
            line_count += 1

        # append the opcode and its argument
        res.append((op, arg))

        # if we're at a jump or return, or we've processed roughly 10
        # opcodes since the last yield, insert a call to our yield function.
        # you could choose other places to yield more appropriate for your app.
        if op in (byteplay.JUMP_ABSOLUTE, byteplay.RETURN_VALUE) \
                or line_count > 10:
            res += YIELD
            line_count = 0

    # finally, build and return new code object
    return byteplay.Code(res, orig.freevars, orig.args, orig.varargs,
        orig.varkwargs, orig.newlocals, orig.name, orig.filename,
        orig.firstlineno, orig.docstring)

def transform(path):
    """
    Transform the Python source into a form safe to execute and return
    the bytecode.
    """
    # NOTE: you could call ast.parse(data, path) here to get an
    # abstract syntax tree, then filter that tree down before compiling
    # it into bytecode.  i've skipped that step as it is pretty verbose.
    data = open(path, 'rb').read()
    suite = compile(data, path, 'exec')
    orig = byteplay.Code.from_code(suite)
    return instrument(orig)

def execute(path, limit = 40):
    """
    This transforms the user's source code into bytecode, instrumenting
    it, then kicks off the watchdog and user script tasklets.
    """
    code = transform(path)
    target = greenlet.greenlet(run_task)

    def watcher_task(op_count):
        """
        Task which is yielded to by the user script, making sure it doesn't
        use too many resources.
        """
        while 1:
            if op_count > limit:
                raise RuntimeError("script used too many resources")
            op_count = target.switch()

    watcher = greenlet.greenlet(watcher_task)
    target.switch(code, watcher.switch)

def run_task(code, yield_func):
    "This is the greenlet task which runs our user's script."
    globals_ = {'__yield': yield_func, '__op_counter': 0}
    eval(code.to_code(), globals_, globals_)

execute(sys.argv[1])

Here is a sample user script user.py:

def otherfunc(b):
    return b * 7

def myfunc(a):
    for i in range(0, 20):
        print i, otherfunc(i + a + 3)

myfunc(2)

Here is a sample run:

% python lowperf.py user.py

0 35
1 42
2 49
3 56
4 63
5 70
6 77
7 84
8 91
9 98
10 105
11 112
Traceback (most recent call last):
  File "lowperf.py", line 114, in <module>
    execute(sys.argv[1])
  File "lowperf.py", line 105, in execute
    target.switch(code, watcher.switch)
  File "lowperf.py", line 101, in watcher_task
    raise RuntimeError("script used too many resources")
RuntimeError: script used too many resources
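
The listing above is Python 2 code and depends on two third-party packages. For comparison, a rough standard-library analogue of the watchdog can be sketched with `sys.settrace`. This is a swapped-in technique, not the bytecode instrumentation shown above: it counts executed source lines rather than opcodes, and it does nothing about sandboxing. The names `run_with_budget` and `BudgetExceeded` are hypothetical.

```python
import sys

class BudgetExceeded(RuntimeError):
    """Raised when the traced script runs more lines than its budget."""

def run_with_budget(source, env, limit=1000):
    """Hypothetical helper: exec source, allowing at most `limit` line events."""
    count = 0
    code = compile(source, "<user>", "exec")

    def tracer(frame, event, arg):
        nonlocal count
        # Only trace frames that belong to the user's script.
        if frame.f_code.co_filename != "<user>":
            return None
        if event == "line":
            count += 1
            if count > limit:
                raise BudgetExceeded("script used too many resources")
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(code, env)
    finally:
        sys.settrace(previous)   # always restore the previous tracer
    return count
```

One caveat with this sketch: the limit is delivered as an ordinary exception inside the script, so a script that is allowed to use try/except could swallow it. A real deployment would have to forbid exception handling in user code or enforce the limit some other way.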
Abuse answered 4/3, 2011 at 6:57 Comment(2)
There is an effort to revive restricted execution in CPython.Abuse
Thanks so much for that. I haven't come across byteplay before, it is a very nice approach, and really gets at the tough end of the problem. I'm going to have a play with this. This may well link nicely with the YouGov LimPy project (suggested by chmullig) to provide the restricted python subset. I'm getting the sense that I'm not just missing something that is already out there, so I'm going to accept this answer.Uteutensil
E
8

Jispy is the perfect fit!

  • It is a JavaScript interpreter in Python, built primarily for embedding JS in Python.

  • Notably, it provides checks and caps on recursion and looping. Just as is needed.

  • It easily allows you to make python functions available to JavaScript code.

  • By default, it doesn't expose the host's file system or any other sensitive element.

Full Disclosure:

  • Jispy is my project. I am obviously biased toward it.
  • Nonetheless, here, it really does seem to be the perfect fit.

PS:

  • This answer is being written ~3 years after this question was asked.
  • The motivation behind such a late answer is simple:
    Given how closely Jispy conforms to the question at hand, future readers with similar requirements should be able to benefit from it.
Ectropion answered 17/11, 2014 at 18:49 Comment(0)
M
5

Try Lua. The syntax you mentioned is almost identical to Lua's. See How could I embed Lua into Python 3.x?

Mask answered 24/2, 2011 at 3:34 Comment(1)
This would be ideal. Lunatic Python I have audited for this purpose. But it is a non-starter, since you can't control the VM and bad lua code can crash python. But I'll look at LuaPy. Thanks.Uteutensil
M
4

I don't know of anything that really solves this problem yet.

I think the absolute simplest thing you could do would be to write your own version of the python virtual machine in python.

I've often thought of doing that in something like Cython so you could just import it as a module, and you could lean on the existing runtime for most of the hard bits.

You may already be able to generate a python-in-python interpreter with PyPy, but PyPy's output is a runtime that does EVERYTHING, including implementing the equivalent of the underlying PyObjects for built-in types and all that, and I think that's overkill for this kind of thing.

All you really need is something that works like a Frame in the execution stack, and then a method for each opcode. I don't think you even have to implement it yourself. You could just write a module that exposed the existing frame objects to the runtime.

Anyway, then you just maintain your own stack of frame objects and handle the bytecodes, and you can throttle it with bytecodes per second or whatever.
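
As a rough illustration of the shape described here (a frame object, one method per opcode, and a throttle in the dispatch loop), a toy stack machine might look like the sketch below. The instruction set is invented for the example and is not CPython bytecode; `Frame`, `TinyVM` and `CycleLimitExceeded` are hypothetical names.

```python
class CycleLimitExceeded(Exception):
    """Raised when a program executes more instructions than allowed."""

class Frame:
    """State of one running program: code, program counter, stack, variables."""
    def __init__(self, code, variables):
        self.code = code
        self.pc = 0
        self.stack = []
        self.variables = variables

class TinyVM:
    """Toy VM with an invented instruction set (an assumption, not CPython's)."""

    def __init__(self, max_cycles):
        self.max_cycles = max_cycles
        self.cycles = 0

    def run(self, frame):
        while frame.pc < len(frame.code):
            self.cycles += 1                      # throttle: one cycle per opcode
            if self.cycles > self.max_cycles:
                raise CycleLimitExceeded("too many instructions")
            op, arg = frame.code[frame.pc]
            frame.pc += 1
            getattr(self, "op_" + op)(frame, arg)  # one method per opcode
        return frame.variables

    def op_PUSH(self, frame, arg):
        frame.stack.append(arg)

    def op_LOAD(self, frame, arg):
        frame.stack.append(frame.variables[arg])

    def op_STORE(self, frame, arg):
        frame.variables[arg] = frame.stack.pop()

    def op_MUL(self, frame, arg):
        right, left = frame.stack.pop(), frame.stack.pop()
        frame.stack.append(left * right)

    def op_LT(self, frame, arg):
        right, left = frame.stack.pop(), frame.stack.pop()
        frame.stack.append(left < right)

    def op_JUMP(self, frame, arg):
        frame.pc = arg

    def op_JUMP_IF_FALSE(self, frame, arg):
        if not frame.stack.pop():
            frame.pc = arg

# cost = (10 if price < 10 else price) * 1.2345678
PROGRAM = [
    ("LOAD", "price"),       # 0
    ("PUSH", 10),            # 1
    ("LT", None),            # 2
    ("JUMP_IF_FALSE", 6),    # 3
    ("PUSH", 10),            # 4
    ("JUMP", 7),             # 5
    ("LOAD", "price"),       # 6
    ("PUSH", 1.2345678),     # 7
    ("MUL", None),           # 8
    ("STORE", "cost"),       # 9
]

result = TinyVM(max_cycles=100).run(Frame(PROGRAM, {"price": 5}))
```

Since the only operations available are the `op_` methods, the script has no route to imports or file access, and the cycle count does not depend on wall-clock time.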

Montagnard answered 24/2, 2011 at 4:0 Comment(1)
Now that is an interesting idea. I hadn't thought about using an existing VM spec, and leveraging the tools already targeting that. Good idea. I'll have a play and see what the implication might be.Uteutensil
R
2

I've used Python as a "mini config language" for an earlier project. My approach was to take the code, parse it using the parser module, and then walk the AST of the generated code and kick out "unallowed" operations (e.g. defining classes, calling __ methods, etc.).

After doing this, I created a synthetic environment with only the modules and variables that were "allowed" and evaluated the code within that to get something I could run.

It worked nicely for me. I don't know if it's bulletproof, especially if you want to give your users more power than I did for a config language.
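
A minimal sketch of that whitelist-then-evaluate idea is below. It uses the standard ast module in place of the parser module (parser was removed in Python 3.10), and the particular whitelist, along with the names `check` and `run_config`, are assumptions for illustration rather than the code from my project. As noted above, treat it as a starting point and not as a proven sandbox.

```python
import ast

# Node types a config script may contain; everything else is rejected.
ALLOWED_NODES = (
    ast.Module, ast.Assign, ast.If, ast.Expr,
    ast.Name, ast.Load, ast.Store, ast.Constant,
    ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.Compare, ast.Lt, ast.Gt, ast.Eq,
)

def check(source):
    """Hypothetical validator: parse source, raise ValueError on anything
    that is not on the whitelist, and return the tree otherwise."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError("not allowed: " + type(node).__name__)
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder names are not allowed")
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError("only numeric constants are allowed")
    return tree

def run_config(source, allowed_vars):
    """Evaluate a validated script in a synthetic environment."""
    tree = check(source)
    env = dict(allowed_vars)
    env["__builtins__"] = {}          # no builtins visible to the script
    exec(compile(tree, "<config>", "exec"), env)
    del env["__builtins__"]
    return env
```

The whitelist deliberately omits calls, attribute access, imports, loops, and function or class definitions, which is roughly how tight I was able to be.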

As for the time limit, you could run your program in a separate thread or process and terminate it after a fixed amount of time.
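
A sketch of the separate-process variant using only the standard library follows. The helper name `run_with_timeout` is hypothetical, and the child here is a full, unrestricted Python interpreter, so this shows only the time limit and not the lockdown. Note that it measures wall-clock time, not cycles.

```python
import subprocess
import sys

def run_with_timeout(source, seconds):
    """Hypothetical helper: run source in a child Python process.

    Returns (finished, stdout). If the time limit is hit, the child is
    killed and (False, "") is returned.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising this.
        return False, ""
    return True, result.stdout
```

Calling `run_with_timeout("while True: pass", 1)` should come back after about a second with `finished` set to False, at which point the host can notify the script's owner.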

Raffish answered 24/2, 2011 at 3:45 Comment(2)
I could see that working, but I'd be worried that I wouldn't know enough to know if I'd missed a leaky construct. It may even be that there is no syntactic way to determine such a thing. Sounds dangerous. As for running in a thread, yes, that would be the fallback approach, I think. Ideally I'd like it not to be influenced by how much load is on the server, so cycles would be better than absolute time.Uteutensil
That possibility exists. I was in a situation where I could be really tight about things. No variables except the ones I was injecting, no function calls etc.Raffish
C
1

Why not python code in pysandbox http://pypi.python.org/pypi/pysandbox/1.0.3 ?

Celestecelestia answered 24/2, 2011 at 0:37 Comment(1)
Because it fails my second criterion, as far as I can tell. And it may fail criterion 1 too. At least, there is more doubt than I'd like about the security of pysandbox. And finally, you can segfault it, which is a non-starter for me.Uteutensil
A
1

Take a look at LimPy. It stands for Limited Python and was built for exactly this purpose.

There was an environment where users needed to write basic logic to control a user experience. I don't know how it'll interact with runtime limits, but I imagine you can do it if you're willing to write a little code.

Apocopate answered 24/2, 2011 at 1:1 Comment(1)
Thanks for the suggestion, LimPy is a neat project. Unfortunately it is only a Python-subset parser, rather than a language. It has no execution semantics. Parsing a DSL is pretty easy. Tools like ANTLR, and even my own Sparkle (github.com/idmillington/sparkle) parser generator in python make it a snap. Running the parsed code is the hard bit!Uteutensil
D
-1

The simplest way to make a real DSL is with ANTLR; it has syntax templates for some popular languages.

Deify answered 24/2, 2011 at 0:42 Comment(3)
I don't want to be a jerk, but I did say in the question that I could and have implemented a DSL so far, but was looking for a more mature extant system.Uteutensil
@Ian, I meant, you don't have to reinvent it, just take some example code. I only offered the simplest way I know.Deify
Yes, sorry, I really didn't mean to be a jerk, I realise you were being helpful :) I know ANTLR, and it is a good way to build a DSL. But that wasn't quite what I was asking.Uteutensil
