It is possible to compile a version of pypy-c that runs fully “virtualized”, i.e. where an external process controls all input/output. Such a pypy-c is a secure sandbox: it is safe to run any untrusted Python code with it. The Python code cannot see or modify any local file except via interaction with the external process. It is also impossible to do any other I/O or consume more than some amount of RAM or CPU time or real time. This works with no OS support at all - just ANSI C code generated in a careful way. It’s the kind of thing you could embed in a browser plug-in, for example (it would be safe even if it wasn’t run as a separate process, actually).
For comparison, trying to plug CPython into a special virtualizing C library is not only OS-specific, but unsafe, because one of the known ways to segfault CPython could be used by an attacker to trick CPython into issuing malicious system calls directly. The C code generated by PyPy is not segfaultable, as long as our code generators are correct - that’s a lower number of lines of code to trust. For the paranoid, in this case we also generate systematic run-time checks against buffer overflows.
The hard work from the PyPy side is done: you get a fully secure version. What remains experimental and unpolished is the library for using this sandboxed PyPy from a regular Python interpreter (CPython, or an unsandboxed PyPy). Contributions welcome.
One of PyPy’s translation aspects is a sandboxing feature. It’s “sandboxing” as in “full virtualization”, but done in normal C with no OS support at all. It’s a two-process model: we can translate PyPy to a special “pypy-c-sandbox” executable, which is safe in the sense that it doesn’t do any library or system calls. Instead, whenever it would like to perform such an operation, it marshals the operation name and the arguments to its stdout and waits for the marshalled result on its stdin. This pypy-c-sandbox process is meant to be run by an outer “controller” program that answers these operation requests.
The pypy-c-sandbox program is obtained by adding a transformation during translation, which turns all RPython-level external function calls into stubs that do the marshalling/waiting/unmarshalling. An attacker that tries to escape the sandbox is stuck within a C program that contains no external function calls at all except for writing to stdout and reading from stdin. (It’s still attackable in theory, e.g. by exploiting segfault-like situations, but as explained in the introduction we think that PyPy is rather safe against such attacks.)
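The request/reply cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual wire format or the generated stub code: the `(opname, args)` request tuple, the `(ok, value)` reply tuple, the function names, and the `os_getcwd` handler are all hypothetical. The sketch also reports an unknown operation back as an error, which is a simplification chosen here.

```python
import marshal


def encode_request(opname, *args):
    """What a stub would write to stdout instead of making the real call."""
    return marshal.dumps((opname, args))


def decode_reply(data):
    """What a stub would read back from stdin (hypothetical reply layout)."""
    ok, value = marshal.loads(data)
    if not ok:
        raise OSError(value)
    return value


# Hypothetical controller-side handlers: the controller alone decides
# what the sandboxed program gets to see.
def do_os_getcwd():
    return "/tmp"


HANDLERS = {"os_getcwd": do_os_getcwd}


def handle_request(data):
    """Controller side: unmarshal the request, answer it, marshal the reply."""
    opname, args = marshal.loads(data)
    handler = HANDLERS.get(opname)
    if handler is None:
        return marshal.dumps((False, "unsupported operation: %s" % opname))
    return marshal.dumps((True, handler(*args)))


def sandboxed_call(opname, *args):
    # In the two-process model the request travels over the subprocess'
    # stdout and the reply over its stdin; here both sides run in one
    # process, so the bytes are simply passed along.
    return decode_reply(handle_request(encode_request(opname, *args)))
```

With these definitions, `sandboxed_call("os_getcwd")` returns whatever the controller chooses to answer, and the calling side never touches the operating system itself.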
The outer controller is a plain Python program that can run in CPython or a regular PyPy. It can perform any virtualization it likes, by giving the subprocess any custom view on its world. For example, while the subprocess thinks it’s using file handles, in reality the numbers are created by the controller process and so they need not be (and probably should not be) real OS-level file handles at all. In the demo controller I’ve implemented there is simply a mapping from numbers to file-like objects. The controller answers the “os_open” operation by translating the requested path to some file or file-like object in some virtual and completely custom directory hierarchy. The file-like object is put in the mapping with any unused number >= 3 as a key, and that number is returned to the subprocess. The “os_read” operation works by mapping the pseudo file handle given by the subprocess back to a file-like object in the controller, and reading from the file-like object.
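The mapping from pseudo file handles to file-like objects can be sketched as follows. This is a minimal illustration and not the demo controller's code: the class name `VirtualFiles`, the method names, and the dictionary used as the virtual directory hierarchy are assumptions, and `do_os_close` is added here only to show that numbers can be reused.

```python
import errno
import io


class VirtualFiles:
    """Hypothetical controller-side table of pseudo file handles."""

    def __init__(self, tree):
        self.tree = tree        # virtual path -> file contents (bytes)
        self.open_files = {}    # pseudo file handle -> file-like object

    def _unused_handle(self):
        # Any unused number >= 3 will do; 0, 1 and 2 are left alone.
        fd = 3
        while fd in self.open_files:
            fd += 1
        return fd

    def do_os_open(self, path):
        # Translate the requested path inside the virtual hierarchy only.
        if path not in self.tree:
            raise OSError(errno.ENOENT, "no such virtual file", path)
        fd = self._unused_handle()
        self.open_files[fd] = io.BytesIO(self.tree[path])
        return fd

    def do_os_read(self, fd, size):
        # Map the pseudo handle back to the file-like object and read.
        if fd not in self.open_files:
            raise OSError(errno.EBADF, "bad pseudo file handle")
        return self.open_files[fd].read(size)

    def do_os_close(self, fd):
        if fd not in self.open_files:
            raise OSError(errno.EBADF, "bad pseudo file handle")
        self.open_files.pop(fd).close()
```

Because the handles are plain dictionary keys, nothing the subprocess sends can refer to a real OS-level file descriptor of the controller.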
Translating an RPython program with sandboxing enabled also uses a special flag that enables all sorts of C-level assertions against index-out-of-bounds accesses.
By the way, as you should have realized, it’s really independent from the fact that it’s PyPy that we are translating. Any RPython program should do. I’ve successfully tried it on the JS interpreter. The controller is only called “pypy_interact” because it emulates a file hierarchy that makes pypy-c-sandbox happy - it contains (read-only) virtual directories like /bin/lib/pypy1.2/lib-python and /bin/lib/pypy1.2/lib_pypy and it pretends that the executable is /bin/pypy-c.
../../rpython/bin/rpython -O2 --sandbox targetpypystandalone.py
Translation is faster when run with a regular PyPy, so you should install one if you don’t have it; failing that, you can also run python translate.py instead.
To run it, use the tools in the pypy/sandbox directory:
./pypy_interact.py /some/path/pypy-c-sandbox [args...]
Just like with pypy-c, if you pass no argument you get the interactive prompt. In theory it’s impossible to do anything bad or read a random file on the machine from this prompt. To pass a script as an argument you need to put it in a directory along with all its dependencies, and ask pypy_interact to export this directory (read-only) to the subprocess’ virtual /tmp directory with the --tmp=DIR option. Example:
mkdir myexported
cp script.py myexported/
./pypy_interact.py --tmp=myexported /some/path/pypy-c-sandbox /tmp/script.py
This is safe to do even if script.py comes from some random untrusted source, e.g. if these steps are performed by an HTTP server on code it receives.
To limit the heap size used by the subprocess, pass the --heapsize=N option to pypy_interact.py. You can also limit the running time (measured as real time) with the --timeout=N option.
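To make the idea of a real-time limit concrete, here is a small sketch of how an outer process can enforce one on a child process. It is not how pypy_interact.py implements --timeout; the function name `run_with_timeout` is hypothetical and only the standard `subprocess` module is used.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_seconds):
    """Run argv; kill it if it is still alive after timeout_seconds.

    Returns the exit code, or None if the child had to be killed.
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The limit is on elapsed (real) time, so a child that merely
        # sleeps is killed just like one that spins the CPU.
        proc.kill()
        proc.wait()
        return None
```

For instance, `run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)` kills the sleeping child and returns `None`, even though that child used almost no CPU time.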
Not all operations are supported; e.g. if you type os.readlink('...'), the controller crashes with an exception and the subprocess is killed. Other operations make the subprocess die directly with a “Fatal RPython error”. None of this is a security hole; it just means that if you try to run some random program, it risks getting killed depending on the Python built-in functions it tries to call. This is a matter of the sandboxing layer being incomplete so far, but it should not really be a problem in practice.