[PEAK] Fwd: Adding C generation to bytecode assembler

Sun Aug 20 19:32:12 EDT 2006

Hello PEAKers,

I've been slowly kicking around an idea for a couple years now based
originally on my thought experiments WRT the now rejected PEP 330 (bytecode
verification).  I was pleased to stumble upon peak.util.assembler just
recently and remove 3/4 of my original pseudo-assembler code for Phillip's
much cleaner implementation (most notably his much superior stack tracking
code).  Ironically my code that I removed was adapted from code originally
proposed by Phillip on python-dev in response to PEP 330, code, or ideas,
which I don't doubt in turn inspired to some degree peak.util.assembler.
It's weird how code and ideas flow around and around sometimes before
landing.

The idea I am working on is definitely controvertial, mostly because so many
others are actively working on these ideas in other, much larger projects,
and I'm hesitant to bring it up in public without further proof of concept,
but since I've now based my code on peak.util.assembler I think it's at
least worth putting forth for discussion if at least to find out if Phillip
has essentially written its equivalent this morning before he ate breakast
and is about to beat me to the punch, if he hasn't already. ;)

I've been experimenting with transforming Python bytecode into equivalent C
"CPython" code.  This is not a new or particularly compelling idea, there
have been and are several attempt to do essentially the same thing and then
some.  But I am taking an extremely simplistic (one could say "dirt simple")
aproach, evolved from my work implementing a native code Parrot compiler,
that I think has advantages over the much more complex projects that exist
to date.  Here are some highlights:

  - The obvious removal of the overhead of the inner interpreter, instead of
looping and switching, instructions flow straight from one to the other and
jumps are acomplished with C goto statements.

  - Using as much "compile time" information as possible to remove run-time
checks.
  For example, the argument to COMPARE_OP, like all opargs, is known at
compile time, and thus reduced significantly in the generated C.

  - Experimentally optimizing away the block stack at compile time.

  - Very experimentally optimizing away most data stack traffic into C
variables.

What I am definitey *not* attempting to do is any kind or form of type
inference or optimization or removal of the existing CPython runtime,
compiler, or existing functionality.  My goal is only to remove as much of
the overhead of the existing Python language semantics without defining new
ones (I consider any kind of type inference or reduction to be "new
semantics").   The resultant C extension module that is generated should
have as close to exactly the same execution semantics as the existing
bytecode interpreter, just minus the significant amounts of looping,
switching and per-bytecode run-time checking and other overhead.  My goals
in general are:

  - To embrace as much of the CPython runtime as possible (or inversely,
reimplement as little as possible) while still avoiding the overhead of the
inner interpreter.

  - To determine what are, and only optimize those functions that are worth
it.

  - To let the C compiler have its chance optimizing the actual Python
program, not just the interpreter.

  - To give hardware cache and branch prediction a chance to optimize the
actual Python program, not just the interpreter. (Anecdotaly I read on
python-dev once that the inner interpreter was so small that it could fit
into most processor caches.  Another way of looking at it is that it is so
big that it hogs them!)

  - To provide a means to distribute compiled Python modules without source
or bytecode.

  - To generate clearly readable, hand-optimizable and heavily commented C
code.

The code I'm generating does not try to entirely replace a function that it
compiles, in fact the original function object (and its encapsulated code
object) is required to define the defaults, names, globals, stack
information and area, and (when called) frame that the generated C module
requires to execute.  It replaces *only* the code.co_code object.  I have a
tiny patch that adds a single instance variable to a function to hang the
compiled extension function off of, and I'm using zope.proxy to intercept
the original function's __call__ and call the extension function with the
setup unused frame object created for __call__.   This is only an temporary
solution to providing a true function type clone.

After generating the C code I use distutils to compile it and insert the
newly compiled function into original function object which is then proxied.

My initial crude benchmark results are good, simple loops with math ops have
a 2x increase in speed, with nested loops gaining even more.  This is
without any stack movement reduction yet, so once I implement that I think
the performance will jump even more.

So, this is really just a kind of heads up message, and an initial toe in
the water to see if there is any interest in this idea or other
implementations or projects exploring these ideas out there; like p2c,
Psycho, Pyrex, PyPy, Shed, Parrot, or Starkiller, which are the very broad
results I found in my research on different phases of this topic.

Thanks,

-Michel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.eby-sarna.com/pipermail/peak/attachments/20060820/47638c3a/attachment.html