Are Smalltalk bytecode optimizations worth the effort?

Consider the following method in the Juicer class:

Juicer >> juiceOf: aString
    | fruit juice |
    fruit := self gather: aString.
    juice := self extractJuiceFrom: fruit.
    ^juice withoutSeeds

It compiles to the following bytecodes:

25 self                     ; 1
26 pushTemp: 0              ; 2
27 send: gather:
28 popIntoTemp: 1           ; 3
29 self                     ; 4
30 pushTemp: 1              ; 5
31 send: extractJuiceFrom:
32 popIntoTemp: 2           ; 6 <-
33 pushTemp: 2              ; 7 <-
34 send: withoutSeeds
35 returnTop

Now note that 32 and 33 cancel out, leaving a single store:

25 self                     ; 1
26 pushTemp: 0              ; 2
27 send: gather:
28 popIntoTemp: 1           ; 3 *
29 self                     ; 4 *
30 pushTemp: 1              ; 5 *
31 send: extractJuiceFrom:
32 storeIntoTemp: 2         ; 6 <-
33 send: withoutSeeds
34 returnTop

Next consider bytecodes 28, 29 and 30. They insert self below the result of the gather: send. The same stack configuration could have been achieved by pushing self before sending the first message:

25 self                     ; 1 <-
26 self                     ; 2
27 pushTemp: 0              ; 3
28 send: gather:
29 popIntoTemp: 1           ; 4 <-
30 pushTemp: 1              ; 5 <-
31 send: extractJuiceFrom:
32 storeIntoTemp: 2         ; 6
33 send: withoutSeeds
34 returnTop

Now cancel out 29 and 30 in the same way:

25 self                     ; 1
26 self                     ; 2
27 pushTemp: 0              ; 3
28 send: gather:
29 storeIntoTemp: 1         ; 4 <-
30 send: extractJuiceFrom:
31 storeIntoTemp: 2         ; 5
32 send: withoutSeeds
33 returnTop

Temporaries 1 and 2 are written but never read. So, except when debugging, the stores could be skipped, leading to:

25 self                     ; 1
26 self                     ; 2
27 pushTemp: 0              ; 3
28 send: gather:
29 send: extractJuiceFrom:
30 send: withoutSeeds
31 returnTop

This last version, which saves 4 out of 7 stack operations, corresponds to the less expressive and less clear source:

Juicer >> juiceOf: aString
    ^(self extractJuiceFrom: (self gather: aString)) withoutSeeds
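
The pop/push cancellation above could be applied mechanically by a peephole pass. Below is a minimal sketch in Smalltalk, assuming a hypothetical reified bytecode representation in which each bytecode answers #selector and #argument; SymbolicBytecode is a made-up class, not Pharo's actual compiler API:

PeepholeOptimizer >> rewritePopPush: bytecodes
    "Replace each popIntoTemp: n immediately followed by pushTemp: n
     with a single storeIntoTemp: n, leaving everything else untouched.
     SymbolicBytecode and its #selector / #argument accessors are made up."
    | result i current next |
    result := OrderedCollection new.
    i := 1.
    [i <= bytecodes size] whileTrue: [
        current := bytecodes at: i.
        next := i < bytecodes size
            ifTrue: [bytecodes at: i + 1]
            ifFalse: [nil].
        (next notNil
            and: [current selector == #popIntoTemp:
            and: [next selector == #pushTemp:
            and: [current argument = next argument]]])
                ifTrue: [
                    result add: (SymbolicBytecode
                        selector: #storeIntoTemp:
                        argument: current argument).
                    i := i + 2]
                ifFalse: [
                    result add: current.
                    i := i + 1]].
    ^result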

Note also that there are other possible optimizations that Pharo (I haven't checked Squeak) does not implement (e.g., jump chaining). These optimizations would encourage the Smalltalk programmer to better express their intentions without having to pay the cost of additional computations.
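
To make jump chaining concrete (the offsets below are made up for illustration): when the target of a jump is itself an unconditional jump, the first jump can be retargeted directly, and the intermediate jump dropped if nothing else reaches it:

40 jumpTo: 50
...
50 jumpTo: 60

becomes

40 jumpTo: 60
...
50 jumpTo: 60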

My question is whether these improvements are an illusion or not. Concretely, are bytecode optimizations absent from Pharo/Squeak because they are known to have little relevance, or are they regarded as beneficial but haven't been addressed yet?

EDIT

An interesting advantage of a register+stack architecture [cf. A Smalltalk Virtual Machine Architectural Model by Allen Wirfs-Brock and Pat Caudill] is that the additional space provided by registers makes the manipulation of bytecodes easier for the sake of optimization. Of course, even though these kinds of optimizations are not as relevant as method inlining or polymorphic inline caches, as pointed out in the answer below, they shouldn't be disregarded, especially when combined with others implemented by the JIT compiler. Another interesting topic to analyze is whether destructive optimization (i.e., optimization that requires de-optimization to support the debugger) is actually necessary, or whether sufficient performance gains can be attained by non-destructive techniques.
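
For illustration only, a register-oriented encoding of juiceOf: might look like this (the mnemonics are made up; see the Wirfs-Brock and Caudill paper for a concrete design):

r1 := self send: #gather: with: arg0
r2 := self send: #extractJuiceFrom: with: r1
r3 := r2 send: #withoutSeeds
return r3

Because every intermediate value has a name, a rewriting pass can move or delete instructions without tracking stack depths, which is precisely what makes rewriting stack bytecode delicate.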

Oneiromancy answered 20/1, 2015 at 11:43 Comment(0)

The main annoyance when you start playing with such optimizations is the debugger interface.

Historically, and still currently in Squeak, the debugger simulates execution at the bytecode level and needs to map each bytecode back to the corresponding Smalltalk instruction.

So I think the gain was considered too low to justify the added complexity, or, even worse, a degradation of the debugging facilities.

Pharo wants to change the debugger to operate at a higher level (the Abstract Syntax Tree), but I don't know how they will map that back to the bytecode, which is all the VM knows about.

IMO, this kind of optimization might be better implemented in the JIT compiler, which transforms bytecode into native machine code.

EDIT

The greatest gains are obtained by eliminating the sends themselves (by inlining), because sends are much more expensive (roughly 10x) than stack operations: about ten times more bytecodes than sends are executed per second when you run 1 tinyBenchmarks (Cog VM).
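
You can reproduce this measurement yourself by evaluating the standard micro-benchmark in a workspace; the exact figures depend on your machine and VM:

1 tinyBenchmarks
"Answers a string of the form '... bytecodes/sec; ... sends/sec'.
 On a Cog VM the first rate is roughly an order of magnitude higher
 than the second, which is the ratio referred to above."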

Interestingly, such optimizations could take place in the Smalltalk image, but only on hot spots detected by the VM, as in the SISTA effort. See for example https://clementbera.wordpress.com/2014/01/22/the-sista-chronicles-iii-an-intermediate-representation-for-optimizations/

So, in the light of SISTA, the answer is rather: interesting, not yet addressed, but actively studied (and work in progress)!

All the machinery for de-optimizing when a method has to be debugged is still one of the difficult points, as I understand it.

Carthy answered 20/1, 2015 at 21:28 Comment(4)
I agree that the gain may not justify the complexity required to implement the optimizations. Note, however, that you could still keep the stores into temps when debugging and just skip them when interpreting. You also make a good point about the jitter. However, the jitter admits other optimizations, and implementing all of them in the same place could be a bit inconvenient. Many thanks for responding. - Oneiromancy
@LeandroCaniglia the gain starts to be interesting when also optimizing the sends, which are more expensive than stack ops - see my edit. - Carthy
@LeandroCaniglia about debugging, imagine I execute the optimized method, but an unhandled exception occurs because #extractJuiceFrom: sends #press to the fruit and press is not understood. How to display the values of the temps fruit and juice is not that obvious, and what happens if I assign one of the temps another object, etc.? - Carthy
You are right (even though juice wouldn't be a problem, because it wouldn't exist anyway if the press is not there ;-) - Oneiromancy

I think that a broader question is worth answering: are bytecodes worth the effort? Bytecodes were thought of as a compact and portable representation of code that is close to the target machine. As such, they are relatively easy to interpret, but slow to execute.

Bytecodes do not excel at either of these games, which usually makes them not the best choice if you want to write either an interpreter or a fast VM. On one hand, AST nodes are far easier to interpret (only a few node types vs. lots of different bytecodes). On the other hand, with the advent of JIT compilers, it became clear that running native code instead is not only possible but also much faster.
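
To see why AST interpretation is simple, here is a minimal sketch of evaluating a message node by double dispatch; MessageNode and its accessors are hypothetical, not Pharo's actual AST protocol:

MessageNode >> evaluateIn: anEnvironment
    "Recursively evaluate the receiver and the argument nodes,
     then perform the send. Every other node type (literal, variable,
     assignment, ...) gets a similarly small evaluateIn: method."
    | rcvr args |
    rcvr := receiver evaluateIn: anEnvironment.
    args := arguments collect: [:each | each evaluateIn: anEnvironment].
    ^rcvr perform: selector withArguments: args asArray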

If you look at the most efficient VM implementations of JavaScript (arguably the most modern compilers of today) and of Java (HotSpot, Graal), you'll see they all use a tiered compilation scheme. Methods are initially interpreted from the AST and only jitted when they become a hot spot.

At the highest tiers of compilation there are no bytecodes. The key component in a compiler is its intermediate representation, and bytecodes do not fulfill the required properties. The most optimizable IRs are much more fine-grained: they are in SSA form and allow specific representation of registers and memory. This allows for much better code analysis and optimization.
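
As a made-up illustration of what such an IR buys you: in SSA form every variable is assigned exactly once and merge points are explicit, so def-use information is immediate. A method that presses a fruit differently depending on ripeness might lower to:

t1 := send #isRipe to: fruit
branchIf: t1 then: L1 else: L2
L1: t2 := send #pressGently to: fruit
    jump: L3
L2: t3 := send #press to: fruit
    jump: L3
L3: t4 := phi(t2, t3)
    return t4

An optimizer can see at a glance that t2 and t3 each have a single definition, something stack bytecode reveals only after reconstructing the stack state at every program point.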

Then again, if you are interested in portable code, there isn't anything more portable than the AST. Besides, it's easier and more practical to implement AST-based debuggers and profilers than bytecode-based ones. The only remaining problem is compactness, but in any case you can implement something like ast-codes (encoded ASTs, similar to bytecodes but representing the tree).

On the other hand, if you want full speed, then you'll go for a JIT with a good IR and no bytecodes. I think that bytecodes don't fill many gaps in today's VMs; they remain mostly for backwards compatibility (and there are also many examples of hardware architectures that directly execute Java bytecodes).

There are also some cool experiments with the Cog VM related to bytecodes. From what I understand, they transform the bytecode into another IR for optimizing and then convert back to bytecodes. I'm not sure whether there's a technical gain in that last conversion besides reusing the original JIT architecture, or whether there actually is any optimization at the bytecode level.

Cicatrize answered 16/3, 2015 at 18:31 Comment(8)
This is a very interesting topic. In a project I'm working on, we've taken an intermediate approach by which we reified the bytecodes (using PetitParser). Our compiled methods still have bytecodes (bytes), but our dev tools and the jitter use their reified version. In addition, the Smalltalk compiler emits reified bytecodes that later on translate themselves into bytes. - Oneiromancy
Being in the compiler and jitter business myself, I can only confirm that. The jitter actually has to analyse the bytecode and reconstruct a lot of information which got lost in the bytecode encoding (loops, common subexpressions, constant propagation, etc.) in order to generate better code. - Jaime
I'd like to see some references to support this answer. Bytecodes don't smash the cache the way ASTs do; traversing an AST is, for all intents and purposes, about as cache-efficient as traversing a linked list of nodes. No evidence indicates ASTs are more portable than bytecode, as Smalltalk, the original implementation of Tripos (which used BCPL to compile to "O-Code"), and the UCSD P-System demonstrated. That all production-ready VMs use bytecodes or an equivalent (i.e., (in)direct-threaded code representations, often used in compilers for smaller architectures) is quite telling. - Rovit
I will concede that JIT engines may benefit from ASTs, especially if using Destination-Driven Code Generation techniques. That said, in the past, and by hand, I've taken RPN code, converted it into SSA form, and optimized the resulting code over many small, simple passes. While inconvenient, it's not impossible, nor is the finished result (in my estimation at least) any less valuable than just compiling from an AST to begin with. My goal with my exploration in this space is to eventually improve the run-time performance of my Forth compilers for RISC-V, migrating away from direct threading. - Rovit
@SamuelA.FalvoII, what you say about portability is 100% true; I just didn't take the time to better describe what I meant, which was related to ease of implementation. My experience is that implementing AST interpreters is much easier than implementing bytecode interpreters. Regarding cache smashing, it just doesn't matter, because bytecodes are replaced by internal IRs in the third tier (V8, SpiderMonkey, JavaScriptCore, HotSpot, Graal). The bytecodes are only used for code that is never executed frequently, which is why I think they are not worth optimizing. - Cicatrize
And just to add a bit more to the subject: there currently seems to be a tension in IR design between more linear IRs and graph-based ones (sea of nodes). While the former have faster bytecode-to-native-code times (and are uglier, to me), the latter seem to find more optimizations, as they unify control flow and data flow. This is a bit like AST vs. bytecode, but at a much lower level. - Cicatrize
Would you consider an AST a linear IR? I would not, knowing that a tree is a kind of graph. However, from the analogy you gave, it seems like you (or perhaps the compiler industry) do. With that assumption in mind, can you provide some links to resources to learn more about graph-based IRs? I'd be interested in learning more about them. Thanks! - Rovit
No, I meant the converse: ASTs are not linear at all. At the low level, linear IRs would be more like bytecodes, and sea-of-nodes more like ASTs. I think the seminal work on graph-based IRs was done by Cliff Click (the creator of the HotSpot JIT) in his PhD thesis: Combining Analyses, Combining Optimizations, 1995 - citeseerx.ist.psu.edu/viewdoc/… . Starting from there and going through citations is the most useful thing to do; I never found a compiler book that explains sea-of-nodes in detail. The appendix has many implementation details. - Cicatrize
