Consider the following method in the Juicer
class:
Juicer >> juiceOf: aString
| fruit juice |
fruit := self gather: aString.
juice := self extractJuiceFrom: fruit.
^juice withoutSeeds
It generates the following bytecodes
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3
29 self ; 4
30 pushTemp: 1 ; 5
31 send: extractJuiceFrom:
32 popIntoTemp: 2 ; 6 <-
33 pushTemp: 2 ; 7 <-
34 send: withoutSeeds
35 returnTop
Now note that 32 and 33 cancel out:
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3 *
29 self ; 4 *
30 pushTemp: 1 ; 5 *
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6 <-
33 send: withoutSeeds
34 returnTop
Next consider 28, 29 and 30. They insert self
below the result of gather
. The same stack configuration could have been achieved by pushing self
before sending the first message:
25 self ; 1 <-
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 popIntoTemp: 1 ; 4 <-
30 pushTemp: 1 ; 5 <-
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6
33 send: withoutSeeds
34 returnTop
Now cancel out 29 and 30
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 storeIntoTemp: 1 ; 4 <-
30 send: extractJuiceFrom:
31 storeIntoTemp: 2 ; 5
32 send: withoutSeeds
33 returnTop
Temporaries 1 and 2 are written but not read. So, except when debugging, they could be skipped leading to:
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 send: extractJuiceFrom:
30 send: withoutSeeds
31 returnTop
This last version, which saves 4 out 7 stack operations, corresponds to the less expressive and clear source:
Juicer >> juiceOf: aString
^(self extractJuiceFrom: (self gather: aString)) withoutSeeds
Note also that there are other possible optimizations that Pharo (I haven't checked Squeak) does not implement (e.g., jump chaining.) These optimizations would encourage the Smalltalk programmer to better express their intentions without having to pay the cost of additional computations.
My question is whether these improvements are an illusion or not. Concretely, are bytecode optimizations absent from Pharo/Squeak because they are known to have little relevance, or are they regarded as beneficial but haven't been addressed yet?
EDIT
An interesting advantage of using a register+stack architecture [cf. A Smalltalk Virtual Machine Architectural Model by Allen Wirfs-Brock and Pat Caudill] is that the additional space provided by registers makes it easier the manipulation of bytecodes for the sake of optimization. Of course, even though these kinds of optimizations are not as relevant as method inlining or polymorphic inline caches, as pointed out in the answer below, they shouldn't be disregarded, especially when combined with others implemented by the JIT compiler. Another interesting topic to analyze is whether destructive optimization (i.e., the one that requires de-optimization for supporting the debugger) is actually necessary or enough performance gains can be attained by non-destructive techniques.