Why doesn't the JVM emit prefetch instructions on Windows x86
As the title states, why doesn't the OpenJDK JVM emit prefetch instructions on Windows x86? See OpenJDK Mercurial @ http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c49dcaf78a65/src/os_cpu/windows_x86/vm/prefetch_windows_x86.inline.hpp

inline void Prefetch::read (void *loc, intx interval) {}
inline void Prefetch::write(void *loc, intx interval) {}

There are no comments and I've found no other resources besides the source code. I am asking because it does so for Linux x86, see http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c49dcaf78a65/src/os_cpu/linux_x86/vm/prefetch_linux_x86.inline.hpp

inline void Prefetch::read (void *loc, intx interval) {
#ifdef AMD64
  __asm__ ("prefetcht0 (%0,%1,1)" : : "r" (loc), "r" (interval));
#endif // AMD64
}

inline void Prefetch::write(void *loc, intx interval) {
#ifdef AMD64

  // Do not use the 3dnow prefetchw instruction.  It isn't supported on em64t.
  //  __asm__ ("prefetchw (%0,%1,1)" : : "r" (loc), "r" (interval));
  __asm__ ("prefetcht0 (%0,%1,1)" : : "r" (loc), "r" (interval));

#endif // AMD64
}
Seller answered 4/6, 2017 at 14:34
There is also a prefetch for Solaris x86_64: vm/solaris_x86_64.il github.com/openjdk-mirror/jdk7u-hotspot/blob/…; but none of the listed prefetches are about emitting prefetch instructions — they are prefetches used by the JVM HotSpot native code itself. Emitting prefetches into generated (JITted) code is in the x86 code for all OSes: github.com/openjdk-mirror/jdk7u-hotspot/blob/… LIR_Assembler::prefetchr / LIR_Assembler::prefetchwPave
Thanks, that explains at least some things. Maybe add this as an answer and I will accept it. I am still looking for the part where the JVM decides to insert prefetch instructions.Seller

The files you cited all contain inline-assembly fragments used by the JVM's own C/C++ code (as apangin, the JVM expert, pointed out, mostly in GC code). And there is indeed a difference: the Linux, Solaris and BSD variants of x86_64 HotSpot implement these prefetch helpers, while Windows leaves them empty. Why is partly strange and partly unexplained; it may make the JVM a bit slower on Windows (a few percent, more on platforms without hardware prefetch), though that hardly helps sell more Solaris paid support contracts for Sun/Oracle. Ross also guessed that GNU inline-asm syntax is not supported by the MS C++ compiler, but the portable _mm_prefetch intrinsic should work (who will open a JDK bug to add it to the file?).

HotSpot is a JIT, and the JITted code is emitted (generated) by the JIT as raw bytes (while it is possible for a JIT to copy code from its own functions into generated code, or to emit calls to support functions, in HotSpot prefetches are emitted as bytes). How can we find where they are emitted? A simple online way is to find a searchable copy of jdk8u (or better, a cross-reference like metager), for example on GitHub: https://github.com/JetBrains/jdk8u_hotspot, and search for prefetch, prefetch emit, prefetchr, or lir_prefetchr. There are some relevant results:

The actual bytes are emitted by the JVM's C1 compiler / LIR path, in jdk8u_hotspot/src/cpu/x86/vm/assembler_x86.cpp:

void Assembler::prefetch_prefix(Address src) {
  prefix(src);
  emit_int8(0x0F);
}

void Assembler::prefetchnta(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rax, src); // 0, src
}

void Assembler::prefetchr(Address src) {
  assert(VM_Version::supports_3dnow_prefetch(), "must support");
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x0D);
  emit_operand(rax, src); // 0, src
}

void Assembler::prefetcht0(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rcx, src); // 1, src
}

void Assembler::prefetcht1(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rdx, src); // 2, src
}

void Assembler::prefetcht2(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rbx, src); // 3, src
}

void Assembler::prefetchw(Address src) {
  assert(VM_Version::supports_3dnow_prefetch(), "must support");
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x0D);
  emit_operand(rcx, src); // 1, src
}

Usage in c1 LIR: src/share/vm/c1/c1_LIRAssembler.cpp

void LIR_Assembler::emit_op1(LIR_Op1* op) {
  switch (op->code()) { 
...
    case lir_prefetchr:
      prefetchr(op->in_opr());
      break;

    case lir_prefetchw:
      prefetchw(op->in_opr());
      break;

Now we know the opcodes lir_prefetchr and lir_prefetchw and can search for them (e.g. in an OpenGrok cross-reference) to find the only place they are created, in src/share/vm/c1/c1_LIR.cpp:

void LIR_List::prefetch(LIR_Address* addr, bool is_store) {
  append(new LIR_Op1(
            is_store ? lir_prefetchw : lir_prefetchr,
            LIR_OprFact::address(addr)));
}

There is another place where prefetch instructions are defined (for C2, as noted by apangin): src/cpu/x86/vm/x86_64.ad:

// Prefetch instructions. ...
instruct prefetchr( memory mem ) %{
  predicate(ReadPrefetchInstr==3);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHR $mem\t# Prefetch into level 1 cache" %}
  ins_encode %{
    __ prefetchr($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrNTA( memory mem ) %{
  predicate(ReadPrefetchInstr==0);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch into non-temporal cache for read" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrT0( memory mem ) %{
  predicate(ReadPrefetchInstr==1);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHT0 $mem\t# prefetch into L1 and L2 caches for read" %}
  ins_encode %{
    __ prefetcht0($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrT2( memory mem ) %{
  predicate(ReadPrefetchInstr==2);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHT2 $mem\t# prefetch into L2 caches for read" %}
  ins_encode %{
    __ prefetcht2($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchwNTA( memory mem ) %{
  match(PrefetchWrite mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch to non-temporal cache for write" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

// Prefetch instructions for allocation.

instruct prefetchAlloc( memory mem ) %{
  predicate(AllocatePrefetchInstr==3);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHW $mem\t# Prefetch allocation into level 1 cache and mark modified" %}
  ins_encode %{
    __ prefetchw($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocNTA( memory mem ) %{
  predicate(AllocatePrefetchInstr==0);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch allocation to non-temporal cache for write" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocT0( memory mem ) %{
  predicate(AllocatePrefetchInstr==1);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHT0 $mem\t# Prefetch allocation to level 1 and 2 caches for write" %}
  ins_encode %{
    __ prefetcht0($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocT2( memory mem ) %{
  predicate(AllocatePrefetchInstr==2);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHT2 $mem\t# Prefetch allocation to level 2 cache for write" %}
  ins_encode %{
    __ prefetcht2($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}
Pave answered 4/6, 2017 at 16:21
One of the more interesting parts where the JVM actually decides whether to prefetch or not is here github.com/JetBrains/jdk8u_hotspot/blob/…Seller
I am actually working on a scientific paper that includes sentences like "the JVM JIT does prefetching". As there are no real papers about the internals of the JVM, I just have to dig in to find evidence even though it's common knowledge. Academia just doesn't work that way :)Seller
naze, I can't find how PrefetchAllocationNode is lowered to real opcodes, and it has some strange ABIO mark on it. Probably you need to compile the JVM/JDK locally so that all generated files exist, and then search the full code (probably with some C++ cross-reference tool; but be aware of non-C++ files like asm and ad, which are not indexed by cross-reference tools, only by grep).Pave
After some hours of searching I still don't know where to look for the decision whether or not to prefetch, e.g. for array accesses. I just don't get the structure of the OpenJDK at all. Thanks for your help though. Will add a comment if I find anything.Seller
"make JVM bit (<1%) slower" - where does this estimate come from? The comment from the time when prefetch was first implemented says that it can speed up mark-sweep GC by up to 2x (at least on SPARC).Vanbuskirk
Referring to C1 LIR is not too useful, since hot methods are usually compiled with C2. The actual rules for emitting prefetch instructions in JITted code are listed in the x86_64 architecture definition file.Vanbuskirk
@Vanbuskirk That comment you linked doesn't refer to the prefetch code the original poster included. The prefetch code that comment refers to is for SPARC processors, and the speedups mentioned are for the now long-obsolete UltraSPARC II and III CPUs. Unlike modern Intel x86 CPUs, it looks like these old SPARC CPUs didn't have hardware data prefetching, so using software prefetching instead could easily make a huge difference in performance.Scarab
My guess is that the reason why the inline assembly prefetch instructions aren't used on Windows is because Microsoft C++ compiler doesn't support GNU inline assembly syntax, and doesn't support inline assembly at all on AMD64 targets. The code should be changed to use the portable _mm_prefetch intrinsic instead.Scarab
@RossRidge Yes, I know. I explicitly mentioned SPARC. I just wondered where <1% estimate for Windows comes from.Vanbuskirk
@Vanbuskirk And I explained why. The code referred to that this answer estimates "may also make JVM bit (<1%) slower" is for x86-64 CPUs all of which use hardware data prefetching and wouldn't get any where near the same benefit from the software prefetching that was implemented only for SPARC CPUs in the code referred to in JDK-4453409. The comments made in JDK-4453409 are irrelevant to this question and answer.Scarab
@RossRidge JDK-4453409 does not say that the optimization was implemented only on SPARC, though the performance improvement on other platforms, of course, differs. Another paper describes the software prefetching effect on GC on other 64-bit CPUs, and it is also far more than 1%. So I would be happy to see any reference regarding your claim.Vanbuskirk
@Vanbuskirk JDK-4453409 lists the CPUs affected as "generic, sparc". Any other CPUs affected should have been listed along side "sparc". Also JDK-4453409 was created over a year before GCC supported x86-64 targets and the code in the question could have been written. The paper you linked does not say the software prefetching alone provides a performance improvement. It says that "Combining edge ordered enqueuing with software prefetching yields average performance improvements [... of] 4-6% of total application performance [...]" Software prefetching alone "is unproductive in isolation".Scarab

As JDK-4453409 indicates, prefetching was implemented in the HotSpot JVM in JDK 1.4 to speed up GC. That was more than 15 years ago; no one will remember now why it was not implemented on Windows. My guess is that Visual Studio (which has always been used to build HotSpot on Windows) simply didn't understand the prefetch instruction back then. Looks like a place for improvement.

Anyway, the code you've asked about is used internally by the JVM garbage collector. It is not what the JIT generates. The C2 JIT code generator rules are in the architecture definition file x86_64.ad, and there are rules to translate PrefetchRead, PrefetchWrite and PrefetchAllocation nodes to the corresponding x64 instructions.

An interesting fact is that PrefetchRead and PrefetchWrite nodes are not created anywhere in the code. They exist only to support the Unsafe.prefetchX intrinsics, and they were removed in JDK 9.

The only case when JIT generates prefetch instruction is PrefetchAllocation node. You can verify with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly that PREFETCHNTA is indeed generated after object allocation, both on Linux and Windows.

import java.util.Arrays;

class Test {
    public static void main(String[] args) {
        byte[] b = new byte[0];
        for (;;) {
            b = Arrays.copyOf(b, b.length + 1);
        }
    }
}

java.exe -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Test

# {method} {0x00000000176124e0} 'main' '([Ljava/lang/String;)V' in 'Test'
  ...
  0x000000000340e512: cmp    $0x100000,%r11d
  0x000000000340e519: ja     0x000000000340e60f
  0x000000000340e51f: movslq 0x24(%rsp),%r10
  0x000000000340e524: add    $0x1,%r10
  0x000000000340e528: add    $0x17,%r10
  0x000000000340e52c: mov    %r10,%r8
  0x000000000340e52f: and    $0xfffffffffffffff8,%r8
  0x000000000340e533: cmp    $0x100000,%r11d
  0x000000000340e53a: ja     0x000000000340e496
  0x000000000340e540: mov    0x60(%r15),%rbp
  0x000000000340e544: mov    %rbp,%r9
  0x000000000340e547: add    %r8,%r9
  0x000000000340e54a: cmp    0x70(%r15),%r9
  0x000000000340e54e: jae    0x000000000340e496
  0x000000000340e554: mov    %r9,0x60(%r15)
  0x000000000340e558: prefetchnta 0xc0(%r9)
  0x000000000340e560: movq   $0x1,0x0(%rbp)
  0x000000000340e568: prefetchnta 0x100(%r9)
  0x000000000340e570: movl   $0x200000f5,0x8(%rbp)  ;   {metadata({type array byte})}
  0x000000000340e577: mov    %r11d,0xc(%rbp)
  0x000000000340e57b: prefetchnta 0x140(%r9)
  0x000000000340e583: prefetchnta 0x180(%r9)    ;*newarray
                                                ; - java.util.Arrays::copyOf@1 (line 3236)
                                                ; - Test::main@9 (line 9)
Vanbuskirk answered 4/6, 2017 at 18:20 Comment(4)
+1 for finding out that prefetching is only used in the context of allocations. I would have guessed that prefetching is also done when iterating over an existing array. It seems my assumption was wrong. Thanks for clarifyingSeller
@naze, there will be prefetching when iterating over an array, but it is hardware prefetch, not software prefetch. You can turn it off and measure to find its effect on Intel: software.intel.com/en-us/articles/… (with wrmsr -p N 0x1a4 for every core); "0x1A0 bits 9 and 19 were used for this in older processor models" - stackoverflow.com/a/36339469. Intel's hardware prefetcher is aggressive but limited to 4 KB pages: if it catches two memory accesses A and B with ptrdiff N=B-A, and B+N is in the same 4 KB page, it prefetches.Pave
@Pave I already knew about hardware prefetching, I just wondered about software prefetching in the jvm. Thanks anywaySeller
@naze: On modern CPUs with good HW prefetchers, there's typically no benefit to using software prefetch for sequential access. IvyBridge and later can even prefetch across 4k page boundaries. (source: bottom of this answer: https://mcmap.net/q/831351/-when-should-we-use-prefetch). Prefetch instructions take time to run, so they can slow your code down in cases where it wasn't actually memory bottlenecked. (Or on Intel IvyBridge, prefetch instructions can have very bad throughput like one per 43 cycles according to Agner Fog's tables, and be a bottleneck themselves.)Mead