Foreword
Background
I worked at a Java shop. I spent entire months dedicated to running performance tests on distributed systems, the main apps being in Java. Some of them involved products developed and sold by Sun itself (later Oracle).
I will go over the lessons I learned: some history about the JVM, a bit about the internals, a couple of parameters explained, and finally some tuning. I'll try to keep it to the point so you can apply it in practice.
Things are changing fast in the Java world, so part of this might already be outdated since I last did all of that. (Is Java 10 out already?)
Good Practices
What you SHOULD do: benchmark, Benchmark, BENCHMARK!
When you really need to know about performance, you have to run real benchmarks, specific to your workload. There is no alternative.
Also, you should monitor the JVM. Enable monitoring. Good applications usually provide a monitoring web page and/or an API. Otherwise there is the common Java tooling (JVisualVM, JMX, hprof, and some JVM flags).
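As a minimal sketch of that built-in tooling: the same heap numbers that JVisualVM displays over JMX can be read in-process through the standard `java.lang.management` API. The class name `HeapStats` is invented for this example:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical example class: prints the heap figures that JMX-based
// tools like JVisualVM display, using the standard platform MXBean.
public class HeapStats {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.println("used=" + heap.getUsed());           // bytes occupied right now (live objects + garbage)
        System.out.println("committed=" + heap.getCommitted()); // bytes currently reserved from the OS
        System.out.println("max=" + heap.getMax());             // the -Xmx ceiling
    }
}
```

Run it with `java HeapStats`; a remote JMX console queries the exact same bean.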
Be aware that there is usually no performance to gain by tuning the JVM. It's more a question of "to crash or not to crash": finding the transition point. It's about knowing that when you give a certain amount of resources to your application, you can consistently expect a certain amount of performance in return. Knowledge is power.
Performance is mostly dictated by your application. If you want it faster, you gotta write better code.
What you WILL do most of the time: Live with reliable sensible defaults
We don't have time to optimize and tune every single application out there. Most of the time we'll simply live with sensible defaults.
The first thing to do when configuring a new application is to read the documentation. Most serious applications come with a guide for performance tuning, including advice on JVM settings.
Then you can configure the application: JAVA_OPTS: -server -Xms???g -Xmx???g
-server
: enable full optimizations (this flag is the default on most JVMs nowadays)
-Xms / -Xmx
: set the minimum and maximum heap size (always the same value for both; that's about the only optimization to do).
Well done, you now know all the optimization parameters there are to know about the JVM. Congratulations, that was simple :D
What you SHALL NOT do, EVER:
Please do NOT copy random strings of flags you found on the internet, especially when they span multiple lines like this:
-server -Xms1g -Xmx1g -XX:PermSize=1g -XX:MaxPermSize=256m -Xmn256m -Xss64k -XX:SurvivorRatio=30 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=10 -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark -XX:+PrintGCDateStamps -verbose:gc -XX:+PrintGCDetails -Dsun.net.inetaddr.ttl=5 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=`date`.hprof -Dcom.sun.management.jmxremote.port=5616 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -server -Xms2g -Xmx2g -XX:MaxPermSize=256m -XX:NewRatio=1 -XX:+UseConcMarkSweepGC
For instance, this thing found on the first page of Google results is plain terrible. There are arguments specified multiple times with conflicting values. Some just force the JVM defaults (possibly the defaults from two JVM versions ago). A few are obsolete and simply ignored. And finally, at least one parameter is so invalid that it will consistently crash the JVM at startup by its mere existence.
Actual tuning
How do you choose the memory size:
Read the guide from your application, it should give some indication. Monitor production and adjust afterwards. Perform some benchmarks if you need accuracy.
Important Note: The Java process will take up to the max heap PLUS about 10%. That ~10% overhead is the heap management and JVM internals, not included in the heap itself.
All the heap is usually reserved by the process at startup, so you may see the process using the max heap ALL THE TIME. That figure is misleading: you need Java monitoring tools to see what is really being used.
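To see the gap for yourself, here is a quick sketch using the standard `Runtime` API (the class name is invented) contrasting the heap ceiling with actual usage:

```java
// Hypothetical example: contrasts the heap ceiling (what `top` tends to show
// once the heap is preallocated) with what the heap actually uses.
public class HeapUsage {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                // the -Xmx ceiling
        long committed = rt.totalMemory();        // reserved from the OS so far
        long used = committed - rt.freeMemory();  // actually occupied right now
        System.out.println("maxMB=" + (max >> 20));
        System.out.println("committedMB=" + (committed >> 20));
        System.out.println("usedMB=" + (used >> 20));
    }
}
```

On a freshly started JVM, `usedMB` is typically a small fraction of `maxMB`, even though the OS may report the process near the ceiling.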
Finding the right size:
- If it crashes with an OutOfMemoryError, it ain't enough memory
- If it doesn't crash with an OutOfMemoryError, it's too much memory
- If it's too much memory BUT the hardware's got it and/or it's already paid for, it's the perfect number, job done!
JVM6 is bronze, JVM7 is gold, JVM8 is platinum...
The JVM is forever improving. Garbage collection is a very complex thing and there are a lot of very smart people working on it. It has seen tremendous improvements over the past decade and it will continue to do so.
For informational purposes: there are at least 4 garbage collectors available in Oracle Java 7-8 (HotSpot) and OpenJDK 7-8. (Other JVMs may be entirely different, e.g. Android, IBM, embedded):
- SerialGC
- ParallelGC
- ConcurrentMarkSweepGC
- G1GC
- (plus variants and settings)
[Starting from Java 7 onward, the Oracle and OpenJDK code is partially shared, so the GC should be (mostly) the same on both platforms.]
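If you are curious which of those collectors your JVM actually picked for a given run, a small sketch (class name invented here) can list them via the standard management beans:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical example: lists the garbage collectors selected for this run.
// HotSpot typically reports a pair, one for the young and one for the old
// generation, e.g. "PS Scavenge" / "PS MarkSweep" for ParallelGC.
public class WhichGC {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName());
        }
    }
}
```

The names vary by JVM version and platform, which is exactly the point: the defaults are chosen for you.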
JVMs >= 7 have many optimizations and pick decent defaults, which vary a bit by platform. They balance multiple factors, for instance deciding whether to enable multicore optimizations depending on how many cores the CPU has. You should let them do it. Do not change or force GC settings.
It's okay to let the computer take decisions for you (that's what computers are for). It's better to have the JVM settings be 95%-optimal all the time than to force an "always 8-core aggressive collection for lower pause times" on all the boxes, half of which turn out to be t2.small in the end.
Exception: When the application comes with a performance guide and specific tuning in place. It's perfectly okay to leave the provided settings as is.
Tip: Moving to a newer JVM to benefit from the latest improvements can sometimes provide a good boost without much effort.
Special Case: -XX:+UseCompressedOops
The JVM has a special setting that makes it use 32-bit indexes internally (read: pointer-like). That allows addressing 4 294 967 296 objects * 8-byte alignment => 32 GB of heap. (NOT to be confused with the 4 GB address space of REAL 32-bit pointers.)
It reduces the overall memory consumption with a potential positive impact on all caching levels.
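The arithmetic behind that 32 GB limit is easy to check; here is a tiny sketch (class name invented) using the default 8-byte object alignment:

```java
// Hypothetical example: a 32-bit compressed oop can index 2^32 slots,
// and each slot covers 8 bytes with the default HotSpot object alignment.
public class CompressedOopMath {
    public static void main(String[] args) {
        long slots = 1L << 32;       // 2^32 distinct 32-bit indexes
        long alignment = 8;          // default object alignment, in bytes
        long addressable = slots * alignment;
        System.out.println(addressable);        // 34359738368 bytes
        System.out.println(addressable >> 30);  // 32 GiB
    }
}
```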
Real life example: the ElasticSearch documentation states that a node running a 32 GB heap with 32-bit oops may be equivalent to a 40 GB heap with 64-bit oops, in terms of actual data kept in memory.
A note on history: the flag was known to be unstable in the pre-Java-7 era (maybe even pre-Java-6). It has been working perfectly in newer JVMs for a while.
Java HotSpot™ Virtual Machine Performance Enhancements
[...] In Java SE 7, use of compressed oops is the default for 64-bit JVM processes when -Xmx isn't specified and for values of -Xmx less than 32 gigabytes. For JDK 6 before the 6u23 release, use the -XX:+UseCompressedOops flag with the java command to enable the feature.
See? Once again the JVM is light years ahead of manual tuning. Still, it's interesting to know about it =)
Special Case: -XX:+UseNUMA
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Source: Wikipedia
Modern systems have extremely complex memory architectures with multiple layers of memory and caches, both private and shared, across cores and CPUs.
Quite obviously, accessing data in the current processor's L2 cache is A LOT faster than having to go all the way to a memory stick attached to another socket.
I believe that all multi-socket systems sold today are NUMA by design, while consumer systems are NOT. On Linux, check whether your server supports NUMA with the command numactl --show.
The NUMA-aware flag tells the JVM to optimize memory allocations for the underlying hardware topology.
The performance boost can be substantial (i.e. two digits: +XX%). In fact, someone switching from a "NOT-NUMA 10 CPU 100 GB" box to a "NUMA 40 CPU 400 GB" box might experience a [dramatic] loss in performance if they don't know about the flag.
Note: There are discussions about detecting NUMA and setting the flag automatically in the JVM: http://openjdk.java.net/jeps/163
Bonus: Any application intending to run on big fat hardware (i.e. NUMA) needs to be optimized for it. This is not specific to Java applications.
Toward the future: -XX:+UseG1GC
The latest improvement in Garbage Collection is the G1 collector (read: Garbage First).
It is intended for many-core, high-memory systems, at the absolute minimum 4 cores + 6 GB of memory. It is targeted toward databases and memory-intensive applications using 10 times that and beyond.
Short version: at these sizes, the traditional GCs face too much data to process at once and pauses get out of hand. G1 splits the heap into many small sections that can be managed independently and in parallel while the application is running.
The first version was available in 2013. It is mature enough for production now, but it will not become the default anytime soon. It is worth a try for large applications.
Do not touch: Generation Sizes (NewGen, PermGen...)
The GC splits the memory into multiple sections. (Not getting into details here; you can google "Java GC generations".)
The last time, I spent a week trying 20 different combinations of generation flags on an app taking 10000 hits/s. I got a magnificent boost ranging from -1% to +1%.
Java GC generations are an interesting topic to read papers on, or to write one about. They are not a thing to tune, unless you're part of the 1% who can devote substantial time for negligible gains, within the 1% of people who really need optimizations.
Conclusion
Hope this can help you. Have fun with the JVM.
Java is the best language and the best platform in the world! Go spread the love :D