Understanding the internal fragmentation properties of a Hotspot JVM process

I am interested in both on-heap and off-heap allocations. For on-heap allocations, I mean in the context of the three major garbage collectors: CMS, Parallel Old and G1.

What I know (or think I know) so far:

All on-heap object allocations are rounded up to an 8-byte boundary (or a larger power of 2, configured by -XX:ObjectAlignmentInBytes).
G1
- For on-heap allocations smaller than the region size (1 to 32 MB, likely around heap size / 2048) there is no internal fragmentation: the allocator never "fills holes", so it has no need to round sizes up.
- For allocations larger than the region size, it rounds the allocation up to a multiple of the region size. I.e. an allocation of the region size + 1 byte is very unlucky: it wastes almost 50% of the memory.
For CMS, the only relevant information I found is this:

Naturally old space PLABs mimic structure of indexed free list space. Each thread preallocates certain number of chunk of each size below 257 heap words (large chunk allocated from global space).

From http://blog.ragozin.info/2011/11/java-gc-hotspots-cms-promotion-buffers.html. As far as I understand, the "global space" referred to there is the main old space.
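The rounding arithmetic behind the alignment and G1 statements above can be sketched as follows. This is only an illustration of the claims as stated in the question, not code taken from Hotspot; the class and method names (FragmentationMath, alignUp, regionFootprint, wastedFraction) are made up for the example, and the 1 MB region size is an assumed value.

```java
public final class FragmentationMath {

    // Round size up to the next multiple of alignment.
    // alignment must be a power of two (8 by default for object alignment).
    static long alignUp(long size, long alignment) {
        return (size + alignment - 1) & -alignment;
    }

    // Footprint of an allocation larger than one region, assuming (as the
    // question states) that it is rounded up to whole regions.
    static long regionFootprint(long size, long regionSize) {
        long regions = (size + regionSize - 1) / regionSize;
        return regions * regionSize;
    }

    // Fraction of the footprint that is not used by the requested bytes.
    static double wastedFraction(long requested, long footprint) {
        return (double) (footprint - requested) / footprint;
    }

    public static void main(String[] args) {
        // A 17-byte object occupies 24 bytes with 8-byte alignment.
        System.out.println("17 bytes aligned to 8: " + alignUp(17, 8));

        long regionSize = 1L << 20;          // assumed region size: 1 MB
        long requested = regionSize + 1;     // the "unlucky" case
        long footprint = regionFootprint(requested, regionSize);
        System.out.println("footprint: " + footprint + " bytes");
        System.out.println("wasted: " + wastedFraction(requested, footprint));
    }
}
```

With these assumptions, an allocation of region size + 1 byte occupies two regions, so just under half of the footprint is unused.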

Questions:

Are the above statements correct?
What are the fragmentation properties of the main old space in CMS? What about allocations of more than "257 heap words"?
How is the old space managed with Parallel Old GC?
Does Hotspot JVM use the system memory allocator for off-heap allocations, or does it manage that memory itself with a specific allocator?

UPD. A discussion thread: https://groups.google.com/forum/#!topic/mechanical-sympathy/A-RImwuiFZE

As far as I understand, the statements above are correct, although the bit on CMS lacks much of the context needed to interpret it.
CMS is prone to fragmentation in its old space (which is where CMS runs), and this is one of its major flaws. If the old space fragments too much, CMS may occasionally have to stop the world and do a full mark and (sliding) compaction to remove the fragmentation, which causes a long pause in the application. This flaw is often cited as the reason G1 was developed. Some systems (e.g. HBase) purposely do most of their allocations in fixed-size blocks in order to prevent or significantly reduce CMS fragmentation and so avoid long stop-the-world pauses.
ParallelOldGC (or 'Old GC' in general) does not fragment. Objects are tenured to the old heap, and when it runs out of space a full mark and compact cycle is run. It can do this full GC faster than any of the other collectors, but with a typical run time of about 1 second per 2 GB of heap, this can be too long for large heaps or latency-sensitive applications.
Hotspot has used various strategies for off-heap allocation depending on the purpose. Allocating native byte buffers differs from the JVM's own allocation for compiled code or profiling data. I cannot answer with authority on the details, but I assume that much of this does not use the system allocator, or Hotspot would not perform as well as it does. Furthermore, there are parameters one can tune that control some of this space, e.g. -XX:ReservedCodeCacheSize, which suggests that such a region of memory is managed through indirection and not directly via the system allocator. In short, I would be rather surprised if the system allocator were used directly for any fine-grained allocation in Hotspot.
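To illustrate the fixed-size block idea mentioned above, here is a minimal sketch. It is not HBase's actual implementation; ChunkAllocator and its methods are hypothetical names invented for this example. The point is that the garbage collector only ever sees long-lived arrays of a single size, so the holes they leave behind in the old space are all reusable by the next chunk.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: hands out small slices carved from fixed-size chunks,
// so only chunkSize-byte arrays are ever allocated on the heap.
public final class ChunkAllocator {
    private final int chunkSize;
    private ByteBuffer current;      // chunk currently being carved up
    private int chunksAllocated;

    public ChunkAllocator(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    // Returns a buffer of exactly size bytes backed by the current chunk,
    // starting a new chunk when the current one cannot fit the request.
    public ByteBuffer allocate(int size) {
        if (size <= 0 || size > chunkSize) {
            throw new IllegalArgumentException("size must be in 1.." + chunkSize);
        }
        if (current == null || current.remaining() < size) {
            // The tail of the old chunk is abandoned; this is the price paid
            // for keeping every heap allocation the same size.
            current = ByteBuffer.allocate(chunkSize);
            chunksAllocated++;
        }
        ByteBuffer slice = current.slice();
        slice.limit(size);
        current.position(current.position() + size);
        return slice;
    }

    public int chunksAllocated() {
        return chunksAllocated;
    }

    public static void main(String[] args) {
        ChunkAllocator allocator = new ChunkAllocator(1024);
        allocator.allocate(600);
        allocator.allocate(600);   // does not fit in the first chunk
        allocator.allocate(400);   // fits in the rest of the second chunk
        System.out.println("chunks allocated: " + allocator.chunksAllocated());
    }
}
```

Note the trade-off: fragmentation in the collector's free lists is exchanged for internal waste at the tail of each chunk, which the application controls.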
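One way to observe some of these off-heap regions from inside the process is through the standard java.lang.management beans. The sketch below does not show which allocator backs each region; it only reports usage for direct byte buffers and for the non-heap memory pools (which include the code cache). The class name OffHeapInspector is made up for the example, and the pool name "direct" is an assumption about how Hotspot names its direct buffer pool.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.nio.ByteBuffer;

public final class OffHeapInspector {

    // Total capacity of live direct buffers, or -1 if no pool named
    // "direct" is exposed (the name is an assumption about Hotspot).
    static long directCapacity() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getTotalCapacity();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directCapacity();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MB off-heap
        long after = directCapacity();
        System.out.println("direct capacity grew by " + (after - before)
                + " bytes for a buffer of " + buffer.capacity() + " bytes");

        // Non-heap pools managed by the JVM itself, e.g. the code cache.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.NON_HEAP) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    System.out.println(pool.getName() + ": used=" + usage.getUsed()
                            + " committed=" + usage.getCommitted());
                }
            }
        }
    }
}
```

The exact non-heap pool names printed vary between JVM versions, so treat the output as a starting point for investigation rather than a stable interface.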
