Memory barriers and the TLB

Memory barriers guarantee that the data cache will be consistent. However, do they guarantee that the TLB will be consistent?

I am seeing a problem where the JVM (Java 7 update 1) sometimes crashes with memory errors (SIGBUS, SIGSEGV) when passing a MappedByteBuffer between threads.

e.g.

final AtomicReference<MappedByteBuffer> mbbQueue = new AtomicReference<>();

// in a background thread:
MappedByteBuffer map = raf.map(MapMode.READ_WRITE, offset, allocationSize);
Thread.yield();
// spin until the main thread has taken the previous buffer
while (!mbbQueue.compareAndSet(null, map));


// in the main thread (more than 10x faster than calling map() in the same thread):
MappedByteBuffer mbb = mbbQueue.getAndSet(null);

Without the Thread.yield() I occasionally get crashes in force(), put(), and C's memcpy() all indicating I am trying to access memory illegally. With the Thread.yield() I haven't had a problem, but that doesn't sound like a reliable solution.

Has anyone come across this problem? Are there any guarantees about TLB and memory barriers?


EDIT: The OS is CentOS 5.7. I have seen the behaviour on an i7 and on dual-Xeon machines.

Why do I do this? Because the average time to write a message is 35-100 ns depending on length, and using a plain write() isn't as fast. If I memory-map and clean up in the current thread, this takes 50-130 microseconds; using a background thread to do it takes about 3-5 microseconds for the main thread to swap buffers. Why do I need to be swapping buffers at all? Because I am writing many GB of data and a ByteBuffer cannot be 2+ GB in size.
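The 2 GB limit comes from ByteBuffer's int-based indexing (positions are capped at Integer.MAX_VALUE), so a multi-GB file has to be covered by a series of mappings. A minimal sketch of that chunking, assuming illustrative file and chunk sizes (a production chunk would be closer to 1 GB):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.util.ArrayList;
import java.util.List;

public class ChunkedMapper {
    /** Cover [0, fileSize) with READ_WRITE mappings of at most chunkSize bytes each. */
    static List<MappedByteBuffer> mapChunks(FileChannel ch, long fileSize, long chunkSize)
            throws Exception {
        List<MappedByteBuffer> chunks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += chunkSize) {
            long size = Math.min(chunkSize, fileSize - offset);
            chunks.add(ch.map(MapMode.READ_WRITE, offset, size));
        }
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("huge", ".dat");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(10 << 20); // 10 MB stand-in for a multi-GB file
            // 4 MB chunks for the demo; in production this would be ~1 GB
            List<MappedByteBuffer> chunks = mapChunks(raf.getChannel(), 10 << 20, 4 << 20);
            System.out.println(chunks.size()); // 4 MB + 4 MB + 2 MB = 3 chunks
        }
    }
}
```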

Undry answered 30/11, 2011 at 12:11 Comment(12)
Peter, would you mind specifying the details of the OS and CPU models/configuration, in case this is pertinent? – Ambrosius
@aix, good suggestion. It could matter. – Undry
Have you tried with an older JDK? There are changes to the unmapper used in the DirectBuffer cleaner in JDK 7. You may also want to try removing the cleaner call you make, just to see if you are in some strange race condition with whatever else may be working with that PhantomReference. – Seducer
The reason the cleaner is there is to avoid exhausting the virtual memory of the box. The buffer will be cleaned on a GC, but since the application is not producing much garbage, the machine runs out of virtual memory first and the application dies. Trying an older JDK is a good idea. – Undry
@PeterLawrey: on the cleaner, it could be that you are racing the ReferenceHandler thread if it gets kicked off by the GC. If you are running out of memory without the clean, maybe you want to take the lock in Reference to keep the GC from setting the ReferenceHandler in motion while you are cleaning. Not exactly sure how this all plays together (Cleaner.add/remove are synchronized, so I don't really see how a race would unfold here), but manually running cleaners has high potential for interesting errors like the one you're having. – Seducer
@philwb, Cleaners are careful to execute only once, and Java actually contains code that invokes them "manually", besides in the ReferenceHandler. So it cannot be that. Invoking the cleaner still carries the risk of a race where the mapped memory is actually in use, though. Btw, the lack of a normal unmap is a real issue and it doesn't have an easy solution, as the mapped ByteBuffer's address can be in use by native code. Using locks would kill any performance, though. – Grainy
@Grainy, good point on the lock. Agreed that the map access looks like the real culprit. Where else are cleaners manually called? A quick scan of the Java sources only turned up the ReferenceHandler for me. Thanks for the responses; always very enlightening. – Seducer
Look at the sun.nio.ch.FileChannelImpl::unmap static method. It's something like: Cleaner cl = ((DirectBuffer) bb).cleaner(); if (cl != null) cl.clean(); – Grainy
@Peter, did you have any progress with the communication? – Grainy
@bestsss, Sending 16*4 byte packets I get a 99%tile latency of 150 ns or less. The 99.99%tile latency is 2.5 us. – Undry
@Peter, this is quite pretty; so it's actually working? – Grainy
It's working; I am working on making it simple to use. I intend to base the next version of HugeCollections on it. – Undry

The mapping is done via mmap64 (FileChannel.map). When the address is first accessed there will be a page fault and the kernel will read/write the page for you; the TLB doesn't need to be updated during mmap.

The TLBs (of all CPUs) are invalidated during munmap, which is handled by the finalization of the MappedByteBuffer; hence munmap is costly.

Mapping involves a lot of synchronization, so the address value shall not be corrupted.

Any chance you are trying fancy stuff via Unsafe?
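If the mapping is unmapped manually (via the cleaner) while another thread can still reach the buffer, any subsequent put() faults on unmapped memory, which matches the SIGSEGV/SIGBUS symptoms. One defensive pattern, sketched here with hypothetical names, is a reference-counted wrapper that only runs the cleaner once the last user has released the buffer:

```java
import java.nio.MappedByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical wrapper: the mapping is only unmapped (via its cleaner)
 * once every acquire() has been matched by a release().
 */
public class RefCountedMapping {
    private final MappedByteBuffer buffer;
    private final AtomicInteger refs = new AtomicInteger(1); // creator holds one ref

    public RefCountedMapping(MappedByteBuffer buffer) {
        this.buffer = buffer;
    }

    public MappedByteBuffer acquire() {
        int n;
        do {
            n = refs.get();
            if (n == 0) throw new IllegalStateException("already unmapped");
        } while (!refs.compareAndSet(n, n + 1));
        return buffer;
    }

    public void release() {
        if (refs.decrementAndGet() == 0) {
            // Last user is gone: now it is safe to unmap, e.g.
            // ((sun.nio.ch.DirectBuffer) buffer).cleaner().clean();
        }
    }

    public static void main(String[] args) {
        // null stands in for a real mapping in this demo
        RefCountedMapping m = new RefCountedMapping(null);
        m.acquire();  // refs: 2
        m.release();  // refs: 1
        m.release();  // refs: 0 -> unmap would happen here
        System.out.println(m.refs.get());
    }
}
```

The CAS loop in acquire() refuses to resurrect a mapping whose count already hit zero, closing the window where one thread unmaps while another is about to start using the buffer.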

Grainy answered 30/11, 2011 at 17:0 Comment(11)
Using Unsafe would be a next step, assuming I can get this stable. I do call ((DirectBuffer) buffer).cleaner().clean() to clean up the memory without waiting for a GC. ;) – Undry
This is your problem... you are unmapping without all references being gone: a prime reason for SIGSEGV. If you want to do it, wrap the ByteBuffer into something with a reference count and make sure you support it well. – Grainy
Also, a manual unmap might not be super efficient as it flushes the TLB of all CPUs. When performed by the GC, it's usually done in bulk [i.e. multiple cleaners in the queue] and the effect is lesser; of course, when the Cleaner is invoked is another unpleasant story. – Grainy
I ensure there is only one reference to the MappedByteBuffer and only release it after the thread which alters it no longer has a reference. – Undry
The weird part is that you get errors during force(), which translates to fsync. fsync does not flush changes made via mmap; that's msync, and it's not used by FileChannel. Anyway, the only way to get an error during memcpy is accessing unmapped memory or a hardware error; however, you said you use a dual Xeon, which comes with ECC... I guess some other information is missing. – Grainy
I would put this in the I-don't-know-what-I-don't-know category. ;) Using memory-mapped files to send data between processes appears to take about 65-300 ns, which is much faster than sockets over loopback, so I will persist with it. – Undry
So in effect you are using shared memory; do you use /dev/shm as well? – Grainy
I use /tmp, which is mounted as tmpfs, for data I don't need to keep, but most of it I do, so it's on ext4 + SSD. The SSD supports 0.5 GB/s ;) – Undry
I guess I misunderstood which force you meant; MappedByteBuffer.force actually does msync. Are you still sure the memory is mapped by that time? Try to check whether the cleaner has run [cleaner.next == cleaner]. Also make sure you do not use slice/duplicate/read-only views etc. that create a view of the buffer. – Grainy
Thread A creates the ByteBuffer and passes it to Thread B when done. Thread B fills the buffer and passes it back to Thread A. Thread A calls force() then clean(). It fails when Thread B attempts to put() to the buffer or Thread A calls force(), which is why I suspect a thread-based TLB problem. Thread A manages the creation and cleaning of ByteBuffers and Thread B just has to worry about filling them. ;) – Undry
@Peter, btw, you don't need force() before clean() (unmap). You said you are attempting to use it for inter-process communication and you are getting the issues on the writing side only. The TLB shall not matter; it's a cache after all, and page faults are handled by the kernel anyway (which updates the TLB). – Grainy
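The inter-process pattern discussed in the comments, one JVM writing into a mapped file on tmpfs while another polls it, can be sketched as follows. The file location, 4 KB size, and sequence-number framing are all illustrative assumptions (the demo uses a temp file and runs both sides in one process so it is self-contained):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel.MapMode;

public class ShmIpcSketch {
    public static void main(String[] args) throws Exception {
        // In production this file would live on tmpfs, e.g. /dev/shm, so it never hits disk.
        File f = File.createTempFile("ipc", ".dat");
        f.deleteOnExit();

        // Writer side (would be process A)
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            MappedByteBuffer out = raf.getChannel().map(MapMode.READ_WRITE, 0, 4096);
            out.putLong(8, 42L); // payload
            out.putLong(0, 1L);  // sequence number, written last as the "ready" flag
            // NB: plain puts are not ordered across processes; a real implementation
            // needs an ordered/volatile store for the sequence number.
        }

        // Reader side (would be process B, polling the sequence number)
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            MappedByteBuffer in = raf.getChannel().map(MapMode.READ_ONLY, 0, 4096);
            while (in.getLong(0) == 0) Thread.yield(); // busy-poll until published
            System.out.println(in.getLong(8));
        }
    }
}
```

Both sides map the same pages of the kernel page cache, which is why the latency is in the tens to hundreds of nanoseconds rather than the microseconds of a loopback socket.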
