I think, you should check the MACHINE_CLEARS.SMC
performance counter (part of MACHINE_CLEARS
event) of the CPU (it is available in Sandy Bridge 1, which is used in your Air powerbook; and also available on your Xeon, which is Nehalem 2 - search "smc"). You can use oprofile
, perf
or Intel's Vtune
to find its value:
http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-F0FD7660-58B5-4B5D-AA9A-E1AF21DDCA0E.htm
Machine Clears
Metric Description
Certain events require the entire pipeline to be cleared and restarted from just after the last retired instruction. This metric measures three such events: memory ordering violations, self-modifying code, and certain loads to illegal address ranges.
Possible Issues
A significant portion of execution time is spent handling machine clears. Examine the MACHINE_CLEARS events to determine the specific cause.
SMC: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/win_reference/snb/events/machine_clears.html
MACHINE_CLEARS Event Code: 0xC3
SMC Mask: 0x04
Self-modifying code (SMC) detected.
Number of self-modifying-code machine clears detected.
Intel also says about smc http://software.intel.com/en-us/forums/topic/345561 (linked from Intel Performance Bottleneck Analyzer's taxonomy
This event fires when self-modifying code is detected. This can be typically used by folks who do binary editing to force it to take certain path (e.g. hackers). This event counts the number of times that a program writes to a code section. Self-modifying code causes a severe penalty in all Intel 64 and IA-32 processors. The modified cache line is written back to the L2 and LLC caches. Also, the instructions would need to be re-loaded hence causing performance penalty.
I think, you will see some such events. If they are, then CPU was able to detect act of self-modifying the code and raised the "Machine Clear" - full restart of pipeline. First stages are Fetch and they will ask L2 cache for new opcode. I'm very interested in the exact count of SMC events per execution of your code - this will give us some estimate about latencies.. (SMC is counted in some units where 1 unit is assumed to be 1.5 cpu cycles - B.6.2.6 of intel optimization manual)
We can see that Intel says "restarted from just after the last retired instruction.", so I think last retired instruction will be mov
; and your nops are already in the pipeline. But SMC will be raised at mov's retirement and it will kill everything in pipeline, including nops.
This SMC induced pipeline restart is not cheap, Agner has some measurements in the Optimizing_assembly.pdf - "17.10 Self-modifying code (All processors)" (I think any Core2/CoreiX is like PM here):
The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache.
...
Self-modifying code is not considered good programming practice. It should be used only if
the gain in speed is substantial and the modified code is executed so many times that the
advantage outweighs the penalties for using self-modifying code.
Usage of different linear addresses to fail SMC detector was recommended here:
https://mcmap.net/q/14534/-how-is-x86-instruction-cache-synchronized - I'll try to find actual intel documentation... Can't actually answer to your real question now.
There may be some hints here: Optimization manual, 248966-026, April 2012 "3.6.9 Mixing Code and Data":
Placing writable data in the code segment might be impossible to distinguish
from self-modifying code. Writable data in the code segment might suffer the
same performance penalty as self-modifying code.
and next section
Software should avoid writing to a code page in the same 1-KByte subpage that is
being executed or fetching code in the same 2-KByte subpage of that is being
written. In addition, sharing a page containing directly or speculatively executed
code with another processor as a data page can trigger an SMC condition that causes
the entire pipeline of the machine and the trace cache to be cleared. This is due to the
self-modifying code condition.
So, there is possibly some schematics which controls intersections of writable and executable subpages.
You can try to do modification from the other thread (cross-modifying code) -- but the very careful thread synchronization and pipeline flushing is needed (you may want to include some brute-forcing of delays in writer thread; CPUID just after the synchronization is desired). But you should know that THEY already fixed this using "nukes" - check US6857064 patent.
I'm a little confused about the manual mentioning P6 and Pentium processors
This is possible if you had fetched, decoded and executed some stale version of intel's instruction manual. You can reset the pipeline and check this version: Order Number: 325462-047US, June 2013 "11.6 SELF-MODIFYING CODE". This version still not says anything about newer CPUs, but mentions that when you are modifying using different virtual addresses, the behavior may be not compatible between microarchitectures (it may work on your Nehalem/Sandy Bridge and may not work on .. Skymont)
11.6 SELF-MODIFYING CODE
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.
In practice, the check on linear addresses should not create compatibility problems among IA-32 processors. Applications that include self-modifying code use the same linear address for modifying and fetching the instruction.
Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue. (See Section 8.1.3, “Handling Self- and Cross-Modifying Code,” for more information about the use of self-modifying code.)
For Intel486 processors, a write to an instruction in the cache will modify it in both the cache and memory, but if the instruction was prefetched before the write, the old version of the instruction could be the one executed. To prevent the old instruction from being executed, flush the instruction prefetch unit by coding a jump instruction immediately after any write that modifies an instruction
REAL Update, googled for "SMC Detection" (with quotes) and there are some details how modern Core2/Core iX detects SMC and also many errata lists with Xeons and Pentiums hanging in SMC detector:
http://www.google.com/patents/US6237088 System and method for tracking in-flight instructions in a pipeline @ 2001
DOI 10.1535/itj.1203.03 (google for it, there is free version at citeseerx.ist.psu.edu) - the "INCLUSION FILTER" was added in Penryn to lower number of false SMC detections; the "existing inclusion detection mechanism" is pictured on Fig 9
http://www.google.com/patents/US6405307 - older patent on SMC detection logic
According to patent US6237088 (FIG5, summary) there is "Line address buffer" (with many linear addresses one address per fetched instruction -- or in other word the buffer full of fetched IPs with cache-line precision). Every store, or more exact "store address" phase of every store will be feed into parallel comparator to check, will store intersects to any of currently executing instructions or not.
Both patents don't clearly say, will they use physical or logical address in SMC logic... L1i in Sandy bridge is VIPT (Virtually indexed, physically tagged, virtual address for the index and physical address in the tag. ) according to http://nick-black.com/dankwiki/index.php/Sandy_Bridge so we have the physical address at time when L1 cache returns data. I think intel may use physical addresses in SMC detection logic.
Even more, http://www.google.com/patents/US6594734 @ 1999 (published 2003, just remember that CPU design cycle is around 3-5 years) says in the "Summary" section that SMC now is in TLB and uses physical addresses (or in other word - please, don't try to fool SMC detector):
Self modifying code is detected using a translation lookaside buffer .. [which] has physical page addresses stored therein over which snoops can be performed using the physical memory address of a store into memory. ... To provide finer granularity than a page of addresses, FINE HIT bits are included with each entry in the cache associating information in the cache to portions of a page within memory.
(portion of page, referred to as quadrants in the patent US6594734, sounds like 1K subpages, isn't it?)
Then they says
Therefore snoops, triggered by store instructions into memory, can perform SMC detection by comparing the physical address of all instructions stored within the instruction cache with the address of all instructions stored within the associated page or pages of memory. If there is an address match, it indicates that a memory location was modified. In the case of an address match, indicating an SMC condition, the instruction cache and instruction pipeline are flushed by the retirement unit and new instructions are fetched from memory for storage into the instruction cache.
Because snoops for SMC detection are physical and the ITLB ordinarily accepts as an input a linear address to translate into a physical address, the ITLB is additionally formed as a content-addressable memory on the physical addresses and includes an additional input comparison port (referred to as a snoop port or reverse translation port)
-- So, to detect SMC, they force the stores to forward physical address back to instruction buffer via snoop (similar snoops will be delivered from other cores/cpus or from DMA writes to our caches....), if snoop's phys. address conflicts with cache lines, stored in instruction buffer, we will restart pipeline via SMC signal delivered from iTLB to retirement unit. Can imagine how much cpu clocks will be wasted in such snoop loop from dTLB via iTLB and to retirement unit (it can't retire next "nop" instruction, although it was executed early than mov and has no side effects). But WAT? ITLB has physical address input and second CAM (big and hot) just to support and defend against crazy and cheating self-modifying code.
PS: And what if we will work with huge pages (4M or may be 1G)? The L1TLB has huge page entries, and there may be a lot of false SMC detects for 1/4 of 4 MB page...
PPS: There is a variant, that the erroneous handling of SMC with different linear addresses was present only in early P6/Ppro/P2...