If you can get access to a "Westmere"-based system, the performance characteristics of your code should be quite similar to what you see on "Nehalem", but you will have access to a new hardware performance counter event that measures almost exactly what you want.
On Westmere, the best estimate of performance lost while waiting for TLB misses to be handled is probably from the hardware performance counter Event 08H, Mask 04H "DTLB_LOAD_MISSES.WALK_CYCLES", which is described as counting "Cycles Page Miss Handler is busy with a page walk due to a load miss in the Second Level TLB".
This is described in "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2" (document number: 253669), available online at
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html
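As a rough sketch of how the Event 08H, Mask 04H counter might be used: the snippet below assumes you are collecting counts with Linux `perf`, whose raw event descriptors on Intel take the form `rUUEE` (unit mask followed by event number, in hex). The helper names and the sample counts are made up for illustration; substitute the values you actually measure.

```python
def raw_event_code(event: int, umask: int) -> str:
    """Build a perf-style raw event descriptor (rUUEE) from event and unit mask."""
    return f"r{(umask << 8) | event:04x}"


def walk_cycle_fraction(walk_cycles: int, total_cycles: int) -> float:
    """Fraction of all core cycles during which the Page Miss Handler was walking."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return walk_cycles / total_cycles


# DTLB_LOAD_MISSES.WALK_CYCLES on Westmere: Event 08H, Mask 04H.
code = raw_event_code(0x08, 0x04)
print(code)  # r0408

# Hypothetical counts, NOT measurements: replace with your own perf output.
sample_walk_cycles = 1_500_000_000
sample_total_cycles = 10_000_000_000
fraction = walk_cycle_fraction(sample_walk_cycles, sample_total_cycles)
print(f"{fraction:.1%} of cycles spent in load page walks")
```

With that descriptor, something like `perf stat -e r0408 -e cycles ./your_program` should report both counts. Treat the resulting ratio as an upper-bound estimate of the time lost, since some of the walk time may overlap with other useful work.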
The reason this event is necessary is that TLB miss processing time is dominated by the time required to read the cache line containing the page table entry. If that cache line is in the L2 cache, then the overhead of a TLB miss will be very small (on the order of 10 cycles). If the line is in the L3 cache, then maybe 25 cycles. If the line is in memory, then ~200 cycles.
- If there is also a miss in the upper-level page translation caches, it will take multiple trips to memory to find and retrieve the desired page table entry (e.g., https://mcmap.net/q/14525/-in-what-circumstances-can-large-pages-produce-a-speedup).
- On some processors the L2 cache counters can tell you how many table walks hit and missed in the L2, but not on Nehalem. (It would not help a lot in this case since TLB walks that hit in the L3 are also fairly fast and what you really want are the TLB walks that have to go to memory.)
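To make the point about miss counts versus miss cost concrete, here is a small sketch that weights the rough latencies quoted above (10, 25, and ~200 cycles) by where the page table entry is found. The two distributions are invented for illustration, and the model assumes a single memory access per walk, so it ignores the extra trips needed when the upper-level translation caches also miss.

```python
# Approximate per-walk latencies from the discussion above, in cycles.
LATENCY_CYCLES = {"L2": 10, "L3": 25, "memory": 200}


def expected_walk_cost(fractions: dict) -> float:
    """Average cycles per TLB miss, given where the page table entry is found."""
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY_CYCLES[level] * share for level, share in fractions.items())


# Hypothetical distributions, NOT measurements.
mostly_cached = {"L2": 0.90, "L3": 0.08, "memory": 0.02}
mostly_memory = {"L2": 0.10, "L3": 0.10, "memory": 0.80}

print(f"{expected_walk_cost(mostly_cached):.1f} cycles per miss")
print(f"{expected_walk_cost(mostly_memory):.1f} cycles per miss")
```

The same number of TLB misses can cost roughly ten times more in the second case than in the first, which is why a cycle-based event is more useful here than a count of misses.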