The biggest difference in performance will come when you are doing widely spaced random accesses to a large region of memory -- where "large" means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).
To make things more complex, the number of TLB entries for 4kB pages is often larger than the number of entries for 2MB pages, but this varies a lot by processor. There is also a lot of variation in how many "large page" entries are available in the Level 2 TLB.
For example, on an AMD Opteron Family 10h Revision D ("Istanbul") system, cpuid reports:
- L1 DTLB: 4kB pages: 48 entries; 2MB pages: 48 entries; 1GB pages: 48 entries
- L2 TLB: 4kB pages: 512 entries; 2MB pages: 128 entries; 1GB pages: 16 entries
While on an Intel Xeon 56xx ("Westmere") system, cpuid reports:
- L1 DTLB: 4kB pages: 64 entries; 2MB pages: 32 entries
- L2 TLB: 4kB pages: 512 entries; 2MB pages: none
Both can map 2MB (512*4kB) using small pages before suffering level 2 TLB misses, while the Westmere system can map 64MB using its 32 2MB TLB entries and the AMD system can map 352MB using the 176 2MB TLB entries in its L1 and L2 TLBs. Either system will get a significant speedup by using large pages for random accesses over memory ranges that are much larger than 2MB and less than 64MB. The AMD system should continue to show good performance using large pages for much larger memory ranges.
What you are trying to avoid in all these cases is the worst case (note 1) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (note 2) work, it requires:
- 5 trips to memory to load data mapped on a 4kB page,
- 4 trips to memory to load data mapped on a 2MB page, and
- 3 trips to memory to load data mapped on a 1GB page.
In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information.
The best description I have seen is in Section 5.3 of AMD's "AMD64 Architecture Programmer’s Manual Volume 2: System Programming" (publication 24593) http://support.amd.com/us/Embedded_TechDocs/24593.pdf
Note 1: The figures above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.
Note 2: Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.