Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.
The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.
The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? Is so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer - PDF document lookup, brute force testing, math proof, or whatever other method you used to determine the answer.
This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html
Update:
I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.
// Compile with:
// gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.
#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;
unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));
void* thread1(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n1[mask]++;
x = (v4si){0,0,0,0};
}
return NULL;
}
void* thread2(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n2[mask]++;
x = (v4si){-1,-1,-1,-1};
}
return NULL;
}
int main() {
// Check memory alignment
if ( (((uintptr_t)&x) & 0x0f) != 0 )
abort();
memset(n1, 0, sizeof(n1));
memset(n2, 0, sizeof(n2));
pthread_t t1, t2;
pthread_create(&t1, NULL, thread1, NULL);
pthread_create(&t2, NULL, thread2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
for (unsigned i=0; i<16; i++) {
for (int j=3; j>=0; j--)
printf("%d", (i>>j)&1);
printf(" %10u %10u", n1[i], n2[i]);
if(i>0 && i<0x0f) {
if(n1[i] || n2[i])
printf(" Not a single memory access!");
}
printf("\n");
}
return 0;
}
The CPU I have in my notebook is Core Duo (not Core2). This particular CPU fails the test, it implements 16-byte memory read/writes with a granularity of 8 bytes. The output is:
0000 96905702 10512
0001 0 0
0010 0 0
0011 22 12924 Not a single memory access!
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 3092557 1175 Not a single memory access!
1101 0 0
1110 0 0
1111 1719 99975389
FLD m80
on Intel Haswell is 4 total uops: 2 ALU and 2 load-port. So it might well be implemented internally as a 64b load and a 16b load. 80bit FP is not a performance priority. 80bit FP stores take 7 uops, with a throughput of one per 5 cycles. – Nonparous