Do the ARM instructions ldrex/strex have to operate on cache aligned data?
S

3

3

On Intel, the arguments to CMPXCHG must be cache line aligned (since Intel uses MESI to implement CAS).

On ARM, ldrex and strex operate on exclusive reservation granules.

To be clear, does this then mean on ARM the data being operated upon does not have to be cache line aligned?

Sloppy answered 8/7, 2012 at 12:23 Comment(0)
S
1

It says so right in the ARM Architecture Reference Manual A.3.2.1 "Unaligned data access". LDREX and STREX require word alignment, which makes sense, because an unaligned data access can span exclusive reservation granules.

Sorensen answered 8/7, 2012 at 14:4 Comment(12)
I read ERG length is between 8 and 2048 bytes, in multiples of two. If ERG length is say 10 bytes, you would cross the ERG boundary with an aligned access. Is ERG length something other than multiples of two? - Sloppy
word alignment and cache line aligned are two different things - Hegelian
in the x86 world a "word" is 16 bits, two bytes. in arm a "word" is 32 bits, so word aligned means the two lsbits of the address are zero - Hegelian
@BlankXavier A.3.4.3 says that the ERG size is a power of two. - Sorensen
@dwelch: on Intel I cache-line align the CAS targets and pad their cache line so they're not disturbed by other activity nor do they disturb others. On ARM, I was doing the same (force of habit) but then ran into the problem of needing to align against cache-line AND ERG boundary, and wanting to compute that value in a #define (which is impossible) so people could if they wished use the stack for allocation. My concern here really is not normal alignment (e.g. word alignment) but cache-line alignment (a la Intel): is it necessary on ARM? The answer is no (although word alignment is). - Sloppy
@Chen: can you provide a ref to your source? I have this link infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0008a/… where I find "The ERG is implementation defined, in the range 8-2048 bytes, in multiples of two bytes." - Sloppy
@BlankXavier I thought I already gave the reference: A.3.4.3 of the ARM Architecture Reference Manual. "Tagged_address = Memory_address[31:a]. The value of a in this assignment is IMPLEMENTATION DEFINED, between a minimum value of 3 and a maximum value of 11. The size of the tagged memory block called the Exclusives Reservation Granule." - Sorensen
@BlankXavier Finding a URL for the ARM Architecture Reference Manual is left as an exercise. - Sorensen
@Chen: I think I have the URL for the ARM ARM on arm.com - the reason I ask for a ref is because it is not obvious how to find that subsection, or that it exists in the on-line version; and a google.com site:arm.com search does not find that string or substrings of it. - Sloppy
@BlankXavier I had no trouble searching on the text. - Sorensen
@Chen: I limited that particular substring search to arm.com... :-/ also I was looking for the ARM ARM; I saw when searching references to the errata but dismissed them... anyways, I have it now. Thank you! - Sloppy
there are multiple arm-arms, should use the one most closely related to the family/core. Also will need the trm, and the amba/axi spec. start by searching for ldrex or strex (in the arm arm or trm) and then maybe the word exclusive or shared. - Hegelian
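
To make the two points from this thread concrete (word alignment means the two low address bits are zero, and the ERG is 2^a bytes with a between 3 and 11), here is a minimal sketch in C. The helper names are made up for illustration, and the real value of a is implementation defined, so it has to come from the documentation for your core.

```c
#include <assert.h>
#include <stdint.h>

/* ARM "word" alignment: the two least significant address bits are zero. */
static int is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}

/* ERG size in bytes for tag shift a, where Tagged_address = addr[31:a]. */
static uint32_t erg_bytes(unsigned a)
{
    assert(a >= 3 && a <= 11);   /* range quoted from the ARM ARM */
    return (uint32_t)1 << a;
}

/* Nonzero if a len-byte access at addr touches exactly one granule. */
static int within_one_granule(uint32_t addr, uint32_t len, unsigned a)
{
    assert(len > 0);
    return (addr >> a) == ((addr + len - 1) >> a);
}
```

Because the granule size is a power of two no smaller than 8 bytes, a word-aligned 4-byte access always lands inside a single granule; only an unaligned access such as 4 bytes at 0x341BE with a = 4 can straddle two.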
H
2

Exclusive access restrictions

The following restrictions apply to exclusive accesses:

• The size and length of an exclusive write with a given ID must be the same as the size and length of the preceding exclusive read with the same ID.

• The address of an exclusive access must be aligned to the total number of bytes in the transaction.

• The address for the exclusive read and the exclusive write must be identical.

• The ARID field of the read portion of the exclusive access must match the AWID of the write portion.

• The control signals for the read and write portions of the exclusive access must be identical.

• The number of bytes to be transferred in an exclusive access burst must be a power of 2, that is, 1, 2, 4, 8, 16, 32, 64, or 128 bytes.

• The maximum number of bytes that can be transferred in an exclusive burst is 128.

• The value of the ARCACHE[3:0] or AWCACHE[3:0] signals must guarantee that the slave that is monitoring the exclusive access sees the transaction. For example, an exclusive access being monitored by a slave must not have an ARCACHE[3:0] or AWCACHE[3:0] value that indicates that the transaction is cacheable.

Failure to observe these restrictions causes Unpredictable behavior.

The above is from the AMBA/AXI spec. You will find that AWLOCK/ARLOCK is ignored by some vendors (meaning ldrex/strex won't work outside the core). I have some code that demonstrates this, or at least will if you find a system that doesn't support exclusive access.

https://github.com/dwelch67/raspberrypi/tree/master/extest
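
The address and size rules from the AXI list quoted earlier (power-of-two byte count, at most 128 bytes, address aligned to the total size) can be written as a small check. This is only a sketch of those three rules; the function name and the constant are my own, and it does not model the ID, control-signal or cacheability restrictions.

```c
#include <assert.h>
#include <stdint.h>

#define AXI_EXCL_MAX_BYTES 128u   /* maximum exclusive burst size from the quoted rules */

static_assert((AXI_EXCL_MAX_BYTES & (AXI_EXCL_MAX_BYTES - 1u)) == 0,
              "the size limit itself must be a power of two");

/* Returns 1 if an exclusive access of total_bytes at addr satisfies the
 * quoted size and alignment rules, 0 otherwise. */
static int axi_exclusive_ok(uint64_t addr, uint32_t total_bytes)
{
    if (total_bytes == 0 || total_bytes > AXI_EXCL_MAX_BYTES)
        return 0;
    if (total_bytes & (total_bytes - 1u))   /* not a power of two */
        return 0;
    return (addr & (uint64_t)(total_bytes - 1u)) == 0;   /* aligned to total size */
}
```

For example, a 4-byte exclusive access at 0x1000 passes, while the same access at 0x1002 or a 12-byte access at 0x1000 does not.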

Depending on the task and how portable you want to be, you may need to provide both swp and ldrex/strex solutions surrounded by ifdefs, and/or use the plethora of registers available at runtime to tell you which instructions are or are not supported by the core you are running on. (You may find that in at least one case neither swp nor ldrex/strex is supported.)
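
As a rough illustration of the ifdef approach, here is a sketch of a 32-bit atomic exchange that picks an implementation at compile time. Assumptions: a GCC or Clang toolchain, the ACLE macro __ARM_FEATURE_LDREX (bit 2 set means word-sized ldrex/strex is available), and ARM (not Thumb) state for the swp path. The function name xchg32 is made up, the ARM paths contain no memory barriers, and the runtime register probing mentioned above is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically store v to *p and return the previous value. */
static inline uint32_t xchg32(volatile uint32_t *p, uint32_t v)
{
    assert(p != NULL);
#if defined(__arm__) && defined(__ARM_FEATURE_LDREX) && (__ARM_FEATURE_LDREX & 4)
    uint32_t old, failed;
    do {
        /* strex writes 0 to 'failed' on success, 1 if the reservation was lost */
        __asm__ volatile(
            "ldrex %0, [%2]\n\t"
            "strex %1, %3, [%2]"
            : "=&r"(old), "=&r"(failed)
            : "r"(p), "r"(v)
            : "memory");
    } while (failed);
    return old;
#elif defined(__arm__) && !defined(__thumb__)
    /* Older cores: fall back to the swp instruction. */
    uint32_t old;
    __asm__ volatile("swp %0, %1, [%2]"
                     : "=&r"(old)
                     : "r"(v), "r"(p)
                     : "memory");
    return old;
#else
    /* Not ARM: let the compiler builtin choose the instruction. */
    return __atomic_exchange_n(p, v, __ATOMIC_SEQ_CST);
#endif
}
```

Note that this only selects what the compiler is allowed to emit. Whether the exclusive access actually works beyond the core still depends on the memory system, as described above.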

Hegelian answered 8/7, 2012 at 15:32 Comment(0)
P
1

On Intel, the arguments to CMPXCHG do NOT need to be cache aligned. Try it, you will see that it works.

But you are correct: in cacheable memory, Intel does use the cache protocol to implement CMPXCHG. So you would be smart not to put two independent, heavily used synchronization variables in the same cache line: if two processors were synchronizing on these different variables, the cache line might thrash back and forth. But this is exactly the same issue as for any data: you don't want different processors writing to the same cache line at the same time. That is false sharing.
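
If you do want to keep two hot synchronization variables out of each other's cache line, one common way is to over-align each one. A minimal C11 sketch, assuming a 64-byte cache line (check your CPU; the struct names are made up):

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Aligning the member forces the whole struct to occupy its own line. */
struct padded_lock {
    alignas(CACHE_LINE) atomic_int word;
};

struct two_locks {
    struct padded_lock a;
    struct padded_lock b;
};

static_assert(sizeof(struct padded_lock) == CACHE_LINE,
              "each lock fills exactly one cache line");
static_assert(offsetof(struct two_locks, b) - offsetof(struct two_locks, a) == CACHE_LINE,
              "the two locks never share a cache line");
```

This is purely a performance measure against false sharing; it is not needed for correctness.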

But you certainly can have locks that are not cache-line aligned:

struct Foo {
  int data;
  Lock lock;
  int data_after;
};

You can put different locks in the same cacheline:

struct Foo {
  int data;
  Lock read_lock;
  int data_between;
  Lock write_lock;
  int data_after;
};

Since reading and writing tend to be mutually exclusive, there may be no lossage.
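
To show that a compare-and-swap on such an embedded lock works without any cache-line alignment, here is a small C11 sketch. The Lock type is not shown in the snippets above, so I am assuming it is a plain atomic int (0 = free, 1 = held); try_acquire and release_lock are hypothetical helper names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef atomic_int Lock;   /* assumption: the original Lock type is not shown */

struct Foo {
    int data;
    Lock lock;             /* sits right after data, not on a line boundary */
    int data_after;
};

static_assert(offsetof(struct Foo, lock) != 0,
              "the lock is not at the start of the struct");

/* Try to take the lock with a single compare-and-swap. Returns 1 on success. */
static int try_acquire(Lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(l, &expected, 1);
}

static void release_lock(Lock *l)
{
    atomic_store(l, 0);
}
```

On x86 a compiler will typically turn the compare-and-swap into LOCK CMPXCHG, and on ARM into an ldrex/strex loop (or a newer equivalent). Either way the lock only needs its natural alignment.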

By the way, in uncached memory Intel does not use the cache snooping protocol for atomic operations like CMPXCHG. So there is less reason to cache line align synchronization variables. But you still may want to: many memory subsystems interleave by cacheline size, even when uncached.

And as for ARM: it is pretty much the same.

On a snoopy bus, or uncached, you may not need to worry too much about cache line alignment.

But in a clustered cache hierarchy, you have exactly the same issues as on x86. More so, in fact: it is well known how to "export" operations like CMPXCHG, but not ARM ldrexd/strexd.

Pitchstone answered 18/7, 2013 at 0:37 Comment(0)
