How to use aligned data move SSE instructions (movaps) in Delphi XE3?

I was trying to run the following:

type
  Vector = array [1..4] of Single;

{$CODEALIGN 16}
function add4(const a, b: Vector): Vector; register; assembler;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;

It raises an access violation on movaps. As far as I know, movaps only works if the memory location is 16-byte aligned. The code runs without problems if I use movups instead, which has no alignment requirement.

So my question is about {$CODEALIGN}: in Delphi XE3 it does not seem to work in this case.
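
For reference, the unaligned variant mentioned above can be sketched as a small console program. It reuses the Vector type and the operand syntax from the question and, like the question's later example, assumes the Win64 target; the names AddUnaligned and add4u are illustrative only.

```pascal
program AddUnaligned;

{$APPTYPE CONSOLE}

{$IFNDEF CPUX64}
  {$MESSAGE FATAL 'this sketch is for the x64 target only'}
{$ENDIF}

type
  Vector = array [1..4] of Single;

// Same routine as in the question, but with MOVUPS, which accepts
// memory operands at any address (add4u is a hypothetical name).
function add4u(const a, b: Vector): Vector;
asm
  movups xmm0, [a]        // load four Singles from a, no alignment needed
  movups xmm1, [b]        // load four Singles from b
  addps  xmm0, xmm1       // element-wise addition
  movups [@result], xmm0  // store the four sums in the result buffer
end;

var
  a, b, c: Vector;
  i: Integer;
begin
  for i := 1 to 4 do
  begin
    a[i] := i;
    b[i] := 10 * i;
  end;
  c := add4u(a, b);
  for i := 1 to 4 do
    Writeln(c[i]:0:1);
end.
```

Because MOVUPS does not depend on where the compiler happens to place a, b and c, this version does not rely on alignment by chance.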

EDIT

Very strange... I tried the following.

program Project3;

{$APPTYPE CONSOLE}

uses
  windows;  // if not using windows, no errors at all

type
  Vector = array [1..4] of Single;

function add4(const a, b: Vector): Vector;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;

procedure test();
var
  v1, v2: vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1,v2);  // this works
end;

var
  a, b, c: Vector;

begin
  {$ifndef cpux64}
    {$MESSAGE FATAL 'this example is for x64 target only'}
  {$else}
  test();
  c := add4(a, b); // raises an AV here
  {$endif}
end.

If the Windows unit is not in the uses clause, everything is fine. If it is, an exception is raised at c := add4(a, b), but not in test().

Can anyone explain this?

EDIT: It all makes sense to me now. My conclusions for 64-bit Delphi XE3 are:

  1. Stack frames on x64 are aligned to 16 bytes (as required); {$CODEALIGN 16} aligns the code of procedures and functions to 16 bytes.
  2. A dynamic array lives on the heap, which can be set to 16-byte alignment using SetMinimumBlockAlignment(mba16Byte).
  3. However, stack variables are not always 16-byte aligned. For example, if you declare an Integer variable before v1 and v2 in test() in the example above, the example no longer works.
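
One way to observe these conclusions is to print where the variables actually land. The following is a minimal sketch, not a fix: Offset16 and the program name are hypothetical, and the printed values depend on the compiler version, target and surrounding declarations, so no particular output is claimed.

```pascal
program AlignCheck;

{$APPTYPE CONSOLE}

type
  Vector = array [1..4] of Single;

// Hypothetical helper: offset of an address within its 16-byte block.
// 0 means the address is usable as a MOVAPS operand.
function Offset16(const p: Pointer): NativeUInt;
begin
  Result := NativeUInt(p) and 15;
end;

procedure Test;
var
  n: Integer;      // the extra local mentioned in conclusion 3
  v1, v2: Vector;
begin
  n := 0;
  Writeln('n : ', Offset16(@n));
  Writeln('v1: ', Offset16(@v1));
  Writeln('v2: ', Offset16(@v2));
end;

var
  a, b: Vector;
begin
  Test;
  Writeln('a : ', Offset16(@a));
  Writeln('b : ', Offset16(@b));
end.
```

Adding or removing locals, globals or units and running it again shows how easily the offsets change.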
Coulter answered 4/4, 2013 at 1:57 Comment(3)
CODEALIGN aligns code. If you want to align data you can use the ALIGN directive. - Encase
I tried {$ALIGN 16} as well, and it does not work. - Coulter
movaps xmm1, [b] is pointless. Use addps xmm0, [b] if your inputs are aligned. - Rescission

You need your data to be 16 byte aligned. That requires some care and attention. You can make sure that the heap allocator aligns to 16 bytes. But you cannot make sure that the compiler will 16 byte align your stack allocated variables because your array has an alignment property of 4, the size of its elements. And any variables declared inside other structures will also have 4 byte alignment. Which is a tough hurdle to clear.

I don't think you can solve your problem in the currently available versions of the compiler. At least not unless you forgo stack allocated variables which I'd guess to be too bitter a pill to swallow. You might have some luck with an external assembler.
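
A workaround that is sometimes used, and that is not part of this answer, is to over-allocate a local byte buffer and round its address up by hand, so that the compiler's placement no longer matters. The sketch below assumes that approach; TVector, PVector, Buf and the program name are illustrative.

```pascal
program StackAlign;

{$APPTYPE CONSOLE}

type
  TVector = array [1..4] of Single;
  PVector = ^TVector;

procedure Test;
var
  // 15 spare bytes guarantee that a 16-byte aligned TVector fits
  // somewhere inside the buffer, wherever the compiler places it.
  Buf: array [0..SizeOf(TVector) + 14] of Byte;
  V: PVector;
begin
  // Round the buffer address up to the next multiple of 16.
  V := PVector(((NativeUInt(@Buf[0]) + 15) shr 4) shl 4);
  V^[1] := 1;
  // By construction the low four address bits are zero here.
  Writeln('offset within 16-byte block: ', NativeUInt(V) and 15);
end;

begin
  Test;
end.
```

The cost is that the vector has to be accessed through a pointer, which is the kind of concession the answer describes.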

Faythe answered 4/4, 2013 at 20:21 Comment(10)
OK... is it easier to do so in C++? - Coulter
The MS compiler doesn't allow 64-bit inline asm. I expect gcc does. And I'm sure gcc will give you the ability to align stack variables. - Faythe
Very strange behavior, please have a look at my latest edits. - Coulter
Not strange at all. Your array has 4-byte alignment. By chance it might land on a 16-byte boundary. And then the code works. By chance. - Faythe
Then why do v1 and v2 (declared in the test function) work? If the Windows unit is added, it doesn't work for a, b and c defined globally, but it works every time when the Windows unit is not added. - Coulter
Because changing the code changes where the globals happen to be located. The locals in the test function probably work every time because the stack is 16-byte aligned. But again, code changes could break that. - Faythe
Thanks! Now it is all clear to me. I've added a short conclusion. Feel free to change it. - Coulter
Point 1 is not quite right. CODEALIGN makes sure that procedure entry points are aligned on the specified boundaries. The x64 ABI specifies that stack pointers are 16-byte aligned. The code location is a different thing from the stack frame. Remember that each invocation gets a new stack frame, but the code is at a fixed location. - Faythe
Thanks... so what are the benefits of aligning procedure entry points to 16 bytes? - Coulter
There can be performance benefits but it's not something I've ever encountered. - Faythe

You can write your own memory allocation routines that allocate aligned data in the heap. You can specify your own alignment size (not just 16 bytes but also 32 bytes, 64 bytes and so on...):

    procedure GetMemAligned(const bits: Integer; const src: Pointer;
      const SrcSize: Integer; out DstAligned, DstUnaligned: Pointer;
      out DstSize: Integer);
    var
      Bytes: NativeInt;
      i: NativeInt;
    begin
      if src <> nil then
      begin
        i := NativeInt(src);
        i := i shr bits;
        i := i shl bits;
        if i = NativeInt(src) then
        begin
          // the source is already aligned, nothing to do
          DstAligned := src;
          DstUnaligned := src;
          DstSize := SrcSize;
          Exit;
        end;
      end;
      Bytes := 1 shl bits;
      DstSize := SrcSize + Bytes;
      GetMem(DstUnaligned, DstSize);
      FillChar(DstUnaligned^, DstSize, 0);
      i := NativeInt(DstUnaligned) + Bytes;
      i := i shr bits;
      i := i shl bits;
      DstAligned := Pointer(i);
      if src <> nil then
        Move(src^, DstAligned^, SrcSize);
    end;

    procedure FreeMemAligned(const src: Pointer; var DstUnaligned: Pointer;
      var DstSize: Integer);
    begin
      if src <> DstUnaligned then
      begin
        if DstUnaligned <> nil then
          FreeMem(DstUnaligned, DstSize);
      end;
      DstUnaligned := nil;
      DstSize := 0;
    end;

Then work with pointers, and use procedures that take a third argument to return the result.

You can also use functions, but it is less obvious how the result is returned.

type
  PVector = ^TVector;
  TVector  = packed array [1..4] of Single;

Then allocate these objects this way:

const
   SizeAligned = SizeOf(TVector);
var
   DataUnaligned, DataAligned: Pointer;
   SizeUnaligned: Integer;
   V1: PVector;
begin
  GetMemAligned(4 {align by 4 bits, i.e. by 16 bytes}, nil, SizeAligned, DataAligned, DataUnaligned, SizeUnaligned);
  V1 := DataAligned;
  // now you can work with your vector via V1^ - it is aligned by 16 bytes and stays in the heap

  FreeMemAligned(nil, DataUnaligned, SizeUnaligned);
end;

As you may have noticed, we passed nil to GetMemAligned and FreeMemAligned. This parameter is needed when we want to align existing data, for example data that we have received as a function argument.

Just use straight register names rather than parameter names in assembly routines. You will not break anything by doing so when using the register calling convention. Otherwise you risk modifying registers without realizing that the parameter names are just aliases for them.

Under Win64, with the Microsoft calling convention, the first parameter is always passed in RCX, the second in RDX, the third in R8, the fourth in R9, and the rest on the stack. A function returns its result in RAX. But if a function returns a structure ("record"), the result is not returned in RAX but through an implicit argument, by address. Your function may modify the following registers: RAX, RCX, RDX, R8, R9, R10, R11. The rest must be preserved. See https://msdn.microsoft.com/en-us/library/ms235286.aspx for more details.

Under Win32, with the Delphi register calling convention, a call passes the first parameter in EAX, the second in EDX, the third in ECX, and the rest on the stack.

The following table summarizes the differences:

         64     32
         ---   ---
    1)   rcx   eax
    2)   rdx   edx
    3)   r8    ecx
    4)   r9    stack
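
To make the table concrete, here is a small sketch that uses the register names directly on both targets. AddInts and the program name are hypothetical, and the code assumes Windows targets only, since CPUX64 is also defined on non-Windows 64-bit platforms where a different convention applies.

```pascal
program RegDemo;

{$APPTYPE CONSOLE}

// Hypothetical example: adds two integers using the argument
// registers from the table above instead of the parameter names.
function AddInts(a, b: NativeInt): NativeInt;
asm
{$IFDEF CPUX64}
  lea rax, [rcx + rdx]   // Win64: a in RCX, b in RDX, result in RAX
{$ELSE}
  lea eax, [eax + edx]   // Win32 register: a in EAX, b in EDX, result in EAX
{$ENDIF}
end;

begin
  Writeln(AddInts(2, 40));
end.
```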

So, your function will look like this (32-bit):

procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
asm
  movaps xmm0, [eax]
  movaps xmm1, [edx]
  addps xmm0, xmm1
  movaps [ecx], xmm0
end;

Under 64-bit:

procedure add4(const a, b: TVector; out Result: TVector); register; assembler;
asm
  movaps xmm0, [rcx]
  movaps xmm1, [rdx]
  addps xmm0, xmm1
  movaps [r8], xmm0
end;

By the way, according to Microsoft, floating point arguments in the 64-bit calling convention are passed directly in the XMM registers: the first in XMM0, the second in XMM1, the third in XMM2, the fourth in XMM3, and the rest on the stack. This applies to scalar values such as Single and Double, so you can pass those by value, not by reference.
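
A minimal sketch of that last point, assuming the Win64 target and two scalar Single parameters. AddSingles and the program name are hypothetical and not part of the answer.

```pascal
program XmmArgs;

{$APPTYPE CONSOLE}

{$IFNDEF CPUX64}
  {$MESSAGE FATAL 'this sketch is for the Win64 target only'}
{$ENDIF}

// Hypothetical example: both Singles are passed by value in XMM
// registers, so no memory operand (and no alignment issue) is involved.
function AddSingles(a, b: Single): Single;
asm
  addss xmm0, xmm1   // a arrives in XMM0, b in XMM1; the result is returned in XMM0
end;

begin
  Writeln(AddSingles(1.5, 2.25):0:2);
end.
```

This only helps for individual scalars. A four-element vector type such as TVector is still passed by address.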

Conchitaconchobar answered 14/7, 2017 at 7:48 Comment(0)

Use this to make the built-in memory manager allocate with 16-byte alignment:

SetMinimumBlockAlignment(mba16Byte);
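
As a rough sketch of how this can be checked, the program below sets the alignment first and then inspects a block obtained from GetMem. It assumes the default memory manager is in use; the program name is illustrative. The setting only concerns heap blocks allocated after the call, not global or local variables.

```pascal
program HeapAlign;

{$APPTYPE CONSOLE}

var
  P: Pointer;
begin
  // Request 16-byte alignment before making any allocation that needs it.
  SetMinimumBlockAlignment(mba16Byte);
  GetMem(P, 4 * SizeOf(Single));
  try
    // Offset of the block within its 16-byte boundary; 0 means aligned.
    Writeln('offset within 16-byte block: ', NativeUInt(P) and 15);
  finally
    FreeMem(P);
  end;
end.
```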

Also, as far as I know, both "register" and "assembler" are redundant directives so you can skip those from your code.

--

Edit: you mention this is for x64. I just tried the following in Delphi XE2 compiled for x64 and it works here.

program Project3;

type
  Vector = array [1..4] of Single;

function add4(const a, b: Vector): Vector;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;

procedure f();
var
  v1,v2 : vector;
begin
  v1[1] := 1;
  v2[1] := 1;
  v1 := add4(v1,v2);
end;

begin
  {$ifndef cpux64}
  {$MESSAGE FATAL 'this example is for x64 target only'}
  {$else}
  f();
  {$endif}
end.
Bucksaw answered 4/4, 2013 at 7:38 Comment(20)
I tried your solution, and it doesn't work. As far as I can see, the implementation of SetMinimumBlockAlignment has no effect under 64-bit; it has a comment {16-byte alignment is required under 64-bit.}. - Coulter
No, it doesn't work on my PC; it throws an access violation exception. - Coulter
@DoctorLai Do you compile for 64-bit? Set Target Platform "64-bit Windows" in Project Manager in the Delphi IDE. - Bucksaw
@DoctorLai Where in the program do you call SetMinimumBlockAlignment? - Sinuation
@VilleKrumlinde Yes, it is for 64-bit. - Coulter
@Sinuation At the beginning of the console application, right after 'begin'. - Coulter
@DoctorLai I've used it in the initialization section, because "Existing allocations are not affected if this setting is changed". But it seems that the setting affects dynamic memory allocation (GetMem etc.), not static variables, whether global or on the stack (local), in Win32. - Sinuation
@Sinuation OK, thanks. It makes more sense now. But it is a pity that this can't be used in a simple function. - Coulter
@DoctorLai Another suggestion is to skip alignment and use MOVUPS instead. I've heard the performance difference is small on newer CPUs. - Bucksaw
This only works if the data lives in a correctly aligned structure. You need 16-byte alignment as well as memory manager support. Also, your example uses stack variables. They cannot be 16-byte aligned, I think. - Faythe
@DoctorLai What is your real task? A significant gain from SSE is usually achieved when processing big data arrays. It is not difficult to allocate them dynamically. - Sinuation
@DavidHeffernan On x64 the stack is always kept aligned at 16-byte boundaries (for reference, Google "x64 stack alignment"). See my updated code, which works with the x64 target. - Bucksaw
It's not enough. The compiler still fails to align individual local variables. - Faythe
@DavidHeffernan You're right. What a shame, I would have expected the compiler to at least align stack variables whose address is used (via pointer, @, var parameter etc.). - Bucksaw
@doctor Why did you change the accept? Do you disagree with what I say about stack alignment? - Faythe
@Sinuation I want to speed up vector operations such as dot product, cross product etc. - Coulter
@DavidHeffernan I tried the code example, and it works now. Sorry, I may not fully understand why it works now. Let me read all the comments again. - Coulter
The Delphi compiler won't align each variable. So even if the stack frame itself is 16-byte aligned, individual variables may be misaligned. It has always been this way. It's been a performance issue forever, with slow access to doubles on x86. - Faythe
@VilleKrumlinde Yes, your solution works. However, if the vectors v1 and v2 are declared as global variables, it will not work. - Coulter
The big mistake in all the "it works" comments here is that, if by chance the data happens to be aligned, the code won't fail. But it's still broken, because alignment is achieved by chance and not guaranteed. - Faythe

To ensure proper alignment of the fields in an unpacked record type, the compiler inserts an unused byte before fields with an alignment of 2, and up to 3 unused bytes before fields with an alignment of 4, if required. Finally, the compiler rounds the total size of the record upward to the byte boundary specified by the largest alignment of any of the fields.

https://docwiki.embarcadero.com/RADStudio/Alexandria/en/Internal_Data_Formats_(Delphi)#Record_Types

For the following record, compiled with {$ALIGN 16}, we get the table below:

type
  TAlignType = Extended;
  TRec = record
    first : byte;
    second: TAlignType;
  end;

 TAlignType      |  Byte    Word    Integer     Int64     Extended
-------------------------------------------------------------------
 Record Align    |    1       2        4          8          16
 Record Size     |    2       4        8          16         32
                 |
 Align bytes     |    0       1        3          7          15
after field first|

Thus, in order for the structure to be aligned to 16 bytes, you must add a field larger than 8 bytes.

The following example works fine on x86 and x64. I added a dummy field of type Extended.

{$EXTENDEDCOMPATIBILITY ON} // To use 10 byte size for Win64. Extended type has 8 byte size for Win64 and 10 byte for Win32.
{$ALIGN 16}
type
  Vector = record
    case Integer of
      0: (dw: array[0..4-1]of Single);
      8: (a: Extended); // 10 byte field to use alignment of record field
      // https://docwiki.embarcadero.com/RADStudio/Alexandria/en/Internal_Data_Formats_(Delphi)#Record_Types
  end;

function add4(const a, b: Vector): Vector;
asm
  movaps xmm0, [a]
  movaps xmm1, [b]
  addps xmm0, xmm1
  movaps [@result], xmm0
end;

procedure test();
var
  dump: Integer; // 4 byte
  a, b: Vector;
begin
  a.dw[0] := 1;
  b.dw[0] := 1;
  a := add4(a, b);  // this works
end;

var
  a, b, c: Vector;
begin
  test();
  c := add4(a, b);
  Readln;
end.
Napoleonnapoleonic answered 11/7, 2023 at 10:21 Comment(3)
So this {$ALIGN 16} is a new thing? - Coulter
Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, can you edit your answer to include an explanation of what you're doing and why you believe it is the best approach? - Daveen
The compiler directive {$ALIGN 16} affected the alignment of the record once I added a field of type Extended to it. - Napoleonnapoleonic
