Why is there no Z80 like LDIR functionality in C/C++/rtl?
Asked Answered
H

14

4

In Z80 machine code there is a cheap technique for initializing a buffer to a fixed value, say all blanks. A chunk of code might look something like this.

LD HL, DESTINATION             ; point to the source
LD DE, DESTINATION + 1         ; point to the destination
LD BC, DESTINATION_SIZE - 1    ; copying this many bytes
LD (HL), 0X20                  ; put a seed space in the first position
LDIR                           ; move 1 to 2, 2 to 3...

The result being that the chunk of memory at DESTINATION is completely blank filled. I have experimented with memmove and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.

Why do memmove and memcpy behave this way?

Is there any reasonable way to do this sort of array initialization?

I am already aware of char array[size] = {0} for array initialization

I am already aware that memset will do the job for single characters.

What other approaches are there to this issue?

Hydrolysis answered 22/12, 2008 at 22:50 Comment(0)
C
7

I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don’t use, you don’t pay for". And while Dennis Ritchie may not have said it in exactly those same words, I believe that was a guiding principle informing his design of C (and the design of C by subsequent people) as well. Now you may think that if you allocate memory it should automatically be initialized to NULLs, and I'd tend to agree with you. But that takes machine cycles, and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically C and C++ try to stay out of your way--hence if you want something initialized you have to do it yourself.

Caulicle answered 22/12, 2008 at 23:35 Comment(0)
F
14

There was a quicker way of blanking an area of memory, using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways, like full-screen number countdowns including the border) came up with this technique, which is 4 times faster:

  • saves the Stack Pointer and then moves it to the end of the screen.
  • LOADs the HL register pair with zero,
  • goes into a massive loop PUSHing HL onto the Stack.
  • The Stack moves up the screen and down through memory and in the process, clears the screen.

The explanation above was taken from the review of David Webb's game Starion.

The Z80 routine might look a little like this:

  DI              ; disable interrupts which would write to the stack.
  LD HL, 0
  ADD HL, SP      ; save stack pointer
  EX DE, HL       ; in DE register
  LD HL, 0
  LD C, 0x18      ; Screen size in pages
  LD SP, 0x4000   ; End of screen
PAGE_LOOP:
  LD B, 128       ; inner loop iterates 128 times
LOOP:
  PUSH HL         ; effectively *--SP = 0; *--SP = 0;
  DJNZ LOOP       ; loop for 256 bytes
  DEC C
  JP NZ,PAGE_LOOP
  EX DE, HL
  LD SP, HL       ; restore stack pointer
  EI              ; re-enable interrupts

However, that routine is a little under twice as fast. LDIR copies one byte every 21 cycles. The inner loop copies two bytes every 24 cycles -- 11 cycles for PUSH HL and 13 for DJNZ LOOP. To get nearly 4 times as fast simply unroll the inner loop:

LOOP:
   PUSH HL
   PUSH HL
   ...
   PUSH HL         ; repeat 128 times
   DEC C
   JP NZ,LOOP

That is very nearly 11 cycles every two bytes which is about 3.8 times faster than the 21 cycles per byte of LDIR.

Undoubtedly the technique has been reinvented many times. For example, it appeared earlier in sub-Logic's Flight Simulator 1 for the TRS-80 in 1980.

Flitter answered 23/12, 2008 at 8:56 Comment(5)
It's been mumble years since I did anything with a Z80, but that looks good to me. I'd add 'LD SP, DE' at the end, of course.Barbitone
An even faster way I have used is to put several "PUSH HL" instructions inside the loop. So, if you were clearing, say, 2K of memory, you might use 16 "PUSH HL"s and only loop around 2K/16 (256) times.Afrikaans
DEC does not set zero flag. Actually it doesn't set any flag.Hammerlock
16-bit DEC doesn't set any flags, but 8-bit DEC does. Rewriting the loop to an inner loop over C and an outer loop over B would take care of that problem, as would use of DJNZ which is IIRC faster than DEC B; JNZ LOOP separately. Of course, that would require the inner loop be over B...Coroneted
As @Coroneted said, 16-bit DEC instructions don't affect flags, so an 8-bit counter has to be used (or a nested loop with 8-bit counters, if necessary). Also, I would add DI at the beginning and EI at the end to not break interrupts.Illinium
C
13

memmove and memcpy don't work that way because it's not a useful semantic for moving or copying memory. It's handy on the Z80 to be able to fill memory this way, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around: it's implemented to get the right answer (the source bytes are moved to the destination) regardless of how the blocks overlap.

If you want to fill memory, use memset, which is designed to do just what you want.

Cass answered 22/12, 2008 at 23:32 Comment(0)
M
7

The Z80 sequence you show was the fastest way to do that - in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.

memmove is designed to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte. That's part of its behavior as specified by the C and C++ standards. memcpy's behavior on overlapping ranges is undefined; it might work identically to memmove, or it might not, depending on how your compiler decides to implement it. The compiler is free to choose a method that is more efficient than memmove.

Mohl answered 23/12, 2008 at 4:2 Comment(0)
H
6

This can be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.

mov esi, source    ; set esi to be the source
lea edi, [esi + 1] ; set edi to be the source + 1
mov byte [esi], 0  ; initialize the first byte with the "seed"
mov ecx, 100h      ; set ecx to the number of bytes to copy (buffer size - 1)
rep movsb          ; do the fill

However, it is simply more efficient to set more than one byte at a time if you can.

Finally, memcpy/memmove aren't what you are looking for; those are for making copies of blocks of memory from one area to another (memmove allows source and dest to be part of the same buffer). memset fills a block with a byte of your choosing.

Hanhhank answered 23/12, 2008 at 4:21 Comment(3)
On x86 rep stosd with ecx=40h would be MUUCH faster and I think that's why we should avoid hacks but rather stick to a simple memset() call instead :)Hammerlock
@ssg: yes, rep stosd would be more efficient, but I was trying to demonstrate code which acted just like the OPs. I also noted in my post that setting more than one byte at a time would be more efficient.Hanhhank
yep, I had that hunch. I was aiming at OP's point of view rather than yours :)Hammerlock
Y
5

Why do memmove and memcpy behave this way?

Probably because there’s no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)

The languages don't specify how a given hardware implements anything. This is entirely up to the programmers of the compiler and libraries. Of course, writing your own, highly specialized version for every imaginable hardware configuration is a lot of work. That’ll be the reason.

Is there any reasonable way to do this sort of array initialization?

Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I’m fully aware that my expectations are too high and that std::memset often performs better in practice.

Yea answered 22/12, 2008 at 22:54 Comment(2)
I'm not looking for a Z80 compiler. I'm looking for an LDIR-like method to initialize buffers.Hydrolysis
I didn't answer since I don't understand the issue. The way to initialise buffers in C++ is with std::fill (or memset, or wmemset, or non-portable equivalents for larger values). Why don't you like that? What's the motivation for the requirement "must be LDIR-like", are you just fond of the idiom?Adduct
M
5

If you're fiddling at the hardware level, then some CPUs have DMA controllers that can fill blocks of memory exceedingly quickly (much faster than the CPU could ever do). I've done this on a Freescale i.MX21 CPU.

Mortensen answered 22/12, 2008 at 23:34 Comment(0)
E
3

If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() might be implemented as you describe on a compiler that targets Z80s.

memcpy() might also use a similar sequence on that compiler.

But why would compilers targeting CPUs with completely different instruction sets from the Z80 be expected to use a Z80 idiom for these types of things?

Remember that the x86 architecture has a similar set of instructions that could be prefixed with a REP opcode to have them execute repeatedly to do things like copy, fill or compare blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486) the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-oriented instructions.

Er answered 23/12, 2008 at 8:27 Comment(0)
C
2

There's also calloc that allocates and initializes the memory to 0 before returning the pointer. Of course, calloc only initializes to 0, not something the user specifies.

Calderon answered 22/12, 2008 at 22:56 Comment(0)
O
2

Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler worry about it for you. As an example, here's some code VS2005 generated for this exact case (using templated size):

template <int S>
class A
{
  char s_[S];
public:
  A()
  {
    for(int i = 0; i < S; ++i)
    {
      s_[i] = 'A';
    }
  }
  int MaxLength() const
  {
    return S;
  }
};

extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all

void test()
{
  A<5> a5;
  useA(a5, a5.MaxLength());
}

The assembler output is the following:

test PROC

[snip]

; 25   :    A<5> a5;

mov eax, 41414141H              ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al

; 26   :    useA(a5, a5.MaxLength());

lea eax, DWORD PTR a5[esp+40]
push    5               ; MaxLength()
push    eax
call    useA

It does not get any more efficient than that. Stop worrying and trust your compiler or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop and the compiler produced the identical output.

Ordure answered 23/12, 2008 at 9:14 Comment(4)
If that output was from objedump, you should pass the -C option, it'll decode c++ names :)Hanhhank
Thanks, but the output was directly from the compiler, I could of course tidy it up a bit...Ordure
Your example is not a good one because the compiler detects that the array has only 5 bytes, so does a 4 byte and a 1 byte store operation from eax. This would look quite different when using a significantly bigger array size.Jaella
Of course, but the whole point was to illustrate that fancy-pancy optimization techniques used in the olden days writing Z80 asm is not necessary any longer. Using a larger value for S would yield a call to memset(), which would most likely do rep stosd (+ alignment).Ordure
G
2

If you're on the PowerPC, _dcbz().

Gluey answered 14/1, 2009 at 18:13 Comment(0)
P
2

There are a number of situations where it would be useful to have a "memspread" function whose defined behavior was to copy the starting portion of a memory range throughout the whole thing. Although memset() does just fine if the goal is to spread a single byte value, there are times when e.g. one may want to fill an array of integers with the same value. On many processor implementations, copying a byte at a time from the source to the destination would be a pretty crummy way to implement it, but a well-designed function could yield good results. For example, start by seeing if the amount of data is less than 32 bytes or so; if so, just do a bytewise copy; otherwise check the source and destination alignment; if they are aligned, round the size down to the nearest word (if necessary), then copy the first word everywhere it goes, copy the next word everywhere it goes, etc.

I too have at times wished for a function that was specified to work as a bottom-up memcpy, intended for use with overlapping ranges. As to why there isn't a standard one, I guess nobody thought it important.

Professional answered 19/4, 2011 at 22:59 Comment(0)
V
1

memcpy() should have that behavior. memmove() doesn't by design: if the blocks of memory overlap, it copies the contents starting at the ends of the buffers to avoid that sort of behavior. But to fill a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize to the appropriate block-fill instruction (such as REP STOSB on x86 architectures).

Vanny answered 23/12, 2008 at 3:26 Comment(1)
Why should memcpy have that behaviour? On most hardware, I would be deeply disappointed with a memcpy so unoptimised that it in effect picks up and puts down one byte at a time, which is what this LDIR use relies on but which the C standard functions do not offer.Adduct
J
-1

As said before, memset() offers the desired functionality.

memcpy() is for moving around blocks of memory in all cases where the source and destination buffers do not overlap, or where dest < source.

memmove() solves the case of buffers overlapping and dest > source.

On x86 architectures, good compilers replace memset calls directly with inline assembly instructions that set the destination buffer's memory very effectively, even applying further optimizations such as filling with 4-byte values for as long as possible (if the following code isn't totally syntactically correct, blame it on my not having used x86 assembly for a long time):

lea edi,dest       ; point edi at the destination buffer
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count      ; total byte count
mov edx,ecx        ; keep a copy for the leftover bytes
shr ecx,2          ; convert to dword count
cld
rep stosd          ; fill 4 bytes at a time
test edx,2         ; 2 or 3 bytes remaining?
jz moveByte
stosw              ; store one word
moveByte:
test edx,1         ; 1 byte remaining?
jz fillDone
stosb              ; store the final byte
fillDone:

Actually this code is far more efficient than your Z80 version, as it doesn't do memory to memory, but only register to memory moves. Your Z80 code is in fact quite a hack as it relies on each copy operation having filled the source of the subsequent copy.

If the compiler is halfway good, it might be able to detect more complicated C++ code that can be broken down to memset (see the post below), but I doubt that this actually happens for nested loops, probably even invoking initialization functions.

Jaella answered 29/12, 2008 at 16:5 Comment(0)
