Where can the code be more efficient for checking if an input character is a vowel?

This assembly project reads the key presses and outputs them in a specific color. When a vowel is pressed it changes the color of the text until another vowel is pressed and does so until ESC is pressed. The colors are in a certain pattern which is why I sub colorCode, 8 when it reaches the end of the cycle. I am just looking to make it more efficient. I tried making all the compare statements into one line but wasn't successful.

INCLUDE        Macros.inc
INCLUDE     Irvine32.inc
INCLUDELIB  Irvine32.lib
.386
.STACK 4096
ExitProcess PROTO, dwExitCode:DWORD

.DATA
key       BYTE ?     
colorCode BYTE 5
max       BYTE 13

.CODE
main PROC

FindKey:
mov EAX, 50
call Delay

call ReadKey 
jz FindKey

MOV key, AL 
     cmp key, 75h
     JE UP
     CMP key, 6Fh
     JE UP
     CMP key, 69h
     JE UP
     CMP key, 65h
     JE UP
     CMP key, 61h
     JE UP
     CMP key, 55h
     JE UP
     CMP key, 4Fh
     JE UP
     CMP key, 49h
     JE UP
     CMP key, 45h
     JE UP
     CMP key, 41h
     JE UP
     CMP dx,VK_ESCAPE
     JE OVER

     COLOR:   
          MOVZX EAX, (black * 16) + colorCode
          CALL SetTextColor 
          MOV AL, key
          call WriteChar
          jmp FindKey

          UP: 
               CMP colorCode, 13
               JE RESET
               INC colorCode
               jmp COLOR
               
               RESET:
                    sub colorCode, 8
                    jmp COLOR    
               
     OVER:
     CALL Crlf
     INVOKE ExitProcess, 0
     
main ENDP
END main
Chancemedley answered 21/3, 2016 at 1:53 Comment(6)
Depends on your definition of "efficient" and other circumstances. For example, you could use a lookup table which gives you whether it's a vowel or not. That would make code shorter, but less predictable and would of course incur memory and cache overhead. You could also use SSE comparison. Turning the letters to either upper or lower case with bitwise operation first would also reduce your comparisons. - Zetes
What is SSE comparison? - Chancemedley
Vectorized comparison using SSE instructions. In this case, PCMPEQB and PTEST probably. - Zetes
I haven't heard of those in my class yet.. Is that using MASM and win32? - Chancemedley
@Chancemedley - SSE is an x86 instruction set extension that allows 128-bit parallel operations. And yes, MASM and Windows support SSE instructions, and just about any modern processor will too. Jester means you could reduce the number of comparisons by executing them in parallel. - Gamogenesis
@Jester: PTEST isn't usually useful on the output of a PCMP, because it's 2 uops and can't macro-fuse. pmovmskb / test/jcc is better in theory, according to Agner Fog's tables, but I haven't tested. In this case though, an integer immediate bitmap is probably the best way to implement the vowel test, because a 32bit bitmap has room for an entry for every letter. - Moonlit

If you're interested in efficient x86 code, see the links in the tag wiki. There's a lot of good stuff, esp. Agner Fog's guides.


You have key in AL, but your cmp instructions all use a memory operand. There's a special opcode for cmp al, imm8, so cmp al, 75h is only a 2 byte instruction. Using an absolute displacement to address key makes a much longer instruction. Also, cmp mem,imm can't macro-fuse with a conditional jump, and every one of those instructions needs a load port.

The rest of your code looks suspiciously like it uses memory operands too much, and is indented strangely. (UP looks like it's part of the COLOR block, but actually there's an unconditional jump at the end of COLOR, so it doesn't fall into UP.)


Of course, a long series of cmp/je is nowhere near optimal, since all the je targets are the same. You don't need to figure out which key actually matched.

One strategy you can use for a check like that is to see if al is in the right range, then use it as an index into a bitmap.

Compilers use this strategy (Godbolt compiler explorer) for a switch or multi-condition if like this. This is why we use compilers instead of manually writing asm most of the time: they know lots of clever tricks and can apply them where applicable. We get 1<<c for the switch, but the if actually compiles to a bt with GCC. (GCC9 has a regression where the switch compiles to a jump table, though.)
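For reference, here is a minimal sketch of the kind of C source being talked about. The function names are hypothetical (they are not from the Godbolt link), and whether a given compiler version actually turns either form into a range check plus bitmap depends on the compiler and options, so treat it as something to paste into a compiler explorer rather than a guaranteed result.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */

/* switch form: every vowel case shares one target, just like the je UP chain. */
bool is_vowel_switch(char c)
{
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
    case 'A': case 'E': case 'I': case 'O': case 'U':
        return true;
    default:
        return false;
    }
}

/* multi-condition if form, after folding case with OR 0x20.
 * For each lowercase vowel, the only other character that maps onto it
 * is its uppercase version, so no extra range check is needed here. */
bool is_vowel_if(char c)
{
    c |= 0x20;
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}
```

Both functions give the same answer for every input; they only differ in how the condition is spelled for the compiler.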

See my answer on another ASCII question for an explanation of the unsigned-compare trick (ja .non_alphabetic) and an example of an efficient loop.

    MOV   [key], AL    ; store for later use

    or    al,  20h     ; lowercase (assuming an alphabetic character)
    sub   al, 'a'      ; turn the ascii encoding into an index into the alphabet
    cmp   al, 'z' - 'a' ; AL is now an index (0..25 for letters), so compare against the highest valid index
    ja  .non_alphabetic

    mov   ecx, (1<<('a'-'a')) | (1<<('e'-'a')) | (1<<('i'-'a')) | (1<<('o'-'a')) | (1<<('u'-'a'))   ; might be good to pull this constant out and use an EQU to define it
    ; movzx eax, al    ; unneeded except for possible performance issues on old Intel CPUs (P6 family partial-register stuff).
    bt    ecx, eax      ; test for the letter being set in the bitmap
    jc  UP              ; jump iff al was a vowel
.non_alphabetic:
    CMP dx,VK_ESCAPE    ; this test could be first.
    JE OVER
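If it helps to see the same logic outside of asm, here is a rough C sketch of the range check plus bitmap. The names is_vowel_bitmap and VOWEL_MASK are made up for this example, and the cast to unsigned char is there to mimic the 8-bit wraparound that sub al, 'a' gives you.

```c
#include <stdbool.h>

/* One bit per letter of the alphabet: bit 0 = 'a', bit 4 = 'e', and so on. */
#define VOWEL_MASK ((1u << ('a' - 'a')) | (1u << ('e' - 'a')) | \
                    (1u << ('i' - 'a')) | (1u << ('o' - 'a')) | \
                    (1u << ('u' - 'a')))

/* Hypothetical helper mirroring the or / sub / cmp / ja / bt sequence. */
bool is_vowel_bitmap(unsigned char key)
{
    /* Fold case with OR 0x20, then subtract 'a' to get an alphabet index.
     * The cast keeps only 8 bits, so anything below 'a' wraps to a large value. */
    unsigned idx = (unsigned char)((key | 0x20) - 'a');

    /* A single unsigned compare rejects everything outside 'a'..'z'. */
    if (idx > 'z' - 'a')
        return false;

    /* Test the bit for this letter, which is what bt does. */
    return ((VOWEL_MASK >> idx) & 1u) != 0;
}
```

The range check matters: without it, characters like '[' or '1' would index outside the 26 meaningful bits.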

Or if you want to count vowels, use adc edx, 0 or something to add CF to a register, instead of branching.
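A sketch of the counting idea in C, under the assumption that you are scanning a NUL-terminated string; count_vowels and the hex form of the mask are my own choices for the example. Adding the 0-or-1 bit to a counter plays the role of adc reg, 0, though whether the compiler emits branch-free code for it is up to the compiler.

```c
#include <stddef.h>

/* Bits 0, 4, 8, 14, 20 set: a, e, i, o, u. */
#define VOWEL_MASK 0x00104111u

/* Hypothetical helper: counts vowels without an if per character. */
size_t count_vowels(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* Same case-fold and index computation as the asm (8-bit wraparound). */
        unsigned idx = (unsigned char)(((unsigned char)*s | 0x20) - 'a');

        /* (idx & 31) keeps the shift count legal in C, like the masking the
         * hardware does; (idx < 26) zeroes the result for non-letters. */
        count += (VOWEL_MASK >> (idx & 31)) & (unsigned)(idx < 26);
    }
    return count;
}
```

For example, count_vowels("Hello, World!") returns 3 (e, o, o).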

(bt masks its input, only using the low bits as the "shift count" so you don't really need movzx. But if you do need to avoid partial-register stalls on old Intel CPUs (before Sandybridge), use movzx edx, al instead of movzx eax, al. That will hurt performance less on more recent Intel CPUs: mov-elimination only works with different registers. But it still costs an extra uop for the front-end.)

This is significantly fewer instructions, and far fewer branches, so it uses up fewer branch-predictor entries.

Don't keep the constant in memory for bt: bt mem,reg is slow because of crazy-CISC semantics where it can access a different address if the bit index is higher than the operand-size. It only masks the bit-index when bt is used with a register first operand.

An alternative to bt is to do if(mask & 1 << (key - 'a')):

    movzx ecx, al      ; avoid partial-reg stall or false dep on ecx that you could get with mov ecx,eax or mov cl,al respectively
    mov   eax, 1
    shl   eax, cl      ; eax has a single set bit, at the index
    test  eax, 1<<('a'-'a') | 1<<('e'-'a') | 1<<('i'-'a') | 1<<('o'-'a') | 1<<('u'-'a')
    jnz  .vowel

This is more uops, even though test/jnz can macro-fuse, because variable-count shifts are 3 uops on Intel Sandybridge-family CPUs. (Again, crazy-CISC semantics slow things down).

Or right-shift the mask instead of creating 1<<c. You can even arrange to skip a test al,1 by having your mask right-shifted by 1 bit already, so the bit you want to branch on is shifted into CF by the shr. But on Nehalem and earlier, reading the flag-result of a variable count shift stalls the front-end until the shift retires from the back-end, and on SnB-family it's still 3 uops for a variable-count shift.
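In C terms the two shift formulations look like this; a small sketch with made-up names (vowel_shl, vowel_shr), assuming the index has already been range-checked to 0..25. The trick of branching directly on CF after the shr has no C equivalent, so that part stays asm-only.

```c
#include <stdbool.h>

/* Bits 0, 4, 8, 14, 20 set: a, e, i, o, u. */
#define VOWEL_MASK 0x00104111u

/* Build 1 << idx and AND it with the mask (the shl + test version). */
bool vowel_shl(unsigned idx)
{
    return (VOWEL_MASK & (1u << idx)) != 0;
}

/* Shift the mask right and look at the low bit (the shr version). */
bool vowel_shr(unsigned idx)
{
    return ((VOWEL_MASK >> idx) & 1u) != 0;
}
```

Both return the same result for every idx in 0..25, so picking between them is purely about which instruction sequence is cheaper on your target.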


Since comments are discussing SSE:

    ; broadcast the key to all positions of an xmm vector, and do a packed-compare against a constant
    ; assuming  AL is already zero-extended into EAX
    imul    eax, eax, 0x01010101    ; broadcast AL to EAX
    movd    xmm0, eax
    pshufd  xmm0, xmm0, 0    ; broadcast the low 32b element to all four 32b elements
    pcmpeqb xmm0, [vowels]   ; byte elements where key matches the mask are set to -1, others to 0
    pmovmskb eax, xmm0
    test    eax,eax
    jnz   .vowel


section .rodata
  align 16
  vowels: db 'a','A', 'e','E'
          db 'i','I', 'o','O'
          db 'u','U', 'a','a'
    times 4 db 'a'            ; filler out to 16 bytes avoiding false-positives

A byte broadcast (SSSE3 pshufb or AVX2 vpbroadcastb) instead of a dword broadcast (pshufd) would avoid the imul. Or use or eax,0x20 before broadcasting so we don't need upper and lower case versions of every vowel, just lowercase. Then we could just broadcast with movd + punpcklbw + pshufd or something like that.

This requires loading a constant from memory, rather than a 32bit bitmap that can efficiently be an immediate in the instruction stream, so this is probably not as good even though it only has one branch. (Remember that the bitmap version needs to branch on non-alphabetic, and then on being a vowel).
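If you'd rather experiment with the SSE idea from C, here is a rough equivalent using SSE2 intrinsics. It assumes an x86 target with SSE2 (baseline on x86-64); is_vowel_sse2 is a hypothetical name, and _mm_set1_epi8 leaves the choice of broadcast instructions to the compiler instead of spelling out imul / movd / pshufd.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdbool.h>

/* Hypothetical helper: compares the key against all vowels in one packed compare. */
bool is_vowel_sse2(unsigned char key)
{
    /* Same 16 bytes as the vowels table: both cases, padded with 'a'
     * so the filler can't cause false positives. */
    const __m128i vowels = _mm_setr_epi8('a', 'A', 'e', 'E', 'i', 'I', 'o', 'O',
                                         'u', 'U', 'a', 'a', 'a', 'a', 'a', 'a');

    __m128i k  = _mm_set1_epi8((char)key);   /* broadcast key to all 16 bytes */
    __m128i eq = _mm_cmpeq_epi8(k, vowels);  /* pcmpeqb: 0xFF where bytes match */

    return _mm_movemask_epi8(eq) != 0;       /* pmovmskb, then test for any hit */
}
```

No range check is needed in this version, since the compare is against the exact vowel bytes.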

Moonlit answered 21/3, 2016 at 3:21 Comment(0)
