Thank you very much for the new €Assembler version — it works like a charm!
I have one suggestion that I would like to discuss.
A long time ago, I worked with the MACRO‑11 assembler on a DEC PDP‑11 computer.
Manual:
https://bitsavers.org/pdf/dec/pdp11/rst ... _Oct87.pdf
In section 5‑2 (page 46), the addressing modes are listed, and one thing I am missing in Intel‑style assembler syntax is the **Autoincrement Mode**.
You can see how it is used on page 3‑11 (page 37), Figure 3‑1:
1$: CLR (R0)+
CMP R0, #IMPURT ; Test whether the end has been reached
BNE 1$
Here, `CLR (R0)+` clears the word at the address held in `R0`, then increments `R0` (by 2, because the operand is a word). As a result, `R0` points to the next word on the next iteration. This way, we do not need a dedicated increment instruction: the increment is combined with `CLR`. As far as I know, this is directly supported by the DEC CPU. Table 5‑3 shows that this works with any operation, such as `OPR R, (R)+`.
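To make the semantics concrete, here is a minimal C sketch of the loop above, assuming 16-bit words. The function name `clear_words` and its parameters are my own and do not come from the manual; the sketch only models what `CLR (R0)+` does, not how the PDP-11 implements it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the MACRO-11 loop:
 *   1$: CLR (R0)+ / CMP R0, #IMPURT / BNE 1$
 * Clears `count` 16-bit words starting at `start` and returns the
 * pointer just past the cleared region. Like the original loop, the
 * body runs at least once, so `count` must be >= 1. */
uint16_t *clear_words(uint16_t *start, size_t count)
{
    uint16_t *r0  = start;          /* plays the role of R0      */
    uint16_t *end = start + count;  /* plays the role of #IMPURT */

    do {
        *r0++ = 0;                  /* CLR (R0)+ : store, then advance by one word */
    } while (r0 != end);            /* CMP R0, #IMPURT / BNE 1$ */

    return r0;
}
```

The single statement `*r0++ = 0` does both the store and the pointer advance, which is the effect I would like to be able to write in one instruction line.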
Perhaps you could add similar syntactic sugar to your assembler as well?
For example, consider this code, which adds two byte arrays `a` and `b` of length `n` and writes the result to `c`:
EUROASM CPU=X64, SIMD=AVX2, AMD=ENABLED
AsmDLL64 PROGRAM FORMAT=DLL, MODEL=FLAT, WIDTH=64
EXPORT add_bytes_avx2
; void add_bytes_avx2(const uint8_t* a,
; const uint8_t* b,
; uint8_t* c,
; size_t n);
add_bytes_avx2 PROC
test r9, r9
jz done
; number of full 32-byte blocks
mov r10, r9
shr r10, 5 ; r10 = n / 32
jz tail
avx_loop:
vmovdqu ymm0, [rcx]
vmovdqu ymm1, [rdx]
vpaddb ymm0, ymm0, ymm1
vmovdqu [r8], ymm0
add rcx, 32 ; < next 32 bytes
add rdx, 32
add r8, 32
dec r10
jnz avx_loop
tail:
; remaining bytes
and r9, 31
jz done
tail_loop:
mov al, [rcx]
add al, [rdx]
mov [r8], al
inc rcx, rdx, r8
dec r9
jnz tail_loop
done:
vzeroupper ; important for ABI
ret
ENDP add_bytes_avx2
ENDPROGRAM AsmDLL64
My suggestion is to add post‑increment support, for example:
vmovdqu ymm0, [rcx]
add rcx, 32
; would be written as
vmovdqu ymm0, [rcx]+ ; < auto increment
or
mov al, [rcx]
inc rcx
; would be written as
mov al, [rcx]+ ; < auto increment
With this feature, the full code above could look like this, shorter and more elegant:
EUROASM CPU=X64, SIMD=AVX2, AMD=ENABLED
AsmDLL64 PROGRAM FORMAT=DLL, MODEL=FLAT, WIDTH=64
EXPORT add_bytes_avx2
; void add_bytes_avx2(const uint8_t* a,
; const uint8_t* b,
; uint8_t* c,
; size_t n);
add_bytes_avx2 PROC
test r9, r9
jz done
; number of full 32-byte blocks
mov r10, r9
shr r10, 5 ; r10 = n / 32
jz tail
avx_loop:
vmovdqu ymm0, [rcx]+
vmovdqu ymm1, [rdx]+
vpaddb ymm0, ymm0, ymm1
vmovdqu [r8]+, ymm0
dec r10
jnz avx_loop
tail:
; remaining bytes
and r9, 31
jz done
tail_loop:
mov al, [rcx]+
add al, [rdx]+
mov [r8]+, al
dec r9
jnz tail_loop
done:
vzeroupper ; important for ABI
ret
ENDP add_bytes_avx2
ENDPROGRAM AsmDLL64
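For comparison, here is how the scalar tail loop reads in C, where post-increment is already part of the language. This is only a reference sketch for checking results; the name `add_bytes_ref` is hypothetical, and the function processes all `n` bytes one at a time instead of using AVX2.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for add_bytes_avx2:
 * c[i] = a[i] + b[i], wrapping modulo 256 like `add al, [rdx]`
 * and `vpaddb`. */
void add_bytes_ref(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n)
{
    while (n-- != 0) {
        /* Each pointer is read or written, then advanced by one byte,
         * mirroring `mov al, [rcx]+`, `add al, [rdx]+`, `mov [r8]+, al`. */
        *c++ = (uint8_t)(*a++ + *b++);
    }
}
```

The loop body is one line because the pointer updates ride along with the memory accesses, which is the readability gain the proposed syntax aims for.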
This could also be extended to post‑ and pre‑increments and decrements:
mov al, [rcx]+ == mov al, [rcx] & inc rcx
mov al, [rcx]- == mov al, [rcx] & dec rcx
mov al, +[rcx] == inc rcx & mov al, [rcx]
mov al, -[rcx] == dec rcx & mov al, [rcx]
The amount of increment or decrement would be inferred from the data size and context: 1 for byte, 2 for word, and so on, up to 64 for AVX‑512.
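The four forms and the size-based step can be modelled in a few lines of C. This is a sketch of the intended semantics only, not of how the assembler would implement them; `auto_mode` and `load_auto` are names I made up for illustration, and `size` stands for the operand width in bytes (1, 2, 4, 8, 16, 32 or 64).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the four proposed forms:
 *   [r]+  POST_INC    [r]-  POST_DEC    +[r]  PRE_INC    -[r]  PRE_DEC */
typedef enum { POST_INC, POST_DEC, PRE_INC, PRE_DEC } auto_mode;

/* Copies `size` bytes from the address in *ptr into dst, adjusting
 * *ptr by `size` before or after the access, depending on `mode`. */
void load_auto(void *dst, const uint8_t **ptr, size_t size, auto_mode mode)
{
    if (mode == PRE_INC)  *ptr += size;   /* +[r] : adjust first  */
    if (mode == PRE_DEC)  *ptr -= size;   /* -[r] : adjust first  */

    memcpy(dst, *ptr, size);              /* the memory access itself */

    if (mode == POST_INC) *ptr += size;   /* [r]+ : adjust afterwards */
    if (mode == POST_DEC) *ptr -= size;   /* [r]- : adjust afterwards */
}
```

For example, a byte load with `POST_INC` moves the pointer by 1, while a 32-byte load moves it by 32, matching `mov al, [rcx]+` and `vmovdqu ymm0, [rcx]+` above.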
This could shorten the source code and improve readability.
Just an idea — what do you think?
