
Invoke

Posted: 29 Aug 2018 22:11
by vitsoft
ar18 wrote: 29 Aug 2018 01:25 I am experienced, having written a few full-fledged 64-bit programs, like a Paint program written completely with GoAsm, so I'm very familiar with how it works and it works like this:
invoke fnTest,p1,p2,p3,p4,p5,,,p10

push p10 ; Put operands on stack, begin with the last.
.
.
.
push p5 ; = any value, whether GP, float, or double
mov r9,p4 ; for any GP value
          ; or movss xmm3,p4 for a float/real4/single precision
          ; or movsd xmm3,p4 for a double/real8
mov r8,p3 ; or movss xmm2,p3 or movsd xmm2,p3
mov rdx,p2 ; or movss xmm1,p2 or movsd xmm1,p2
mov rcx,p1 ; or movss xmm0,p1 or movsd xmm0,p1
sub rsp,8*4 ; always reserve shadow space (even if just one param being passed) unless leaf function
call fnTest
add rsp,8*(numberOfParams)
Gordon's invoke is faster, but what if I use scratch registers for passing operand values?
invoke fnTest,RDX,R8,RCX
mov r8,RCX ; Value of 2nd operand has just been overwritten!
mov rdx,R8 ; Value of 1st operand has just been overwritten!
mov rcx,RDX ; RDX no longer keeps the value of 1st operand.
sub rsp,8*4
call fnTest
add rsp,8*4
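The clobbering shown above is the classic parallel-move problem: all three argument registers have to be read before any of them is overwritten. The following Python sketch only models the register traffic. It is not taken from any assembler's Invoke macro; the helper names `run` and `sequentialize` and the choice of rax as a scratch register are assumptions made for illustration. It simulates the naive order, which leaves rcx, rdx and r8 all holding the old value of RCX, and then derives an order that parks one value in the scratch register to break the cycle.

```python
def run(seq, regs):
    """Apply (dst, src) register moves in order and return the final state."""
    regs = dict(regs)
    for dst, src in seq:
        regs[dst] = regs[src]
    return regs


def sequentialize(moves, scratch="rax"):
    """Order parallel moves {dst: src} so no source is overwritten before it is read.

    Assumption: `scratch` is a volatile register that is neither a source
    nor a destination of any move.
    """
    pending = {d: s for d, s in moves.items() if d != s}
    out = []
    while pending:
        # A destination is safe to write once no pending move still reads it.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Only cycles remain: park one register in scratch and
            # redirect every move that wanted to read it.
            d = next(iter(pending))
            out.append((scratch, d))
            for k, s in pending.items():
                if s == d:
                    pending[k] = scratch
    return out


start = {"rcx": "c0", "rdx": "d0", "r8": "e0", "rax": "a0"}

# invoke fnTest,RDX,R8,RCX expanded naively, last operand first
naive = [("r8", "rcx"), ("rdx", "r8"), ("rcx", "rdx")]

# What the call actually needs: rcx <- RDX, rdx <- R8, r8 <- RCX
safe = sequentialize({"rcx": "rdx", "rdx": "r8", "r8": "rcx"})

print(run(naive, start))
print(safe)
print(run(safe, start))
```

A macro could apply the same idea at assembly time, or simply reject operands that are themselves argument registers.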
I'm certain you probably have the equivalent of "USES" for GP and SIMD registers, am I right? To save a non-volatile xmm register, it first needs to be moved to a GP register and then pushed, or transferred directly to the stack with a movaps [rsp+?],xmm6.
Not yet. Invoke32 saves all GPRs with PUSHAD/POPAD but this is no longer possible in 64-bit mode, so the programmer must manually save and restore those nonvolatile registers which (s)he uses in the fnTest procedure. Not only GPR but SIMD, MMX, FPU... I'm not sure if it's worth the complications...?

Re: Invoke

Posted: 30 Aug 2018 14:58
by ar18
vitsoft wrote: 29 Aug 2018 22:11
ar18 wrote: 29 Aug 2018 01:25 I am experienced, having written a few full-fledged 64-bit programs, like a Paint program written completely with GoAsm, so I'm very familiar with how it works and it works like this:
invoke fnTest,p1,p2,p3,p4,p5,,,p10

push p10 ; Put operands on stack, begin with the last.
.
.
.
push p5 ; = any value, whether GP, float, or double
mov r9,p4 ; for any GP value
          ; or movss xmm3,p4 for a float/real4/single precision
          ; or movsd xmm3,p4 for a double/real8
mov r8,p3 ; or movss xmm2,p3 or movsd xmm2,p3
mov rdx,p2 ; or movss xmm1,p2 or movsd xmm1,p2
mov rcx,p1 ; or movss xmm0,p1 or movsd xmm0,p1
sub rsp,8*4 ; always reserve shadow space (even if just one param being passed) unless leaf function
call fnTest
add rsp,8*(numberOfParams)
Gordon's invoke is faster,
Actually, that isn't Gordon's invoke; that is mine, per the Win64 ABI spec. Gordon's invoke adds a 12-byte overhead for aligning the stack, which isn't necessary if the stack was aligned at the beginning of every function.
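To make the alignment point concrete: the Win64 convention requires RSP to be a multiple of 16 at the call instruction, and the shadow space always covers four 8-byte slots. The Python sketch below is a hypothetical helper (the name `call_frame_bytes` is not from any assembler). It computes how many bytes one call has to reserve in total, assuming RSP is already 16-byte aligned before the arguments are placed.

```python
def call_frame_bytes(num_params):
    """Bytes to reserve for one Win64 call: shadow space plus stack arguments.

    Assumes RSP is 16-byte aligned before the call sequence starts.
    """
    slots = max(num_params, 4)  # shadow space always covers four slots
    slots += slots % 2          # pad to an even slot count, i.e. a multiple of 16 bytes
    return 8 * slots


for n in (1, 4, 5, 10):
    print(n, call_frame_bytes(n))
```

With an odd number of stack slots, such as five parameters, one extra padding slot is needed. A push-based sequence that reserves exactly 8 bytes per parameter would leave RSP misaligned in that case.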
vitsoft wrote: 29 Aug 2018 22:11 but what if I use scratch registers for passing operand values?

invoke fnTest,RDX,R8,RCX
mov r8,RCX ; Value of 2nd operand has just been overwritten!
mov rdx,R8 ; Value of 1st operand has just been overwritten!
mov rcx,RDX ; RDX no longer keeps the value of 1st operand.
sub rsp,8*4
call fnTest
add rsp,8*4
The Win64 ABI spec says that the first four parameters of a call must go into rcx,rdx,r8, and r9, therefore it is obvious that the first four parameters of your function call should not also be rcx,rdx,r8, and r9. Only beginner programmers, not knowing how Win64 works, would make that mistake, so it is not something I would worry about.

After the call is made to the function, per the Win64 ABI it must then copy rcx,rdx,r8, and r9 into the SHADOW SPACE at the start of that function. If you wanted to treat a function as a pseudo-fastcall, and not copy the registers into the SHADOW SPACE of your function, you could get away with that violation so long as you remember to not overwrite those four registers with any code. But the Win64 ABI says that the called procedure still must always move rcx,rdx,r8, and r9 into the SHADOW SPACE for debugging purposes, therefore if you do ignore that spec, just remember that the code will not work properly with Win64 ABI compliant debuggers like the one in Visual Studio.
vitsoft wrote: 29 Aug 2018 22:11
Not yet. Invoke32 saves all GPRs with PUSHAD/POPAD but this is no longer possible in 64-bit mode, so the programmer must manually save and restore those nonvolatile registers which (s)he uses in the fnTest procedure. Not only GPR but SIMD, MMX, FPU... I'm not sure if it's worth the complications...?
If the only non-volatile register I use is RBX, I would much rather push one QWORD onto the stack instead of nine. Anything else will lead to exe bloat.

Re: Invoke

Posted: 30 Aug 2018 22:20
by vitsoft
ar18 wrote: 30 Aug 2018 14:58 After the call is made to the function, per the Win64 ABI it must then copy rcx,rdx,r8, and r9 into the SHADOW SPACE at the start of that function.
I didn't know that. Agner Fog in book 5, page 19, says regarding 64-bit Windows: "Since the shadow space is owned by the called function, it is safe to use these 32 bytes of shadow space for any purpose by the called function."
ar18 wrote: If the only non-volatile register I use is RBX, I would much rather push one QWORD onto the stack instead of nine. Anything else will lead to exe bloat.
Sure, and it costs only PUSH RBX and POP RBX, which is not much more bother than adding something like USES=EBX to the operands of macro Procedure.
X64 yields enough scratch registers, so I don't think implementation of USES= is necessary. Some other problems to solve would arise:
  1. Syntax of the list of stored registers. Perhaps something like StrCopy Procedure Src,Dest,Size, USES=RSI:RDI:RBX
  2. Propagation of the list from Procedure macro to EndProcedure.
  3. Parsing and reversing the array of registers at assembly time.
  4. Allow other than callee-save registers, too?
  5. What if user specifies nonpushable register, e.g. USES=BX:MMX7:YMM15
  6. Save only the lowest 64 bits of SIMD register?

Re: Invoke

Posted: 31 Aug 2018 17:41
by ar18
vitsoft wrote: 30 Aug 2018 22:20
ar18 wrote: 30 Aug 2018 14:58 After the call is made to the function, per the Win64 ABI it must then copy rcx,rdx,r8, and r9 into the SHADOW SPACE at the start of that function.
I didn't know that. Agner Fog in book 5, page 19, says regarding 64-bit Windows: "Since the shadow space is owned by the called function, it is safe to use these 32 bytes of shadow space for any purpose by the called function."
Why would I care what Agner Fog says, when the author of the Win64 ABI spec, Microsoft, says at https://msdn.microsoft.com/en-us/library/ms235286.aspx that "Space is allocated on the call stack as a shadow store for callees to save those registers. There is a strict one-to-one correspondence between the arguments to a function call and the registers used for those arguments"?

Can one ignore what Microsoft and the Win64 ABI say, and use the shadow space any way you please as Agner Fog suggests, and not have any problems? Probably, so long as you were consistent in how you save/address the values passed to your function in the rcx, rdx, r8, and r9 registers. But why have a shadow space to begin with? Why not ignore that too? Could you make it work? Certainly you could -- Agner Fog does it. I too don't have to follow any spec to make my program work, but I do need to follow the spec if I want my program to work with someone else's libraries or code. More importantly, if you want your program to work with modern debuggers, you would follow the spec and not Agner Fog's suggestion.

It is a very stupid thing to save parameters to a register, only to move them onto the stack. That wastes cycles and memory. The question then is, why ignore the ABI and waste that shadow space for something else, like Agner Fog suggests, when you could also ignore the ABI and not create a shadow space to begin with? You will be in big trouble if you don't save the first four parameters from the registers to somewhere, but where to save the register parameters? How about to the shadow space as suggested by the ABI, which is the only reason the shadow space even exists? Or should we instead just ignore the spec and slice and dice our program however we like?

Either save parameters to the registers (like a fastcall does) or move them to the stack (like cdecl does) but don't do both at the same time ... that is, unless you have a non-stupid reason for doing so. Does Microsoft have a good reason for doing this stupid thing? Yes. For one, they partially followed AMD's ABI spec. Two, ever notice how Win64 officially has no fastcall calling convention? They only have stdcall and vector. There is no fastcall; you can specify it in C/C++, but the compiler will ignore it and replace it with stdcall. You can only completely ignore the shadow space if you use your function like an actual real-life fastcall does (where parameters are only passed via registers and the stack is completely unused except for the return address). The Win64 stdcall is not a fastcall or a cdecl, it is a hybrid between a fastcall and cdecl. That is the only reason why the shadow space exists.

The exception to this argument is when you pass three or fewer parameters to a function. Then I would say Agner Fog finally got it right, because any unused reserved stack locations in the shadow space can be used as local storage and would affect nothing else, not even a debugger.
vitsoft wrote: 30 Aug 2018 22:20
ar18 wrote: If the only non-volatile register I use is RBX, I would much rather push one QWORD onto the stack instead of nine. Anything else will lead to exe bloat.
Sure, and it costs only PUSH RBX and POP RBX, which is not much more bother than adding something like USES=EBX to the operands of macro Procedure.
Not exactly. A properly constructed USES=EBX would PUSH rbx at the beginning of my function, and would also replace every ret in it with
pop rbx
ret

vitsoft wrote: 30 Aug 2018 22:20 X64 yields enough scratch registers, so I don't think implementation of USES= is necessary.
Here is a list of all the non-volatile registers in Win64: rbx, rbp, rsi, rdi, r12, r13, r14, r15, and xmm6 through xmm15 (rsp must of course be preserved too). That isn't a small list and they are provided for your perusal, so why not support their usage?
vitsoft wrote: 30 Aug 2018 22:20 Some other problems to solve would arise:
  1. Syntax of the list of stored registers. Perhaps something like StrCopy Procedure Src,Dest,Size, USES=RSI:RDI:RBX
  2. Propagation of the list from Procedure macro to EndProcedure.
  3. Parsing and reversing the array of registers at assembly time.
  4. Allow other than callee-save registers, too?
  5. What if user specifies nonpushable register, e.g. USES=BX:MMX7:YMM15
  6. Save only the lowest 64 bits of SIMD register?
  1. I wouldn't worry about syntax too much, just so long as you are consistent with your syntax usage elsewhere.
  2. Well they don't call it "work" for nothing.
  3. Ditto.
  4. It wouldn't be smart but it wouldn't hurt anything either.
  5. Flag it as an obvious error. SIMD registers cannot be pushed onto the stack.
  6. Save the lowest 16 bytes, i.e. only the XMM part of the register, per the ABI.
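The points above (parsing the list, reversing it for the epilogue, rejecting registers that cannot be saved) can be sketched in a few lines. The Python below is only a model of what a Procedure/EndProcedure macro pair might emit. The function name `uses_plan` is hypothetical, the colon-separated syntax follows the USES=RSI:RDI:RBX suggestion in the thread, and saving a full 128-bit XMM register with movdqu is an assumption of this sketch.

```python
GPR64 = {"RAX", "RBX", "RCX", "RDX", "RSI", "RDI", "RBP",
         "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15"}
XMM = {f"XMM{i}" for i in range(16)}


def uses_plan(spec):
    """Turn 'RSI:RDI:RBX' into (prologue, epilogue) instruction lists.

    The epilogue restores registers in the reverse order of the prologue.
    Anything that is not a 64-bit GPR or an XMM register is flagged as an error.
    """
    prologue, epilogue = [], []
    for reg in spec.upper().split(":"):
        name = reg.lower()
        if reg in GPR64:
            prologue.append(f"push {name}")
            epilogue.insert(0, f"pop {name}")
        elif reg in XMM:
            # SIMD registers cannot be pushed, so make room and store instead.
            prologue += ["sub rsp,16", f"movdqu [rsp],{name}"]
            epilogue[:0] = [f"movdqu {name},[rsp]", "add rsp,16"]
        else:
            raise ValueError(f"cannot save register {reg}")
    return prologue, epilogue


pro, epi = uses_plan("RSI:RDI:RBX")
print(pro)
print(epi)
```

The epilogue list is what would be emitted in front of every ret in the procedure. Calling `uses_plan("BX:MMX7:YMM15")` raises an error at the first entry, which corresponds to flagging the operand at assembly time.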

Re: Invoke

Posted: 31 Aug 2018 18:27
by ar18
ar18 wrote: 31 Aug 2018 17:41 It is a very stupid thing to save parameters to a register, only to move them onto the stack. That wastes cycles and memory...Either save parameters to the registers (like a fastcall does) or move them to the stack (like cdecl does) but don't do both at the same time
My 64-bit paint program has 86 built-in functions and 150 calls to external dlls. As you have pointed out, I can do anything I want with my code to make it work, but as I pointed out, you must conform to whatever standard exists for external calls. By eliminating 86 stdcall calling conventions, my program would probably be about 10k smaller and run 20% faster if I could use cdecl instead of stdcall. None of the assemblers I have used allow me to do this, that is, not if I want to use their directives and macros (hard to do with the PROC directive). Those directives and macros save a lot of tedious typing and make the program so much easier to read, so I prefer using directives and macros as a convenience, just like every other sensible programmer does. I would like to suggest that for Win64 support in €ASM, could you provide a fastcall, cdecl, stdcall, and vector directive/macro for the calling convention, or at least make it possible so that a macro could be written by users to do this? Fastcall and cdecl are currently possible, but stdcall and vector are not...or tell me where the code for that functionality is located (to save me time looking for it) so I can write it myself if need be.

Re: Invoke

Posted: 31 Aug 2018 22:59
by vitsoft
ar18 wrote: 31 Aug 2018 18:27 I would like to suggest that for Win64 support in €ASM, could you provide a fastcall, cdecl, stdcall, and vector directive/macro for the calling convention, or at least make it possible so that a macro could be written by users to do this?
I think any convention is already possible at macro level. Macroinstructions in €ASM accept any piece of information as operands; the problem is with Intel instructions (PUSH %1, MOV RCX,%1), which are fastidious, so we must parse %1.
Good news: the attribute operator REGTYPE# already works, and I hope to release it soon. The presence of a SIMD register in an operand can then be checked with
%IF REGTYPE# %1 = 'X' ; If it is XMM register.
MOVQ XMM0, %1 ; Copy the low 64 bits between XMM registers.
%ELSE
MOV RCX, %1
%ENDIF
With immediate float numbers it is more tedious; the operand would have to be checked for the presence of a decimal point:
%NrOfChars %SETS %1
%IsFloat %SETB false
i %FOR 1..%NrOfChars
%IF "%1[%i]" === "."
%IsFloat %SETB true
%ENDIF
%ENDFOR
%IF %IsFloat
MOVD XMM0, [=DD %1]
%ELSE
MOV RCX,%1
%ENDIF

You can clone already existing macros WinAPI or Invoke and modify them as you wish.

Re: Invoke

Posted: 01 Sep 2018 02:15
by ar18
vitsoft wrote: 31 Aug 2018 22:59
ar18 wrote: 31 Aug 2018 18:27 I would like to suggest that for Win64 support in €ASM, could you provide a fastcall, cdecl, stdcall, and vector directive/macro for the calling convention, or at least make it possible so that a macro could be written by users to do this?
I think any convention is already possible at macro level.
That's one of the things that makes your €ASM assembler such a great idea. Nothing is hidden. Everything is documented extremely well. You can easily customize it to your heart's content. It is very up-to-date. Macro scripting is comprehensive and powerful. It just doesn't properly support the Win64 ABI ... yet.
vitsoft wrote: 31 Aug 2018 22:59 Macroinstructions in €ASM accept any piece of information as operands; the problem is with Intel instructions (PUSH %1, MOV RCX,%1), which are fastidious, so we must parse %1.
Yes, that, and making sure the operations are done in the proper order and in the proper places of the program.
vitsoft wrote: 31 Aug 2018 22:59 Good news: the attribute operator REGTYPE# already works, and I hope to release it soon. The presence of a SIMD register in an operand can then be checked with

%IF REGTYPE# %1 = 'X' ; If it is XMM register.
MOVQ XMM0, %1 ; Copy the low 64 bits between XMM registers.
%ELSE
MOV RCX, %1
%ENDIF
Great!
vitsoft wrote: 31 Aug 2018 22:59 With immediate float numbers it is more tedious; the operand would have to be checked for the presence of a decimal point:

%NrOfChars %SETS %1
%IsFloat %SETB false
i %FOR 1..%NrOfChars
%IF "%1[%i]" === "."
%IsFloat %SETB true
%ENDIF
%ENDFOR
%IF %IsFloat
MOVD XMM0, [=DD %1]
%ELSE
MOV RCX,%1
%ENDIF
I told you not to worry about that. The Intel processor doesn't support that so why should you? Heck the AMD processor doesn't even support 64-bit imm integers for some instructions that use 64-bit operands. With that said, let me tell you what I suggested to Jeremy for GoASM:

I thought the following would be far more helpful than anything I've said so far:

;First four params:
sgl reg      movaps xmm,xmm
dbl reg      movapd xmm,xmm
sgl mem      movss xmm,d[val]
dbl mem      movsd xmm,q[val]

all else handled by mov rcx - r9, reg/imm/mem

;All subsequent params:
sgl reg      movd eax,xmm --> push rax
dbl reg      movq rax,xmm --> push rax
sgl mem      pushq [val]
dbl mem      pushq [val]

all else handled by push 64-bit reg/imm/mem

where sgl  = single-precision
      dbl  = double-precision
      mem  = memory
      imm  = immediate
      int  = 32-bit integer
      long = 64-bit integer

All other combinations should be handled by the programmer and not the compiler.

;First four params can contain:
int imm      mov eax,imm -> cvtsi2ss xmm,eax
long imm     mov rax,imm -> cvtsi2sd xmm,rax
sgl imm      mov eax,imm -> movd xmm,eax
dbl imm      mov rax,imm -> movq xmm,rax

;All subsequent params:
int imm      mov eax,imm -> push rax
long imm     mov rax,imm -> push rax ;beware of possibility of Intel to "accidentally" sign extend certain values.
sgl imm      mov eax,imm -> push rax
dbl imm      mov rax,imm -> push rax ;see note for long imm above


I don't like any of that for the immediates. It is much simpler (and faster, with fewer bytes) to go like this:

dblVal D real8 1.234e5
invoke fnTest,dblVal,..


Which would then become movq xmm0,dblVal
vitsoft wrote: 31 Aug 2018 22:59 You can clone already existing macros WinAPI or Invoke and modify them as you wish.
Have done that already but am waiting on the capability for €ASM to support real4 and real8 data types so I can test for their presence before moving them to registers (for invoke) or to the stack (for the shadow space).

Re: Invoke

Posted: 04 Sep 2018 02:11
by ar18
From the Win64 ABI: "Even if the called function has fewer than 4 parameters, these 4 stack locations are effectively owned by the called function, and may be used by the called function for other purposes besides saving parameter register values. Thus the caller may not save information in this region of stack across a function call". RE: https://docs.microsoft.com/en-us/cpp/bu ... ew=vs-2017.

Re: Invoke

Posted: 04 Sep 2018 16:45
by vitsoft
Yes, callee is the owner of shadow space (ShSp), from the moment when it's been called.
The red quotation says that I (in the role of WinABI caller), having just allocated ShSp, cannot save anything to it, then call the function and expect that the saved information survived intact in ShSp across the call. Nobody argues about that.

It's irrelevant whether I allocate the ShSp with SUB RSP,32 or with PUSH PUSH PUSH PUSH.
Microsoft (when their function is called) expects undefined garbage in ShSp, and populates it with the first four arguments. They do not care if my implementation of Invoke has already overwritten the expected garbage with values of first four arguments, they will do it anyway. No conflict on the scene.

Re: Invoke

Posted: 05 Sep 2018 22:59
by ar18
vitsoft wrote: 04 Sep 2018 16:45 Yes, callee is the owner of shadow space (ShSp), from the moment when it's been called.
The red quotation says that I (in the role of WinABI caller), having just allocated ShSp, cannot save anything to it, then call the function and expect that the saved information survived intact in ShSp across the call. Nobody argues about that.

It's irrelevant whether I allocate the ShSp with SUB RSP,32 or with PUSH PUSH PUSH PUSH.
Microsoft (when their function is called) expects undefined garbage in ShSp, and populates it with the first four arguments. They do not care if my implementation of Invoke has already overwritten the expected garbage with values of first four arguments, they will do it anyway. No conflict on the scene.
sub rsp,32 only takes up four bytes, whereas four memory pushes take up 28 bytes. All those unnecessary pushes are what is called do-nothing code and exe bloat. They are do-nothing code because they repeat something done somewhere else, and they are bloat because each invoke adds 24 bytes with nothing gained.

And nothing is expected in the shadow space after invoke. It is reserved by the caller and is by definition uninitialized data until the procedure saves (register) values to it.