EuroAssembler Index Manual Download Source Macros


Sitemap Links Forum Tests Projects

String32.htm
Macros
Compare$
Concat$
DecodeUTF8
EncodeUTF8
GetLength$

This file can be included to 32bit programs written in Euro Assembler.
It contains OS-independent macros for operations with zero-terminated (ASCIIZ) strings in ANSI or WIDE (Unicode) encoding.
Unicode string must always be word aligned and terminated with the zero UNICHAR (word).

Macros may crash the process when the input string is not properly zero-terminated and the following memory is not available for reading.

All functions expect zeroed direction flag on input and they do not change it.

ANSI or WIDE functionality is selected by the current EUROASM UNICODE= boolean option. Its value is available in system variable %^UNICODE.

Similar macros with identical names for different program width are defined in string16.htm and string64.htm.


 string32 HEAD
↑ GetLength$ String, Unicode=%^UNICODE
This macro returns the size of the string in bytes.
Input
String is pointer to a zero terminated string of ANSI or WIDE characters. It may also be a literal string.
Unicode= is a logical parameter which tells whether the strings are ANSI or WIDE. By default (if not specified) it is the same as the system preprocessing variable %^UNICODE at invocation time.
Output
ECX= size of the string without the terminating NUL in bytes. The result is even for WIDE strings.
Example
SomeText DU "Hi!",0 ; SomeText was defined as WIDE characters, regardless of current UNICODE status. ; SIZE# SomeText is 8, memory contains 4800_6900_2100_0000h EUROASM UNICODE=YES GetLength$ SomeText ; ECX is now 6 (3 nonzero UNICHARS). EUROASM UNICODE=NO GetLength$ SomeText ; ECX is now 1 (1 nonzero BYTE).
GetLength$ %MACRO String, Unicode=%^UNICODE
     PUSHD %String
     %IF %^Unicode
       CALL GetLength$W@RT::
       GetLength$W@RT:: PROC1
         PUSHD EAX,EDI
           SUB EAX,EAX
           XOR ECX,ECX
           MOV EDI,[ESP+12] ; Pointer to String$.
           DEC ECX
           REPNE SCASW
           NOT ECX
           DEC ECX
           SHL ECX,1
         POP EDI,EAX
         RET 4
       ENDPROC1 GetLength$W@RT::
     %ELSE ; Not %Unicode.
      CALL GetLength$A@RT::
       GetLength$A@RT:: PROC1
          PUSHD EAX,EDI
            XOR EAX,EAX
            XOR ECX,ECX
            MOV EDI,[ESP+12] ; Pointer to String$.
            DEC ECX
            REPNE SCASB
            NOT ECX
            DEC ECX
         POP EDI,EAX
         RET 4
       ENDPROC1 GetLength$A@RT::
     %ENDIF
    %ENDMACRO GetLength$
 
↑ Concat$ Destination, Size=, Source1, Source2,,, Unicode=%^UNICODE
Macro will concatenate one or more zero-terminated source strings (ANSI or WIDE) to a destination string.
Input
Destination is a pointer to memory where the result of concatenation will be stored as zero-terminated string.
Size= is the size in bytes allocated for the output destination buffer including the zero terminator. By default it is set to SIZE# %Destination.
Source* operands are pointers to the strings which are to be concatenated. The first one (Source1) may be identical with the destination, when we need to append something to an existing string.
Unicode= is a logical parameter which tells whether the strings are ANSI or WIDE. By default (if not specified) it is the same as the system preprocessing variable %^UNICODE at invocation time.
Unicode should be FALSE when the concatenated string are in UTF-8.
Output
CF=0, Destination is filled with concatenation, all registers are preserved.
CF=1 when the Size= is not long enough. The output buffer Size is never exceeded.
Example
Concat$ FullName$,Path$,FileName$,=".htm"
Concat$ %MACRO Destination, Source1,Source2,,, Size=, Unicode=%^UNICODE
   %IF %# < 2                                                                                  ; >>
     %ERROR ID=5930, 'Missing operand of macro "Concat$".'
     %EXITMACRO Concat$
   %ENDIF
   PUSH EBP      ; Variable number of arguments uses a special stack frame.
     MOV EBP,ESP ; Store stack pointer.
     ArgNr %FOR %#..2, STEP= -1
       PUSHD %*{%ArgNr} ; All Source pointers, starting with the last.
     %ENDFOR ArgNr
     PUSHD %# - 1 ; Number of Source strings to concatenate.
     %IF "%Size" === ""
       PUSHD SIZE# %Destination
     %ELSE
       PUSHD %Size
     %ENDIF
     PUSHD %Destination
     %IF %Unicode
       CALL Concat$W@RT::
Concat$W@RT:: PROC1
       PUSHAD
         MOV EBP,ESP
         MOV EDI,[EBP+36]    ; %Destination.
         MOV EDX,[EBP+40]    ; %Size.
         MOV ECX,[EBP+44]    ; Number of Source strings.
         LEA EDX,[EDI+EDX-2] ; End of allocated Destination.
         XOR EAX,EAX
    .20: MOV ESI,[EBP+48]    ; Pointer to %Source.
    .30: LODSW
         TEST EAX            ; Check if it's end of source string.
         JZ .40:
         CMP EDI,EDX         ; Check if it's end of destination string.
         CMC
         JC .80:             ; If destination size overflowed.
         STOSW
         JMP .30:
    .40: ADD EBP,4           ; The next Source pointer on stack frame.
         LOOP .20:
    .80: MOV AX,0            ; Finally zero-terminate the destination.
         STOSW
       POPAD
       RET                   ; CF=overflow
      ENDPROC1 Concat$W@RT::
     %ELSE                   ; If not %Unicode.
       CALL Concat$A@RT::
Concat$A@RT:: PROC1
       PUSHAD
         MOV EBP,ESP
         MOV EDI,[EBP+36]    ; %Destination.
         MOV EDX,[EBP+40]    ; %Size.
         MOV ECX,[EBP+44]    ; Number of Source strings.
         LEA EDX,[EDI+EDX-1] ; End of allocated Destination.
         XOR EAX,EAX
    .20: MOV ESI,[EBP+48]    ; Pointer to %Source.
    .30: LODSB
         TEST EAX            ; Check if it's end of source string.
         JZ .40:
         CMP EDI,EDX         ; Check if it's end of destination string.
         CMC
         JC .80:             ; If destination size overflowed.
         STOSB
         JMP .30:
    .40: ADD EBP,4           ; The next Source pointer on stack frame.
         LOOP .20:
    .80: MOV AL,0            ; Finally zero-terminate the destination.
         STOSB
       POPAD
       RET                   ; CF=overflow
      ENDPROC1 Concat$A@RT::
     %ENDIF
     MOV ESP,EBP             ; Restore the stack.
   POP EBP
  %ENDMACRO Concat$
↑ Compare$ String1, String2, Unicode=%^UNICODE
Compare two zero-terminated ANSI or WIDE strings.
Input
String1 is pointer to the first ANSI or WIDE zero-terminated strings. ESI is assumed when the first operand is omitted.
String2 is pointer to the second ANSI or WIDE zero-terminated strings. EDI is assumed when the second operand is omitted.
Unicode= is a logical parameter which tells whether the strings are ANSI or WIDE. By default (if not specified) it is the same as the system preprocessing variable %^UNICODE at invocation time.
In non-flat memory model DS=ES are assumed to specify segment of both strings.
Output
ZF=1 if both string are identical,
ZF=0 otherwise.
Compare$ %MACRO String1, String2, Unicode=%^UNICODE
   %IF "%String2" === ""
      PUSHD EDI
   %ELSE
      PUSHD %String2
   %ENDIF
   %IF "%String1" === ""
      PUSHD ESI
   %ELSE
      PUSHD %String1
   %ENDIF
   %IF %Unicode
     CALL Compare$W@RT::
Compare$W@RT:: PROC1
       PUSHAD
         MOV EBP,ESP
         SUB EAX,EAX
         SUB ECX,ECX
         MOV EDI,[EBP+40] ; %String2.
         DEC ECX
         MOV EBX,EDI
         REPNE:SCASW      ; Search for the terminator.
         SUB EDI,EBX      ; Size of String2 in bytes including the UNICHAR NUL.
         MOV EDX,EDI
         MOV EDI,[EBP+36] ; %String1.
         MOV ESI,EDI
         REPNE:SCASW      ; Search for the terminator.
         MOV ECX,EDI
         SUB ECX,ESI      ; Size of %String1 in bytes including the UNICHAR NUL.
         CMP ECX,EDX      ; Compare string sizes.
         JNE .90          ; If sizes do not match.
         MOV EDI,EBX      ; String2.
         REPE CMPSB
   .90:POPAD
       RET 2*4
      ENDPROC1 Compare$W@RT::
     %ELSE                ; If not %Unicode.
       CALL Compare$A@RT::
Compare$A@RT:: PROC1
       PUSHAD
         MOV EBP,ESP
         SUB EAX,EAX
         SUB ECX,ECX
         MOV EDI,[EBP+40] ; %String2.
         DEC ECX
         MOV EBX,EDI
         REPNE:SCASB      ; Search for the terminator.
         SUB EDI,EBX      ; Size of String2 in bytes including the NUL.
         MOV EDX,EDI
         MOV EDI,[EBP+36] ; %String1.
         MOV ESI,EDI
         REPNE:SCASB      ; Search for the terminator.
         MOV ECX,EDI
         SUB ECX,ESI      ; Size of %String1 in bytes including the NUL.
         CMP EDX,ECX      ; Compare string sizes.
         JNE .90          ; If sizes do not match.
         MOV EDI,EBX      ; String2.
         REPE CMPSB
   .90:POPAD
       RET 2*4
      ENDPROC1 Compare$A@RT::
     %ENDIF
  %ENDMACRO Compare$
DecodeUTF8 Source, CallbackProc, Size=-1, Width=16

Macro DecodeUTF8 converts Source UTF-8 string to UTF-16 or UTF-32 string.
Source string is either zero-terminated, or its Size= must be specified. Conversion stops at NUL byte, which is not converted to output. Input never reads beyond Source+Size.

If Byte Order Mark (BOM, 0xEF,0xBB,0xBF ) is detected at the beginning of the Source string, it is ignored.

Invalid UTF-8 sequence will send a replacement character 0xFFFD to the output.

Byte order in output encoding is always LittleEndian, the same which is used in MS Windows WIDE functions.

If you want to produce UTF-16BE, perform XCHG AL,AH in CallbackProc.
If you want to produce UTF-32BE, perform BSWAP EAX in CallbackProc.
If you want to prefix the output string with BOM, store it to destination buffer before invoking DecodeUTF8.
If you don't like replacement characters (usually displayed as little squares ), filter them out in CallbackProc.
Documented
[UTF8], [UTF16], [UTF32]
Input
Source is pointer to the first byte of UTF8-encoded string.
Size= -1 is the maximal possible size of source string in bytes. It may be left on default when the Source string is terminated with NUL byte (this NUL is not written to output).
If Size is not -1, exactly that many input bytes are decoded, including NUL bytes.
Width=16 or 32 specifies the output encoding UTF-16 or UTF-32, respectively.
CallbackProc is pointer to the procedure which stores one converted character.
CallbackProc
is called with register calling convention.
It is expected to store the UTF-16 or UTF-32 character obtained in EAX and return with CF=0.
Input
CF=DF=0
EAX= one converted character encoded in UTF-16 or UTF-32.
EAX may contain surrogate code 0x0000_D800..0x0000_DFFF when the input UTF-8 character belongs to Unicode supplementary planes (Emoji, Asian characters etc).
EAX may also contain the replacement 0x0000_FFFD when the input UTF-8 string is malformed.
EBP= original value of EBP on input to the macro. Usually it is the frame pointer of the function which expanded DecodeUTF8, thus arguments and local variables of the function can be used in CallbackProc.
EDI= original value of EDI on input to the macro. It may be used in STOS, incremented value of EDI will be supplied on the next invocation of CallbackProc.
EBX,ECX,EDX,ESI should be considered undefined.
Output
CF=0 when a character from EAX was successfully stored to destination buffer by CallbackProc and macro should continue with parsing the next UTF-8 characters.
CF=1 signalizes that the macro should cancel further conversion. CF propagates to the output of DecodeUTF8.
EAX,EBX,ECX,EDX,ESI,EDI may be changed in Callback procedure. The value of
EDI will be saved and provided in the next call of CallbackProc, thus EDI can be used in CallbackProc as an output pointer for decoded data.
Output
CF=0, ECX=number of unprocessed bytes at the end of text (0..3) due to incompleteness of the last UTF-8 character in input text block. The caller of DecodeUTF8 should seek the input file by ECX bytes back before reading the next block of text.
EDI= as returned from CallbackProc.
All other registers are preserved.
Example
MOV EDI,DestString ; It should be long enough for the decoded string. DecodeUTF8 SourceString, Store: Store:PROC1 ; Thanks to using PROC1 instead of PROC it doesn't need bypass by JMP. STOSW ; Store UTF-16 character from AX and advance EDI to the next free room. RET ; Return to DecodeUTF-8 macro with CF=0. ENDPROC1 Store: SUB EAX,EAX STOSW ; Zero-terminate DestString. Now it can be used in TextOutW, MessageBoxW etc.
DecodeUTF8 %MACRO Source, CallbackProc, Size=-1, Width=16
    %IF %Width != 16 && %Width != 32
      %ERROR ID=5932,'Macro "DecodeUTF8" requires Width=16 or Width=32.'
      %EXITMACRO DecodeUTF8
    %ENDIF
    PUSHD %Width, %Size, %CallbackProc, %Source
    CALL DecodeUTF8@RT::
DecodeUTF8@RT:: PROC1
    PUSHAD
      SUB ECX,ECX
      MOV [ESP+24],ECX  ; Initialize %ReturnECX to 0.
      MOV EDI,[ESP+36]  ; %Source.
      MOV ECX,[ESP+44]  ; %Size.
      MOV ESI,EDI
      MOV EAX,ECX
      INC EAX
      JZ .Scan:         ; If Size=-1, EAX=0 and the Source size will be scanned.
      LEA EDI,[ESI+ECX] ; Otherwise use the explicit %Size.
      JMP .No0:
.Scan:REPNE:SCASB
      JNE .No0:
      DEC EDI          ; Omit the terminator from conversion.
.No0: ; Source string without NUL is now at ESI..EDI.
 BOM  %FOR 0xEF,0xBB,0xBF ; Little-Endian BOM (0xFEFF) encoded in UTF-8.
        CMP ESI,EDI
        JNB .NoBOM:
        LODSB
        CMP AL,%BOM
        JNE .NoBOM:
      %ENDFOR BOM
      JMP .Start:       ; BOM was detected, ESI is advanced just behind it.
.NoBOM:MOV ESI,[ESP+36] ; No BOM detected, restore ESI to the start of Source again.
.Start:CMP ESI,EDI
      JNB .End:
      XOR EBX,EBX
      LODSB
      MOV BL,AL
      NOT BL
      BSR ECX,EBX ; Scan bits 7..0 of inverted first byte of 1,2,3,4 bytes long UTF-8 character.
      MOV BL,AL   ; First byte of 1,2,3,4 bytes long UTF-8 character (not inverted).
      MOV DL,0x7F ; Prepare mask for payload bits in the 1st UTF-8 byte.
      SUB ECX,7   ; ECX=7,5,4,3 change to ECX=0,-2,-3,-4.
      JZ .Out:    ; When EBX is codepoint 0..0x7F (7bit ASCII character).
      NEG ECX     ; ECX=2,3,4 (number of bytes in UTF-8 character).
      SHR DL,CL   ; DL=0x1F,0x0F,0x07 is payload mask.
      AND BL,DL   ; EBX will accumulate payload bits of codepoint.
      CMP CL,2
      JB .Bad:
      CMP CL,4
      JBE .Good:
.Bad: MOV EAX,0xFFFD ; Invalid UTF-8 detected, output the replacement.
      JMP .NoSg:
.Good:DEC ECX     ; ECX=1, 2 or 3 continuation bytes expected.
      LEA EAX,[ESI+ECX]
      CMP EAX,EDI ; Check if there's that many input bytes left.
      JBE .Cont:
      DEC ESI     ; Rollback, the last UTF-8 character is incomplete.
      SUB EDI,ESI ; EDI characters (1..3) were not decoded.
      MOV [ESP+24],EDI ; %ReturnECX.
      JMP .End:   ; CF=0.

.OutEAX: ; PROC1             ; Send EAX to callback. Preserves EBX,ESI,EDI, updates ReturnEDI.
          PUSH EBX,ESI,EDI
            MOV EDI,[ESP+16]  ; ReturnEDI restore.
            CALL [ESP+56]     ; CallbackProc.
            MOV [ESP+16],EDI  ; ReturnEDI update.
          POP EDI,ESI,EBX
          RET
       ; ENDPROC1 .OutEAX:

.Cont:LODSB       ; Continuation byte AL=10xxxxxxb expected.
      BTR EAX,7   ; Reset the marker bit 7.
      JNC .Bad:
      BTR EAX,6
      JC .Bad:
      SHL EBX,6   ; Make room in EBX for the next 6 bits.
      OR  BL,AL   ; Accumulate them.
      DEC ECX
      JNZ .Cont:
.Out: MOV EAX,EBX ; EAX=EBX is now the decoded codepoint 0..0x10_FFFF.
      ; Check for overlong encodings. DL=0x7F,0x1F,0x0F,0x07 for 1,2,3,4 bytes in UTF-8 character.
      CMP EBX,0x01_0000   ; Codepoint 0x01_0000..0x10_FFFF should be encoded in 4 bytes.
      JAE .NoOverlong:
      CMP EBX,0x00_0800   ; Codepoint 0x00_0800..0x00_FFFF should be encoded in 3 bytes.
      JB .2Bts:
      CMP DL,0x0F
      JE  .NoOverlong:
      JMP .Bad:
.2Bts:CMP EBX,0x00_0080   ; Codepoint 0x00_0080..0x00_07FF should be encoded in 2 bytes.
      JB .1Bts:
      CMP DL,0x1F
      JE .NoOverlong:
      JMP .Bad:
.1Bts:CMP DL,0x7F         ; Codepoint 0x00_0000..0x00_007F should be encoded in 1 byte.
      JE .NoOverlong:
      TEST EBX
      JNZ .Bad:
      CMP DL,0x1F         ; Exception: codepoint 0 may be encoded in 1 or 2 bytes.
      JNE .Bad:
.NoOverlong:
      SHR EBX,11          ; Check for surrogate codepoints.
      CMP BL,0x1B
      JE .Bad:            ; Do not accept surrogates 0xD800..0xDFFF from input.
      TEST BX,0x3E0
      JZ .NoSg:           ; If codepoint EAX is below 0x0001_0000, surrogates do not apply.
      CMPD [ESP+48],16    ; Output UTF %Width (16 or 32).
      JNE .NoSg:          ; UTF-32 does not need surrogates.
      SUB EAX,0x0001_0000 ; Codepoint EAX was not encodable in one UTF-16 character.
      MOV EBX,0x0000_03FF ; Use two surrogate Unichars.
      AND EBX,EAX
      SHR EAX,10
      ADD EBX,0x0000_DC00 ; EBX is now low surrogate.
      ADD EAX,0x0000_D800 ; EAX is now high surrogate.
      CALL .OutEAX:       ; High surrogate first.
      MOV EAX,EBX         ; Low surrogate.
      JC .End:            ; If aborted by CallbackProc.
.NoSg:CALL .OutEAX:       ; Low surrogate or BMP codepoint or UTF-32.
      JNC .Start:         ; Parse the next UTF-8 character from the string ESI..EDI.

.End:POPAD
     RET 4*4
   ENDP1 DecodeUTF8@RT::
 %ENDMACRO DecodeUTF8
EncodeUTF8

Macro EncodeUTF8 converts a codepoint to an UTF-8 character and stores it into the string at EDI.

Documented
[UTF8]
Input
EAX is the converted codepoint (ordinal number of the character in UNICODE chart) in the range 0..0x1F_FFFF.
DF=0, EDI is pointer to the output string. It will be incremented with each stored byte.
Output
DF=0, EDI= increased by the number of stored bytes (1..4).
All other registers are preserved.
Example
MOV EDI,DestStringUtf8 ; It should be long enough for the decoded string. MOV ESI,InputStringUtf16 Again: MOVZXW EAX,[ESI] CMP AX,0 JZ End EncodeUTF8 JMP Again:
EncodeUTF8 %MACRO
     CALL EncodeUTF8@RT:
EncodeUTF8@RT: PROC1
     CMP EAX,0x0000_0080
     JAE .10:
     STOSB
     RET
.10: PUSH EAX,ECX
      MOV ECX,EAX
      CMP EAX,0x0000_0800
      JAE .30:
      SHR EAX,6
      OR AL,0xC0
.20:  STOSB
      MOV EAX,ECX
      AND AL,0xBF
      OR AL,0x80
      STOSB
     POP ECX,EAX
     RET
.30:  CMP EAX,0x0001_0000
      JAE .40:
      XCHG AL,AH
      SHR AL,4
      OR AL,0xE0
      STOSB
      MOV EAX,ECX
      SHL EAX,2
      XCHG AL,AH
      AND AL,0xBF
      OR  AL,0x80
      JMP .20:
.40:  SHR EAX,18
      AND AL,0xF7
      OR AL,0xF0
      STOSB
      MOV EAX,ECX
      SHR EAX,12
      AND AL,0xBF
      OR AL,0x80
      STOSB
      MOV EAX,ECX
      SHR EAX,6
      AND AL,0xBF
      OR AL,0x80
      JMP .20:
    ENDP1 EncodeUTF8@RT:
   %ENDMACRO EncodeUTF8
  ENDHEAD string32

▲Back to the top▲