This file can be included in 32-bit programs written in Euro Assembler.
It contains OS-independent macros for operations with zero-terminated (ASCIIZ)
strings in ANSI or WIDE (Unicode) encoding.
A Unicode string must always be word-aligned and terminated with
a zero UNICHAR (word).
Macros may crash the process when the input string is not properly zero-terminated and the following memory is not available for reading.
All macros expect the direction flag to be clear on input and they do not change it.
ANSI or WIDE functionality is selected by the current EUROASM UNICODE= boolean option. Its value is available in the system variable %^UNICODE.
Similar macros with identical names for different program width are defined in string16.htm and string64.htm.
string32 HEAD
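Macro GetLength$ returns in ECX the size of the zero-terminated String in bytes, not counting the terminator.
Unicode= defaults to %^UNICODE at invocation time.
Example: SomeText holds the WIDE string "Hi!" followed by a zero UNICHAR, so SIZE# SomeText is 8 and memory contains 4800_6900_2100_0000h.
  EUROASM UNICODE=YES
  GetLength$ SomeText ; ECX is now 6 (3 nonzero UNICHARS).
  EUROASM UNICODE=NO
  GetLength$ SomeText ; ECX is now 1 (1 nonzero BYTE).
GetLength$ %MACRO String, Unicode=%^UNICODE
  PUSHD %String
  %IF %Unicode
    CALL GetLength$W@RT::
GetLength$W@RT:: PROC1
    PUSHD EAX,EDI
    SUB EAX,EAX
    XOR ECX,ECX
    MOV EDI,[ESP+12] ; Pointer to String$.
    DEC ECX ; ECX=-1, no limit for the scan.
    REPNE SCASW
    NOT ECX
    DEC ECX ; Number of UNICHARs without the terminator.
    SHL ECX,1 ; Convert UNICHARs to bytes.
    POP EDI,EAX
    RET 4
  ENDPROC1 GetLength$W@RT::
  %ELSE ; Not %Unicode.
    CALL GetLength$A@RT::
GetLength$A@RT:: PROC1
    PUSHD EAX,EDI
    XOR EAX,EAX
    XOR ECX,ECX
    MOV EDI,[ESP+12] ; Pointer to String$.
    DEC ECX ; ECX=-1, no limit for the scan.
    REPNE SCASB
    NOT ECX
    DEC ECX ; Number of bytes without the terminator.
    POP EDI,EAX
    RET 4
  ENDPROC1 GetLength$A@RT::
  %ENDIF
%ENDMACRO GetLength$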
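The two results of the GetLength$ example can be checked with a small model. This is a Python sketch, not EuroAssembler, and the name `get_length` is hypothetical; it only mirrors what the REPNE SCASW and REPNE SCASB scans compute on the same eight bytes of memory.

```python
def get_length(memory: bytes, unicode: bool) -> int:
    """Model of GetLength$: size in bytes up to, not including, the terminator."""
    step = 2 if unicode else 1  # a UNICHAR is a 16-bit word, an ANSI char is a byte
    for offset in range(0, len(memory) - step + 1, step):
        if memory[offset:offset + step] == b"\x00" * step:
            return offset
    # The real macro would keep reading memory here and might crash.
    raise ValueError("string is not zero-terminated")


# "Hi!" as little-endian UTF-16 plus a zero UNICHAR: 48 00 69 00 21 00 00 00
some_text = "Hi!".encode("utf-16-le") + b"\x00\x00"

wide_size = get_length(some_text, unicode=True)   # scans words
ansi_size = get_length(some_text, unicode=False)  # scans bytes, stops at the 2nd byte
```

The ANSI scan stops after one byte because the high byte of the first UNICHAR is already zero, which is why the same memory yields two different lengths.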
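Macro Concat$ concatenates one or more Source strings into Destination. CF is set when the Destination size would be exceeded; the output is then truncated but still zero-terminated.
Size= defaults to SIZE# %Destination.
Unicode= defaults to %^UNICODE at invocation time.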
Concat$ %MACRO Destination, Source1,Source2,,, Size=, Unicode=%^UNICODE
  %IF %# < 2 ; >>
    %ERROR ID=5930, 'Missing operand of macro "Concat$".'
    %EXITMACRO Concat$
  %ENDIF
  PUSH EBP ; Variable number of arguments uses a special stack frame.
  MOV EBP,ESP ; Store stack pointer.
ArgNr %FOR %#..2, STEP= -1
  PUSHD %*{%ArgNr} ; All Source pointers, starting with the last.
  %ENDFOR ArgNr
  PUSHD %# - 1 ; Number of Source strings to concatenate.
  %IF "%Size" === ""
    PUSHD SIZE# %Destination
  %ELSE
    PUSHD %Size
  %ENDIF
  PUSHD %Destination
  %IF %Unicode
    CALL Concat$W@RT::
Concat$W@RT:: PROC1
    PUSHAD
    MOV EBP,ESP
    MOV EDI,[EBP+36] ; %Destination.
    MOV EDX,[EBP+40] ; %Size.
    MOV ECX,[EBP+44] ; Number of Source strings.
    LEA EDX,[EDI+EDX-2] ; End of allocated Destination.
    XOR EAX,EAX
.20: MOV ESI,[EBP+48] ; Pointer to %Source.
.30: LODSW
    TEST EAX ; Check if it's end of source string.
    JZ .40:
    CMP EDI,EDX ; Check if it's end of destination string.
    CMC
    JC .80: ; If destination size overflowed.
    STOSW
    JMP .30:
.40: ADD EBP,4 ; The next Source pointer on stack frame.
    LOOP .20:
.80: MOV AX,0 ; Finally zero-terminate the destination.
    STOSW
    POPAD
    RET ; CF=overflow
  ENDPROC1 Concat$W@RT::
  %ELSE ; If not %Unicode.
    CALL Concat$A@RT::
Concat$A@RT:: PROC1
    PUSHAD
    MOV EBP,ESP
    MOV EDI,[EBP+36] ; %Destination.
    MOV EDX,[EBP+40] ; %Size.
    MOV ECX,[EBP+44] ; Number of Source strings.
    LEA EDX,[EDI+EDX-1] ; End of allocated Destination.
    XOR EAX,EAX
.20: MOV ESI,[EBP+48] ; Pointer to %Source.
.30: LODSB
    TEST EAX ; Check if it's end of source string.
    JZ .40:
    CMP EDI,EDX ; Check if it's end of destination string.
    CMC
    JC .80: ; If destination size overflowed.
    STOSB
    JMP .30:
.40: ADD EBP,4 ; The next Source pointer on stack frame.
    LOOP .20:
.80: MOV AL,0 ; Finally zero-terminate the destination.
    STOSB
    POPAD
    RET ; CF=overflow
  ENDPROC1 Concat$A@RT::
  %ENDIF
  MOV ESP,EBP ; Restore the stack.
  POP EBP
%ENDMACRO Concat$
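The overflow rule of the Concat$ runtime is easy to misread, so here is a hedged Python model of the ANSI variant (not EuroAssembler; the name `concat` and the tuple return value are assumptions made for this sketch). It reserves the last byte of Destination for the terminator and reports overflow only when a further nonzero character would not fit, as the CMP EDI,EDX / CMC / JC sequence does.

```python
def concat(size: int, *sources: bytes) -> tuple:
    """Model of Concat$A@RT: returns (destination bytes, overflow flag)."""
    limit = size - 1  # the last byte is reserved for the terminator
    out = bytearray()
    overflow = False
    for source in sources:
        for char in source:
            if char == 0:          # end of this source string
                break
            if len(out) >= limit:  # destination is full: stop, CF=1
                overflow = True
                break
            out.append(char)
        if overflow:
            break
    out.append(0)  # the destination is always zero-terminated
    return bytes(out), overflow


fits = concat(16, b"Hello, \x00", b"world\x00")
exact = concat(6, b"Hello\x00")            # five characters plus NUL fill it exactly
truncated = concat(6, b"Hello\x00", b"!\x00")
```

A string that fills the buffer exactly does not count as overflow; only the attempt to store one more character does.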
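Macro Compare$ compares two zero-terminated strings and returns ZF=1 when they are identical. String1 defaults to ESI and String2 defaults to EDI when omitted.
Unicode= defaults to %^UNICODE at invocation time.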
Compare$ %MACRO String1, String2, Unicode=%^UNICODE
  %IF "%String2" === ""
    PUSHD EDI
  %ELSE
    PUSHD %String2
  %ENDIF
  %IF "%String1" === ""
    PUSHD ESI
  %ELSE
    PUSHD %String1
  %ENDIF
  %IF %Unicode
    CALL Compare$W@RT::
Compare$W@RT:: PROC1
    PUSHAD
    MOV EBP,ESP
    SUB EAX,EAX
    SUB ECX,ECX
    MOV EDI,[EBP+40] ; %String2.
    DEC ECX
    MOV EBX,EDI
    REPNE:SCASW ; Search for the terminator.
    SUB EDI,EBX ; Size of String2 in bytes including the UNICHAR NUL.
    MOV EDX,EDI
    MOV EDI,[EBP+36] ; %String1.
    MOV ESI,EDI
    REPNE:SCASW ; Search for the terminator.
    MOV ECX,EDI
    SUB ECX,ESI ; Size of %String1 in bytes including the UNICHAR NUL.
    CMP ECX,EDX ; Compare string sizes.
    JNE .90 ; If sizes do not match.
    MOV EDI,EBX ; String2.
    REPE CMPSB
.90:POPAD
    RET 2*4
  ENDPROC1 Compare$W@RT::
  %ELSE ; If not %Unicode.
    CALL Compare$A@RT::
Compare$A@RT:: PROC1
    PUSHAD
    MOV EBP,ESP
    SUB EAX,EAX
    SUB ECX,ECX
    MOV EDI,[EBP+40] ; %String2.
    DEC ECX
    MOV EBX,EDI
    REPNE:SCASB ; Search for the terminator.
    SUB EDI,EBX ; Size of String2 in bytes including the NUL.
    MOV EDX,EDI
    MOV EDI,[EBP+36] ; %String1.
    MOV ESI,EDI
    REPNE:SCASB ; Search for the terminator.
    MOV ECX,EDI
    SUB ECX,ESI ; Size of %String1 in bytes including the NUL.
    CMP EDX,ECX ; Compare string sizes.
    JNE .90 ; If sizes do not match.
    MOV EDI,EBX ; String2.
    REPE CMPSB
.90:POPAD
    RET 2*4
  ENDPROC1 Compare$A@RT::
  %ENDIF
%ENDMACRO Compare$
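The Compare$ runtime works in two steps: it first measures both strings including their terminators, and only compares the contents when the sizes match. A minimal Python model of the ANSI variant follows (not EuroAssembler; the name `compare` is hypothetical, and the returned boolean stands in for ZF).

```python
def compare(string1: bytes, string2: bytes) -> bool:
    """Model of Compare$A@RT: True corresponds to ZF=1 (strings are identical)."""
    size1 = string1.index(0) + 1  # size including the NUL, as REPNE SCASB finds it
    size2 = string2.index(0) + 1
    if size1 != size2:            # CMP sizes / JNE .90 leaves ZF=0
        return False
    return string1[:size1] == string2[:size2]  # REPE CMPSB over size1 bytes


same = compare(b"alpha\x00", b"alpha\x00junk")  # bytes behind the NUL are ignored
shorter = compare(b"alpha\x00", b"alp\x00")
differs = compare(b"alpha\x00", b"alphb\x00")
```

Checking the sizes first means a string is never reported equal to its own prefix.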
Macro DecodeUTF8 converts the Source UTF-8 string to a UTF-16 or UTF-32 string.
The Source string is either zero-terminated, or its Size= must be specified.
Conversion stops at the NUL byte, which is not converted to output.
Input is never read beyond Source+Size.
If a Byte Order Mark (BOM, 0xEF,0xBB,0xBF) is detected at the beginning of the Source string, it is ignored.
An invalid UTF-8 sequence sends the replacement character 0xFFFD (�) to the output.
Byte order in the output encoding is always little-endian, the same as used by MS Windows WIDE functions.
If you want to produce UTF-16BE, perform XCHG AL,AH in CallbackProc.
If you want to produce UTF-32BE, perform BSWAP EAX in CallbackProc.
If you want to prefix the output string with a BOM, store it to the destination buffer before invoking DecodeUTF8.
If you don't like replacement characters (usually displayed as little squares �), filter them out in CallbackProc.
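CallbackProc receives each output character in EAX and may abort the conversion by returning CF=1.
With Width=16, EAX is a surrogate 0x0000_D800..0x0000_DFFF when the input UTF-8 character
belongs to Unicode supplementary planes (Emoji, Asian characters etc.).
EAX is 0x0000_FFFD when the input UTF-8 string is malformed.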
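The surrogate values that CallbackProc sees for supplementary-plane characters follow from the standard UTF-16 arithmetic, which DecodeUTF8@RT performs with SUB, AND, SHR and two ADD instructions. The following Python sketch (not EuroAssembler; `to_utf16_units` is a hypothetical name) reproduces that arithmetic so the values can be checked by hand.

```python
def to_utf16_units(codepoint: int) -> list:
    """Model of the Width=16 output: one unit for BMP, two surrogates otherwise."""
    if codepoint < 0x10000:
        return [codepoint]
    offset = codepoint - 0x10000      # SUB EAX,0x0001_0000
    low = 0xDC00 + (offset & 0x3FF)   # low 10 bits  -> low surrogate
    high = 0xD800 + (offset >> 10)    # high 10 bits -> high surrogate
    return [high, low]                # the high surrogate is sent first


euro = to_utf16_units(0x20AC)      # U+20AC EURO SIGN, inside the BMP
smiley = to_utf16_units(0x1F600)   # U+1F600 GRINNING FACE, supplementary plane
```

With Width=32 no such split takes place and the callback receives the codepoint itself.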
DecodeUTF8 %MACRO Source, CallbackProc, Size=-1, Width=16
%IF %Width != 16 && %Width != 32
%ERROR ID=5932,'Macro "DecodeUTF8" requires Width=16 or Width=32.'
%EXITMACRO DecodeUTF8
%ENDIF
PUSHD %Width, %Size, %CallbackProc, %Source
CALL DecodeUTF8@RT::
DecodeUTF8@RT:: PROC1
PUSHAD
SUB ECX,ECX
MOV [ESP+24],ECX ; Initialize %ReturnECX to 0.
MOV EDI,[ESP+36] ; %Source.
MOV ECX,[ESP+44] ; %Size.
MOV ESI,EDI
MOV EAX,ECX
INC EAX
JZ .Scan: ; If Size=-1, EAX=0 and the Source size will be scanned.
LEA EDI,[ESI+ECX] ; Otherwise use the explicit %Size.
JMP .No0:
.Scan:REPNE:SCASB
JNE .No0:
DEC EDI ; Omit the terminator from conversion.
.No0: ; Source string without NUL is now at ESI..EDI.
BOM %FOR 0xEF,0xBB,0xBF ; Little-Endian BOM (0xFEFF) encoded in UTF-8.
CMP ESI,EDI
JNB .NoBOM:
LODSB
CMP AL,%BOM
JNE .NoBOM:
%ENDFOR BOM
JMP .Start: ; BOM was detected, ESI is advanced just behind it.
.NoBOM:MOV ESI,[ESP+36] ; No BOM detected, restore ESI to the start of Source again.
.Start:CMP ESI,EDI
JNB .End:
XOR EBX,EBX
LODSB
MOV BL,AL
NOT BL
BSR ECX,EBX ; Scan bits 7..0 of inverted first byte of 1,2,3,4 bytes long UTF-8 character.
MOV BL,AL ; First byte of 1,2,3,4 bytes long UTF-8 character (not inverted).
MOV DL,0x7F ; Prepare mask for payload bits in the 1st UTF-8 byte.
SUB ECX,7 ; ECX=7,5,4,3 change to ECX=0,-2,-3,-4.
JZ .Out: ; When EBX is codepoint 0..0x7F (7bit ASCII character).
NEG ECX ; ECX=2,3,4 (number of bytes in UTF-8 character).
SHR DL,CL ; DL=0x1F,0x0F,0x07 is payload mask.
AND BL,DL ; EBX will accumulate payload bits of codepoint.
CMP CL,2
JB .Bad:
CMP CL,4
JBE .Good:
.Bad: MOV EAX,0xFFFD ; Invalid UTF-8 detected, output the replacement.
JMP .NoSg:
.Good:DEC ECX ; ECX=1, 2 or 3 continuation bytes expected.
LEA EAX,[ESI+ECX]
CMP EAX,EDI ; Check if there's that many input bytes left.
JBE .Cont:
DEC ESI ; Rollback, the last UTF-8 character is incomplete.
SUB EDI,ESI ; EDI is the number of bytes (1..3) which were not decoded.
MOV [ESP+24],EDI ; %ReturnECX.
JMP .End: ; CF=0.
.OutEAX: ; PROC1 ; Send EAX to callback. Preserves EBX,ESI,EDI, updates ReturnEDI.
PUSH EBX,ESI,EDI
MOV EDI,[ESP+16] ; ReturnEDI restore.
CALL [ESP+56] ; CallbackProc.
MOV [ESP+16],EDI ; ReturnEDI update.
POP EDI,ESI,EBX
RET
; ENDPROC1 .OutEAX:
.Cont:LODSB ; Continuation byte AL=10xxxxxxb expected.
BTR EAX,7 ; Reset the marker bit 7.
JNC .Bad:
BTR EAX,6
JC .Bad:
SHL EBX,6 ; Make room in EBX for the next 6 bits.
OR BL,AL ; Accumulate them.
DEC ECX
JNZ .Cont:
.Out: MOV EAX,EBX ; EAX=EBX is now the decoded codepoint 0..0x10_FFFF.
; Check for overlong encodings. DL=0x7F,0x1F,0x0F,0x07 for 1,2,3,4 bytes in UTF-8 character.
CMP EBX,0x01_0000 ; Codepoint 0x01_0000..0x10_FFFF should be encoded in 4 bytes.
JAE .NoOverlong:
CMP EBX,0x00_0800 ; Codepoint 0x00_0800..0x00_FFFF should be encoded in 3 bytes.
JB .2Bts:
CMP DL,0x0F
JE .NoOverlong:
JMP .Bad:
.2Bts:CMP EBX,0x00_0080 ; Codepoint 0x00_0080..0x00_07FF should be encoded in 2 bytes.
JB .1Bts:
CMP DL,0x1F
JE .NoOverlong:
JMP .Bad:
.1Bts:CMP DL,0x7F ; Codepoint 0x00_0000..0x00_007F should be encoded in 1 byte.
JE .NoOverlong:
TEST EBX
JNZ .Bad:
CMP DL,0x1F ; Exception: codepoint 0 may be encoded in 1 or 2 bytes.
JNE .Bad:
.NoOverlong:
SHR EBX,11 ; Check for surrogate codepoints.
CMP EBX,0x1B ; Compare the whole EBX; BL alone would also match codepoints 0x08_D800..0x08_DFFF.
JE .Bad: ; Do not accept surrogates 0xD800..0xDFFF from input.
TEST BX,0x3E0
JZ .NoSg: ; If codepoint EAX is below 0x0001_0000, surrogates do not apply.
CMPD [ESP+48],16 ; Output UTF %Width (16 or 32).
JNE .NoSg: ; UTF-32 does not need surrogates.
SUB EAX,0x0001_0000 ; Codepoint EAX was not encodable in one UTF-16 character.
MOV EBX,0x0000_03FF ; Use two surrogate Unichars.
AND EBX,EAX
SHR EAX,10
ADD EBX,0x0000_DC00 ; EBX is now low surrogate.
ADD EAX,0x0000_D800 ; EAX is now high surrogate.
CALL .OutEAX: ; High surrogate first.
MOV EAX,EBX ; Low surrogate.
JC .End: ; If aborted by CallbackProc.
.NoSg:CALL .OutEAX: ; Low surrogate or BMP codepoint or UTF-32.
JNC .Start: ; Parse the next UTF-8 character from the string ESI..EDI.
.End:POPAD
RET 4*4
ENDP1 DecodeUTF8@RT::
%ENDMACRO DecodeUTF8
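The way DecodeUTF8@RT derives the sequence length and the payload mask from the first byte (NOT, BSR, SUB 7, NEG, SHR DL,CL) is compact but not obvious. The Python model below (not EuroAssembler; `classify_lead_byte` is a hypothetical name) walks through the same steps. Rejecting 0xFF explicitly is an assumption of this sketch, because BSR finds no set bit in that case.

```python
def classify_lead_byte(byte: int):
    """Return (sequence_length, payload_mask), or None for an invalid lead byte."""
    inverted = ~byte & 0xFF               # NOT BL
    if inverted == 0:                     # 0xFF: no set bit for BSR to find
        return None
    highest = inverted.bit_length() - 1   # BSR: index of the highest set bit
    length = 7 - highest                  # SUB ECX,7 followed by NEG ECX
    if length == 0:                       # 0xxxxxxx: 7-bit ASCII
        return 1, 0x7F
    if 2 <= length <= 4:                  # 110xxxxx, 1110xxxx, 11110xxx
        return length, 0x7F >> length     # SHR DL,CL gives 0x1F, 0x0F, 0x07
    return None                           # continuation byte or 5+ byte form


ascii_a = classify_lead_byte(0x41)
two = classify_lead_byte(0xC3)
three = classify_lead_byte(0xE2)
four = classify_lead_byte(0xF0)
stray = classify_lead_byte(0x80)   # a continuation byte cannot start a character
```

The payload mask doubles as a record of the encoded length, which is what the overlong check later compares against the decoded codepoint.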
Macro EncodeUTF8 converts the codepoint in EAX to a UTF-8 character and stores it into the string at EDI, advancing EDI behind the stored bytes.
EncodeUTF8 %MACRO
  CALL EncodeUTF8@RT:
EncodeUTF8@RT: PROC1
    CMP EAX,0x0000_0080
    JAE .10:
    STOSB ; 1 byte: 0xxxxxxx.
    RET
.10: PUSH EAX,ECX
    MOV ECX,EAX
    CMP EAX,0x0000_0800
    JAE .30:
    SHR EAX,6
    OR AL,0xC0 ; 2 bytes: 110xxxxx 10xxxxxx.
.20: STOSB
    MOV EAX,ECX
    AND AL,0xBF
    OR AL,0x80 ; The last continuation byte 10xxxxxx.
    STOSB
    POP ECX,EAX
    RET
.30: CMP EAX,0x0001_0000
    JAE .40:
    XCHG AL,AH
    SHR AL,4
    OR AL,0xE0 ; 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
    STOSB
    MOV EAX,ECX
    SHL EAX,2
    XCHG AL,AH
    AND AL,0xBF
    OR AL,0x80
    JMP .20:
.40: SHR EAX,18
    AND AL,0xF7
    OR AL,0xF0 ; 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    STOSB
    MOV EAX,ECX
    SHR EAX,12
    AND AL,0xBF
    OR AL,0x80
    STOSB
    MOV EAX,ECX
    SHR EAX,6
    AND AL,0xBF
    OR AL,0x80
    JMP .20:
  ENDP1 EncodeUTF8@RT:
%ENDMACRO EncodeUTF8
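The bit manipulation in EncodeUTF8@RT is the usual UTF-8 packing of a codepoint into one to four bytes. The Python model below (not EuroAssembler; `encode_utf8` is a hypothetical name) spells out the shifts and masks for each size class. Like the macro, the sketch does not validate its input, so surrogates and values above 0x10_FFFF are an untested assumption here.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Model of EncodeUTF8@RT: pack a codepoint into 1..4 UTF-8 bytes."""
    if codepoint < 0x80:                      # 0xxxxxxx
        return bytes([codepoint])
    last = 0x80 | (codepoint & 0x3F)          # final continuation byte (label .20)
    if codepoint < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6), last])
    middle = 0x80 | ((codepoint >> 6) & 0x3F)
    if codepoint < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12), middle, last])
    return bytes([0xF0 | ((codepoint >> 18) & 0x07),   # 11110xxx
                  0x80 | ((codepoint >> 12) & 0x3F),
                  middle, last])


samples = [0x41, 0xE9, 0x20AC, 0x1F600]
encoded = [encode_utf8(cp) for cp in samples]
```

Each result can be compared with Python's own encoder, for example `chr(0x20AC).encode("utf-8")`, to confirm that the size classes and masks agree.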
ENDHEAD string32