I know that Gforth represents characters on the stack as their code points, but the material I'm learning from doesn't show any word that converts a character to its code point.
I also want to sum the code points of a string. What should I use to do that?
In Forth we distinguish primitive characters (usually an octet, which covers ASCII) and extended characters (usually Unicode).
A character is always represented on the stack as its code point, but how extended characters are represented in memory is implementation-dependent.
See also the Extended-Character word set:
Extended characters are stored in memory encoded as one or more primitive characters (pchars).
So to convert a character into a code point, it is enough to read that character from memory.
To read a primitive character, we use c@ ( c-addr -- char ):
: sum-codes ( c-addr u -- sum ) 0 -rot over + swap ?do i c@ + 1 chars +loop ;
\ test
"test passed" sum-codes .
NB: native string literals (in plain double quotes) are supported only in recent versions of Gforth. In older versions you need to use the word s" instead, as in s" test passed".
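To see the individual code points rather than only their sum, the same loop can print each value instead of accumulating it. This is a minimal sketch; the word name .codes is made up for illustration, and it handles primitive characters only. For a single character given in the source text, the standard words char (interpretation) and [char] (compilation) already leave its code on the stack.

```forth
\ .codes is a hypothetical helper name, not a standard word.
\ Print the code of every primitive character in a string.
: .codes ( c-addr u -- )
  over + swap          \ convert ( c-addr u ) to loop limits ( end start )
  ?do i c@ . 1 chars +loop ;

s" abc" .codes cr      \ prints 97 98 99
char A . cr            \ prints 65
```

The word s" is used here instead of a bare string literal so that the example also runs on older Gforth versions.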
To read an extended character, we can use xc@+ ( xc-addr1 -- xc-addr2 xchar ):
: sum-xcodes ( c-addr u -- sum )
over + >r 0 swap
begin ( sum xc-addr ) dup r@ u< while
xc@+ ( sum xc-addr2 xchar ) swap >r + r>
repeat drop rdrop
;
\ test
"test ⇦ ⇨ ⇧ ⇩" 2dup dump cr sum-xcodes . cr
The output of dump shows that Gforth stores extended characters in memory in UTF-8 encoding.
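One consequence is that the length u of a string counts primitive characters (bytes), not extended characters. The sketch below walks a string with xc@+ and counts the extended characters; the word name xchars is hypothetical, and the numbers in the comments assume that both the source text and Gforth's active encoding are UTF-8.

```forth
\ xchars is a hypothetical helper name, not a standard word.
\ Count the extended characters in a string.
: xchars ( c-addr u -- n )
  over + >r 0 swap               \ R: end address; stack: ( n xc-addr )
  begin dup r@ u< while
    xc@+ drop                    \ step over one xchar, discard its code
    swap 1+ swap                 \ increment the counter
  repeat drop r> drop ;

s" ⇦ ⇨" 2dup nip . xchars . cr   \ under UTF-8: 7 bytes, 3 xchars
```

Each arrow occupies three bytes in UTF-8, so the byte length and the character count differ, which is also why sum-codes and sum-xcodes give different results for such strings.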