Can this be done using a method similar to this one:
As long as the current element of the string the user entered via scanf is not \0, add one to a "length" int, and print out the length once the loop ends.
I would be very grateful if anybody could guide me through the least complex way possible as I am a beginner.
What do you mean by string length?
The UTF-8 encoding is very well designed and compatible with the definition of C strings: UTF-8 strings are just null-terminated arrays of bytes, like ASCII strings.
The number of bytes is easily obtained with strlen(s). If for some reason you cannot use strlen, it is easy to emulate, and the algorithm is exactly what you propose in the question:
#include <stddef.h>

/* Count the bytes before the terminating '\0', as strlen does. */
size_t string_length(const char *s) {
    size_t length = 0;
    while (*s++ != '\0')
        length++;
    return length;
}
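Since the question reads the string with scanf, here is a minimal sketch of how the two pieces might fit together. The helper name read_word_length and the 100-byte buffer are my own assumptions, not something from the question. It takes a FILE * so it can be tried on any stream; scanf(...) behaves like fscanf(stdin, ...).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: reads one whitespace-delimited word from `in`
   and returns its length in bytes, or 0 if nothing could be read. */
size_t read_word_length(FILE *in) {
    char buf[100];
    size_t length = 0;

    /* %99s stores at most 99 bytes plus the terminating '\0',
       so the 100-byte buffer cannot overflow. */
    if (fscanf(in, "%99s", buf) != 1)
        return 0;

    /* Same loop as in the question: count until the null terminator. */
    while (buf[length] != '\0')
        length++;
    return length;
}
```

Calling read_word_length(stdin) and printing the result with printf("%zu\n", ...) gives the behaviour described in the question. Note that %s stops at the first whitespace, so only the first word is measured.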
The number of code points encoded in UTF-8 can be computed by counting the single-byte chars (range 1 to 127) and the leading bytes (range 0xC0 to 0xFF), ignoring continuation bytes (range 0x80 to 0xBF) and stopping at '\0'.
Here is a simple function to do this:
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        /* Continuation bytes look like 10xxxxxx; count every other byte. */
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}
This function assumes that the contents of the array pointed to by s are properly encoded.
Also note that this computes the number of code points, not the number of characters displayed, as some displayed characters are encoded using multiple code points, such as <LATIN CAPITAL LETTER A> followed by <COMBINING ACUTE ACCENT>.
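To make the difference between bytes, code points and displayed characters concrete, here is a small sketch. The names code_points, report, precomposed and combining are made up for this illustration, and the strings are spelled out as hex escapes so the example does not depend on the source file's encoding. As above, it assumes the input is valid UTF-8.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same rule as count_utf8_code_points: skip continuation bytes (10xxxxxx). */
size_t code_points(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++)
        count += ((unsigned char)*s & 0xC0) != 0x80;
    return count;
}

/* "hello" with an accented e stored as the single code point U+00E9
   (UTF-8 bytes C3 A9). */
static const char precomposed[] = "h\xC3\xA9llo";

/* Accented capital A stored as U+0041 followed by
   U+0301 COMBINING ACUTE ACCENT (UTF-8 bytes CC 81). */
static const char combining[] = "A\xCC\x81";

/* Print both measures side by side for one string. */
void report(const char *label, const char *s) {
    printf("%s: %zu bytes, %zu code points\n", label, strlen(s), code_points(s));
}
```

report("precomposed", precomposed) prints 6 bytes and 5 code points, while report("combining", combining) prints 3 bytes and 2 code points even though a terminal would normally show a single accented letter.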