c, perl, perl-xs

How to process a string char by char in XS code


Let's suppose there is a piece of code like this:

  my $str = 'some text';
  my $result = my_subroutine($str);

and my_subroutine() should be implemented in Perl XS. For example, it could return the sum of the bytes of the (Unicode) string.

In XS code, how can a string be processed (a) char by char, as a general method, and (b) byte by byte, if the string contains only characters from the ASCII subset? Is there a built-in function to convert from the native data structure of a string to char[]?


Solution

  • At the XS layer, you'll get either byte strings or UTF-8 strings. In the general case, your code will likely keep a char * pointing at the next item in the string, advancing it as it goes. For a useful set of UTF-8 support functions to use in XS, read the "Unicode Support" section of perlapi.


    An example of mine, from http://cpansearch.perl.org/src/PEVANS/Tickit-0.15/lib/Tickit/Utils.xs:

    int textwidth(str)
        SV *str
      INIT:
        STRLEN len;
        const char *s, *e;
    
      CODE:
        RETVAL = 0;
    
        if(!SvUTF8(str)) {
          str = sv_mortalcopy(str);
          sv_utf8_upgrade(str);
        }
    
        s = SvPV_const(str, len);
        e = s + len;
    
        while(s < e) {
          UV ord = utf8n_to_uvchr(s, e-s, &len, (UTF8_DISALLOW_SURROGATE
                                                   |UTF8_WARN_SURROGATE
                                                   |UTF8_DISALLOW_FE_FF
                                                   |UTF8_WARN_FE_FF
                                                   |UTF8_WARN_NONCHAR));
          int width = wcwidth(ord);
          if(width == -1)
            XSRETURN_UNDEF;
    
          s += len;
          RETVAL += width;
        }
    
      OUTPUT:
        RETVAL
    

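    In brief, this function first makes sure it has a UTF-8 string (upgrading a mortal copy if the SvUTF8 flag is not set), then iterates over it one Unicode character at a time, accumulating the display width reported by wcwidth(). If wcwidth() returns -1 for any character, the function returns undef.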
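
For part (b) of the question, the byte-by-byte case, here is a minimal sketch rather than a definitive implementation. The helper name sum_bytes is hypothetical and not part of the Perl API; it is plain C so it can be compiled and tested without perl. The XS glue in the comment is an assumption for illustration: it relies on SvPVbyte from perlapi, which hands back a char * plus a length and croaks if the string contains characters that cannot be downgraded to bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the Perl API): returns the sum of
 * the len bytes starting at s. It walks the buffer with a moving
 * pointer and an end pointer, the same pattern as the textwidth()
 * example, but advances one byte per step instead of one character.
 *
 * Assumed XS glue, for illustration only:
 *
 *   UV
 *   byte_sum(str)
 *       SV *str
 *     PREINIT:
 *       STRLEN len;
 *       const char *s;
 *     CODE:
 *       s = SvPVbyte(str, len);   // croaks on characters above 0xFF
 *       RETVAL = sum_bytes((const unsigned char *)s, len);
 *     OUTPUT:
 *       RETVAL
 */
unsigned long sum_bytes(const unsigned char *s, size_t len)
{
    const unsigned char *e = s + len;   /* one past the last byte */
    unsigned long total = 0;

    while (s < e)
        total += *s++;                  /* unsigned, so bytes >= 0x80 stay positive */

    return total;
}
```

Using unsigned char matters here: with a plain char * on a platform where char is signed, bytes above 0x7F would be added as negative values. For the 'some text' string from the question, the sum is 921.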