encoding utf-8 vbscript wincc

How to get the UTF-8 code from a single character in VBScript


I would like to get the UTF-8 code of a character. I have attempted to use streams, but it doesn't seem to work:

Example: פ should give 16#D7A4, according to https://en.wikipedia.org/wiki/Pe_(Semitic_letter)#Character_encodings

Const adTypeBinary = 1
Dim adoStr, bytesthroughado
Set adoStr = CreateObject("Adodb.Stream")
    adoStr.Charset = "utf-8"
    adoStr.Open
    adoStr.WriteText labelString
    adoStr.Position = 0 
    adoStr.Type = adTypeBinary
    adoStr.Position = 3 
    bytesthroughado = adoStr.Read
    Msgbox(LenB(bytesthroughado)) 'gives 2
    adoStr.Close
Set adoStr = Nothing
MsgBox(bytesthroughado) ' gives K

Note: AscW returns the Unicode code point (for פ, 1508 = U+05E4), not the UTF-8 byte sequence.


Solution

  • bytesthroughado holds a value of the Byte() subtype (a byte array; see the first output line), so it has to be read byte by byte rather than passed to MsgBox as if it were a string:

    Option Explicit
    
    Dim ss, xx, ii, jj, char, labelString
    
    labelString = "ařЖפ€"
    ss = ""
    For ii=1 To Len( labelString)
      char = Mid( labelString, ii, 1)
      xx = BytesThroughAdo( char)
      If ss = "" Then ss = VarType(xx) & " " & TypeName( xx) & vbNewLine
      ss = ss & char & vbTab
      For jj=1 To LenB( xx)
          ss = ss & Right( "0" & Hex( AscB( MidB( xx, jj, 1))), 2) & " "  'Zero-pad bytes below &H10.
      Next
      ss = ss & vbNewLine
    Next   
    
    Wscript.Echo ss
    
    Function BytesThroughAdo( labelChar)
        Const adTypeBinary = 1  'Indicates binary data.
        Const adTypeText   = 2  'Default. Indicates text data.
        Dim adoStream
        Set adoStream = CreateObject( "Adodb.Stream")
        adoStream.Charset = "utf-8"
        adoStream.Open
        adoStream.WriteText labelChar
        adoStream.Position = 0          'Type can only be changed at position 0.
        adoStream.Type = adTypeBinary
        adoStream.Position = 3          'Skip the 3-byte UTF-8 BOM (EF BB BF).
        BytesThroughAdo = adoStream.Read
        adoStream.Close
        Set adoStream = Nothing
    End Function
    

    Output:

    cscript D:\bat\SO\61368074q.vbs
    
    8209 Byte()
    a       61
    ř       C5 99
    Ж       D0 96
    פ       D7 A4
    €       E2 82 AC
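
    The question asks for a single value such as 16#D7A4 rather than separate bytes. A minimal sketch of how the bytes could be joined into that notation follows; the function name Utf8HexLiteral is hypothetical, and the empty-string guard is an assumption about how the result should look in that case:

    ```vbscript
    Option Explicit

    ' Hypothetical helper: returns the UTF-8 bytes of text as one
    ' "16#..." hexadecimal literal.
    Function Utf8HexLiteral(text)
        Dim stream, bytes, i, result
        result = "16#"
        If Len(text) = 0 Then           ' Assumption: nothing to encode.
            Utf8HexLiteral = result
            Exit Function
        End If

        Set stream = CreateObject("ADODB.Stream")
        stream.Charset = "utf-8"
        stream.Open
        stream.WriteText text
        stream.Position = 0             ' Type can only be changed at position 0.
        stream.Type = 1                 ' adTypeBinary
        stream.Position = 3             ' Skip the UTF-8 BOM (EF BB BF).
        bytes = stream.Read
        stream.Close
        Set stream = Nothing

        For i = 1 To LenB(bytes)
            ' Zero-pad so that a byte such as &H0A is written as "0A".
            result = result & Right("0" & Hex(AscB(MidB(bytes, i, 1))), 2)
        Next
        Utf8HexLiteral = result
    End Function

    ' ChrW avoids depending on the encoding of the script file itself.
    WScript.Echo Utf8HexLiteral(ChrW(&H05E4))   ' Hebrew letter Pe: 16#D7A4
    ```

    In a host that does not provide the WScript object, MsgBox can be used in place of WScript.Echo.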
    

    I used the characters ařЖפ€ to exercise your UTF-8 encoder because they cover one-, two- and three-byte sequences. For comparison, here is the output of alts8.ps1, a PowerShell script from another project:

    alts8.ps1 "ařЖפ€"
    
    Ch Unicode     Dec    CP    IME     UTF-8   ?  IME 0405/cs-CZ; CP852; ANSI 1250
    
     a  U+0061      97         …97…      0x61   a  Latin Small Letter A
     ř  U+0159     345         …89…    0xC599  Å�  Latin Small Letter R With Caron
     Ж  U+0416    1046         …22…    0xD096  Ð�  Cyrillic Capital Letter Zhe
     פ  U+05E4    1508        …228…    0xD7A4  פ  Hebrew Letter Pe
     €  U+20AC    8364        …172…  0xE282AC â�¬  Euro Sign
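
    As a cross-check on the UTF-8 column above, the same bytes can be derived arithmetically from the value AscW returns, without a stream. This is only a sketch: the names Utf8HexFromCodePoint and Hex2 are hypothetical, and the calling loop assumes each character lies in the Basic Multilingual Plane (surrogate pairs are not combined).

    ```vbscript
    Option Explicit

    ' Hypothetical helper: two-digit hexadecimal form of one byte value.
    Function Hex2(n)
        Hex2 = Right("0" & Hex(n), 2)
    End Function

    ' Hypothetical helper: UTF-8 encoding of a code point as a hex string.
    Function Utf8HexFromCodePoint(cp)
        If cp < &H80 Then                       ' 1 byte:  0xxxxxxx
            Utf8HexFromCodePoint = Hex2(cp)
        ElseIf cp < &H800 Then                  ' 2 bytes: 110xxxxx 10xxxxxx
            Utf8HexFromCodePoint = Hex2(&HC0 Or (cp \ &H40)) & _
                                   Hex2(&H80 Or (cp And &H3F))
        ElseIf cp < 65536 Then                  ' 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            Utf8HexFromCodePoint = Hex2(&HE0 Or (cp \ &H1000)) & _
                                   Hex2(&H80 Or ((cp \ &H40) And &H3F)) & _
                                   Hex2(&H80 Or (cp And &H3F))
        Else                                    ' 4 bytes: 11110xxx followed by three 10xxxxxx
            Utf8HexFromCodePoint = Hex2(&HF0 Or (cp \ &H40000)) & _
                                   Hex2(&H80 Or ((cp \ &H1000) And &H3F)) & _
                                   Hex2(&H80 Or ((cp \ &H40) And &H3F)) & _
                                   Hex2(&H80 Or (cp And &H3F))
        End If
    End Function

    Dim codes, i, ch, cp, report
    codes = Array(&H61, &H159, &H416, &H5E4, &H20AC)   ' a, ř, Ж, פ, €
    report = ""
    For i = 0 To UBound(codes)
        ch = ChrW(codes(i))
        ' AscW returns a signed 16-bit value; mask it to the range 0..65535.
        cp = AscW(ch) And 65535
        report = report & "U+" & Right("000" & Hex(cp), 4) & vbTab & _
                 Utf8HexFromCodePoint(cp) & vbNewLine
    Next
    WScript.Echo report
    ```

    For פ (U+05E4 = 1508) the two-byte branch gives &HC0 Or (1508 \ 64) = &HD7 and &H80 Or (1508 And 63) = &HA4, which matches the D7 A4 returned by the stream-based function.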