winapi utf-16 richedit

Iterating WCHARs in a Rich Edit control


I am working with a Rich Edit control, and would like to iterate the WCHAR values in it.

I have written this routine:

int GetCharacterAtIndex(int pos) const
{
    TEXTRANGE tr{};
    tr.chrg.cpMin = pos;
    tr.chrg.cpMax = pos + 1;
    WCHAR buffer[2]{};
    tr.lpstrText = buffer;
    DWORD charsRetrieved = SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_GETTEXTRANGE, 0, (LPARAM)&tr);
    assert(charsRetrieved < DIM(buffer));
    if (charsRetrieved <= 0) return -1;
    return buffer[0];
}

This works great as long as the character at pos is not part of a UTF-16 surrogate pair. However, if it is part of a surrogate pair, the value returned is always the low surrogate. This is doubly annoying because the value returned in charsRetrieved is still just 1.
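
For reference, here is a minimal sketch of the UTF-16 surrogate ranges involved. The helper names (IsHighSurrogate, IsLowSurrogate, CombineSurrogates) are hypothetical and not part of the Windows API. They only illustrate the encoding rules: the high (leading) surrogate is the first code unit of a pair, and the low (trailing) surrogate is the second.

```cpp
#include <cassert>

// High (leading) surrogate: first code unit of a pair, 0xD800..0xDBFF.
constexpr bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Low (trailing) surrogate: second code unit of a pair, 0xDC00..0xDFFF.
constexpr bool IsLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid high/low pair into the code point it encodes.
// (Hypothetical helper; the caller must pass a well-formed pair.)
inline char32_t CombineSurrogates(char16_t high, char16_t low)
{
    assert(IsHighSurrogate(high) && IsLowSurrogate(low));
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, U+1F600 is encoded as the pair 0xD83D 0xDE00, so a single "character" at that position occupies two WCHARs.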

I modified the code as follows to get the high surrogates using EM_GETSELTEXT:

int GetCharacterAtIndex(int pos) const
{
    TEXTRANGE tr{};
    tr.chrg.cpMin = pos;
    tr.chrg.cpMax = pos + 1;
    WCHAR buffer[10]{}; // add some extra space
    tr.lpstrText = buffer;
    DWORD charsRetrieved = SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_GETTEXTRANGE, 0, (LPARAM)&tr);
    assert(charsRetrieved < DIM(buffer));
    if (charsRetrieved <= 0) return -1;
    int index = 0;
    if (buffer[0] >= 0xD800 && buffer[0] < 0xDC00)
    {
        SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_SETSEL, pos, pos + 1);
        DWORD surrogateStart{}, surrogateEnd{};
        SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_GETSEL, (WPARAM)&surrogateStart, (LPARAM)&surrogateEnd);
        if (surrogateStart == surrogateEnd)
        {
            // this may be a WinAPI bug. When pos actually points to the low surrogate, you sometimes have to do this twice.
            SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_SETSEL, pos, pos + 1);
            SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_GETSEL, (WPARAM)&surrogateStart, (LPARAM)&surrogateEnd);
        }
        LRESULT gotChars = SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_GETSELTEXT, 0, (LPARAM)buffer);
        assert(gotChars < DIM(buffer));
        if (gotChars <= 0) return -1;
        index = pos - surrogateStart;
        assert(index < gotChars);
    }
    return buffer[index];
}

This works, but it relies on EM_GETSELTEXT. The problem with EM_GETSELTEXT is that it requires mucking with the selection. This causes visual artifacts, even if you suppress redraw. And it leads to potentially even messier code to suppress selection change notifications (or not), and it requires the caller to save and restore the selection if required.

The bottom line is that it is very slow and it creates ugly visual artifacts.

Have I missed a more elegant way to accomplish this?

EM_GETTEXTRANGE is fast, silent, and self-contained. I would really like not to have to use EM_GETSELTEXT.


Solution

  • Based on helpful comments on my original question, I think the best solution is to change the function so that it always returns a complete Unicode character as a std::wstring. The caller can then use the size() of the returned string to advance its position by one or two WCHARs as required.

    Also, I corrected my terminology from the question. The high surrogate comes first. (In my question, I had the terminology backwards.)

    std::wstring GetCharacterAtIndex(int pos) const
    {
        TEXTRANGE tr{};
        tr.chrg.cpMin = pos;
        tr.chrg.cpMax = pos + 2; // in case it's a surrogate pair
        std::wstring buffer(10, 0); // extra space 
        tr.lpstrText = buffer.data();
        DWORD charsRetrieved = SendDlgItemMessageW(GetWindowHandle(), GetControlID(), EM_GETTEXTRANGE, 0, (LPARAM)&tr);
        assert(charsRetrieved < buffer.size());
        if (charsRetrieved <= 0) return L"";
        // EM_GETTEXTRANGE always returns the start of a full surrogate pair, even if pos points to a low surrogate.
        if (buffer[0] < 0xD800 || buffer[0] >= 0xDC00) // buffer[0] is not a high surrogate, so the character is a single WCHAR
            charsRetrieved = 1;
        buffer.resize(charsRetrieved);
        return buffer;
    }
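
The caller-side loop might look roughly like the sketch below. To keep it self-contained, it uses an in-memory std::u16string in place of the control, and CharacterAt and SplitIntoCharacters are hypothetical stand-ins, not the real GetCharacterAtIndex. The stand-in does not reproduce the control's behavior of snapping back to the start of a pair; it only shows how advancing by the size of each returned string keeps the position on character boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for GetCharacterAtIndex: returns one or two UTF-16
// code units starting at pos, or an empty string if pos is out of range.
inline std::u16string CharacterAt(const std::u16string& text, std::size_t pos)
{
    if (pos >= text.size()) return {};
    const char16_t first = text[pos];
    const bool isHigh = first >= 0xD800 && first < 0xDC00;
    if (isHigh && pos + 1 < text.size())
    {
        const char16_t second = text[pos + 1];
        if (second >= 0xDC00 && second < 0xE000)
            return text.substr(pos, 2); // complete surrogate pair
    }
    return text.substr(pos, 1); // BMP character or unpaired surrogate
}

// Walk the text one character at a time, advancing by the size of each
// returned string so that a surrogate pair is never split.
inline std::vector<std::u16string> SplitIntoCharacters(const std::u16string& text)
{
    std::vector<std::u16string> result;
    std::size_t pos = 0;
    while (pos < text.size())
    {
        std::u16string ch = CharacterAt(text, pos);
        assert(!ch.empty());
        pos += ch.size(); // 1 for a BMP character, 2 for a surrogate pair
        result.push_back(std::move(ch));
    }
    return result;
}
```

With the real control, the same loop shape would apply: call GetCharacterAtIndex(pos), stop when it returns an empty string, and otherwise add the returned size() to pos.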