
Markup matches and escape special characters outside the matches at the same time


I have a search function for a treeview that highlights all matches. It supports both case-insensitive and case-sensitive searches, and both regular expressions and literal strings. However, I have a problem when the current cell contains special characters that are not part of the matches. Consider the following text inside a treeview cell:

father & mother

Suppose I search the whole treeview for the letter 'e'. To highlight only the matches and not the whole cell, I need to use markup. To achieve this, I use g_regex_replace_eval and its callback function as described in the GLib documentation. The resulting marked-up text for the cell looks like this:

fath<span background='yellow' foreground='black'>e</span>r & moth<span background='yellow' foreground='black'>e</span>r

Special characters inside the matches are escaped before being added to the hash table that is used by the eval function, so they are not a problem.

But I have the '&' now outside the markup parts, and it has to be changed to &amp;, otherwise the markup won't show up in the cell and a warning

Failed to set text from markup due to error parsing markup: Error on line x: Entity did not end with a semicolon; most likely you used an ampersand character without intending to start an entity - escape ampersand as &amp;

will be shown inside the terminal.

If I use g_markup_escape_text on the new cell text, it will obviously not only escape the '&', but also the '<' and '>' of the markup, so this is no solution.

Is there a reasonable way to put markup around the matches and escape special characters outside the markup, either at the same time or in a few steps? Everything I could think of so far is much too complicated, if it would work at all.

Even though I had already considered most of Philip's suggestion before asking my question, I had not yet looked into UTF-8 handling, so he gave an important hint for the solution. The following is the core of a working implementation:

gchar *counter_char = original_cell_txt; // counter_char will move through all the characters of original_cell_txt.
gint counter;

gunichar unichar;
gchar utf8_char[6]; // Six bytes is the buffer size needed later by g_unichar_to_utf8 (). 
gint utf8_length;
gchar *utf8_escaped;

enum { START_POS, END_POS };
GArray *positions[2];
positions[START_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
positions[END_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
gint start_position, end_position;

txt_with_markup = g_string_new ("");    

g_regex_match (regex, original_cell_txt, 0, &match_info);

while (g_match_info_matches (match_info)) {
    g_match_info_fetch_pos (match_info, 0, &start_position, &end_position);
    g_array_append_val (positions[START_POS], start_position);
    g_array_append_val (positions[END_POS], end_position);
    g_match_info_next (match_info, NULL);
}

while (*counter_char != '\0') { // A plain while loop also copes with an empty cell text.
    unichar = g_utf8_get_char (counter_char);
    counter = counter_char - original_cell_txt; // byte offset, obtained by pointer arithmetic

    // The length check prevents reading index 0 of an array that has already been emptied.
    if (positions[END_POS]->len > 0 && counter == g_array_index (positions[END_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "</span>");
        // It's simpler to always access the first element instead of looping through the whole array.
        g_array_remove_index (positions[END_POS], 0);
    }
    /*
        No "else if" is used here, since if there is a search for a single character going on and
        such a character appears twice in a row, like 'm' in "command", a span tag has to be
        closed and opened at the same position between both m's.
    */
    if (positions[START_POS]->len > 0 && counter == g_array_index (positions[START_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "<span background='yellow' foreground='black'>");
        // See the comment for the similar instruction above.
        g_array_remove_index (positions[START_POS], 0);
    }

    utf8_length = g_unichar_to_utf8 (unichar, utf8_char);
    /*
        Instead of using a switch statement to check whether the current character needs to be escaped,
        for simplicity the character is sent to the escape function regardless of whether there will be
        any escaping done by it or not.
    */
    utf8_escaped = g_markup_escape_text (utf8_char, utf8_length);

    txt_with_markup = g_string_append (txt_with_markup, utf8_escaped);

    // Cleanup
    g_free (utf8_escaped);

    counter_char = g_utf8_find_next_char (counter_char, NULL);
}

/*
    There is a '</span>' to set at the end; because the end position is one position after the string size
    this couldn't be done inside the preceding loop.
*/            
if (positions[END_POS]->len) {
    g_string_append (txt_with_markup, "</span>");
}

g_object_set (txt_renderer, "markup", txt_with_markup->str, NULL);

// Cleanup
g_string_free (txt_with_markup, TRUE); // The "markup" property keeps its own copy of the string.
g_regex_unref (regex);
g_match_info_free (match_info);
g_array_free (positions[START_POS], TRUE);
g_array_free (positions[END_POS], TRUE);

Solution

  • Probably the way to do this is to not use g_regex_replace_eval(), but rather to collect the list of match positions for the string first. Note that g_regex_match_all() returns every possible match from one starting position, so for successive, non-overlapping matches it is likely better to call g_regex_match() and iterate with g_match_info_next(). Then step through the string character by character (do this using the g_utf8_*() functions, since this has to be Unicode-aware). If you get to a character which needs to be escaped (<, >, &, ", '), output the escaped entity for it. When you get to a match start or end position, output the corresponding opening or closing markup.
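
The stepping-through idea above can be sketched without GLib, which makes the logic easy to compile and test on its own. This is only a minimal sketch under some assumptions: the names Range, entity_for, append and highlight_markup are hypothetical and not part of GLib, the match ranges are byte offsets that are sorted, non-overlapping and non-empty, and only the five characters listed above are escaped. It walks the string byte by byte, which is safe for UTF-8 as long as the offsets fall on character boundaries, because the characters that need escaping are all ASCII and ASCII byte values never occur inside a multi-byte UTF-8 sequence.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one match position: byte offsets, end exclusive. */
typedef struct { size_t start, end; } Range;

/* Returns the markup entity for c, or NULL if c needs no escaping. */
static const char *entity_for(char c) {
    switch (c) {
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '"':  return "&quot;";
    case '\'': return "&apos;";
    default:   return NULL;
    }
}

/* Appends len bytes of src to the growing, NUL-terminated buffer *buf. */
static void append(char **buf, size_t *used, size_t *cap,
                   const char *src, size_t len) {
    if (*used + len + 1 > *cap) {
        *cap = (*used + len + 1) * 2;
        *buf = realloc(*buf, *cap);
        if (*buf == NULL) abort();
    }
    memcpy(*buf + *used, src, len);
    *used += len;
    (*buf)[*used] = '\0';
}

/* Wraps every range of text in open_tag/close_tag and escapes all special
   characters, inside and outside the ranges. The caller frees the result. */
char *highlight_markup(const char *text, const Range *ranges, size_t n_ranges,
                       const char *open_tag, const char *close_tag) {
    char *out = NULL;
    size_t used = 0, cap = 0, next = 0, len = strlen(text);
    int in_match = 0;

    append(&out, &used, &cap, "", 0); /* make out a valid empty string */

    /* i runs up to and including len so that a match ending at the very
       end of the text is still closed inside the loop. */
    for (size_t i = 0; i <= len; i++) {
        /* Close before opening: adjacent matches share one position. */
        if (in_match && next < n_ranges && i == ranges[next].end) {
            append(&out, &used, &cap, close_tag, strlen(close_tag));
            in_match = 0;
            next++;
        }
        if (!in_match && next < n_ranges && i == ranges[next].start) {
            append(&out, &used, &cap, open_tag, strlen(open_tag));
            in_match = 1;
        }
        if (i == len) break;

        const char *entity = entity_for(text[i]);
        if (entity != NULL)
            append(&out, &used, &cap, entity, strlen(entity));
        else
            append(&out, &used, &cap, &text[i], 1);
    }
    return out;
}
```

For the cell text from the question, the ranges {4, 5} and {13, 14} (the two 'e' characters) together with the tags "<span background='yellow' foreground='black'>" and "</span>" would give a string in which the '&' is written as &amp; while the span tags stay intact. In real GTK code, g_markup_escape_text() and a GString would presumably take the place of the two helper functions.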