cregexposix-ere

Posix regex matching in C... Full match anomaly?


So my current homework is to implement a DBMS-like system in C, and I've decided to use the linux POSIX regex matching library. I'm writing toy codes to learn stuff, and this is my testing for the "command parsing" part of the DBMS. The post is a little long, but please bear with me.

So, the full program code, based on this example:

char *toSubstring(const char *str, int start, int end)
{
  char *buffer = malloc((end - start + 1) * sizeof *buffer);
  memcpy(buffer, str + start, end - start);
  *(buffer + (end - start)) = '\0';

  return buffer;
}

int compile_regex(regex_t *r, const char *r_pattern)
{
  int status = regcomp(r, r_pattern, REG_EXTENDED|REG_NEWLINE);

  if (status != 0) {
    char err_msg[MAX_ERR_MSG];
    regerror(status, r, err_msg, MAX_ERR_MSG);
    printf("REGEX error compiling '%s': %s\n", r_pattern, err_msg);

    return status;
  }
  return 0;
}

int match_regex(regex_t *r, const char *r_text, char ***m_array, int n_matches)
{
  int i, nomatch;
  regmatch_t m_osets[n_matches];
  *m_array = malloc(n_matches * sizeof **m_array);

  nomatch = regexec(r, r_text, n_matches, m_osets, 0);

  if (nomatch) {
    printf("No matches\n");

    return nomatch;
  }

  for (i = 0; i < n_matches ; i++) {
    int start = (int) m_osets[i].rm_so;
    int stop = (int) m_osets[i].rm_eo;

    if (start==-1 || stop==-1) {
      *(*(m_array)+i) = NULL;

      printf("WARNING: Match block %d is void!\n", i);
    } else {
      *(*(m_array)+i) = toSubstring(r_text, start, stop);

      printf("Match block %d @bytes %d:%d\n", i, start, stop);
    }
  }
  return 0;
}

void chafree(char **c, int n)
{
  int i;

  for (i = 0; i < n; i++) {
    if (*(c+i)!=NULL)
      free(*(c+i));
  }
  free(c);
}

int main(int argc, char **argv)
{
  int i, m;
  regex_t r;
  const char * r_text = *(argv+1);
  const char * r_pattern = *(argv+2);

  char **matches = NULL;

  if (argc != 4) {
    printf("Usage: ./program_name \"r_text\" \"r_pattern\" n_matches\n");

    exit(1);
  }
  printf("Sweeping '%s' for '%s'\n", r_text, r_pattern);

  compile_regex(&r, r_pattern);
  match_regex(&r, r_text, &matches, atoi(*(argv+3)));

  if (matches != NULL) {
    for(i=0;i<atoi(*(argv+3));i++){

      if(*(matches+i)!=NULL)
        printf("$%d --> %s\n", i, *(matches+i));
      else printf("$%d --> %p\n", i, *(matches+i));
    }
    chafree(matches, atoi(*(argv+3)));
  }
  regfree(&r);

  return 0;

A slight difference from the example I've presented is the fact that I'm storing the matches and capture groups in a string vector.

Now, when I run the program with:

./regex_drills "insert whatever bananana" "([[:alpha:]]+)[[:blank:]]*([^\0]*)" 3

The output I receive is:

Sweeping 'insert whatever bananana' for '([[:alpha:]]+)[[:blank:]]*([^\0]*)'
Match block 0 @bytes 0:23
Match block 1 @bytes 0:5
Match block 2 @bytes 7:23
$& --> insert whatever bananana!
$1 --> insert
$2 --> whatever bananana

The regex pattern seems to be alright, according to regex 101, but notice the stray " ! " at the end of the full expression. Altough the capture groups are correctly parsed, the unusual character happens (so far) on the full expression range and, for the test cases used up until now, only when it is exactly 24 bytes long. It's possibly a very stupid mistake and I'm sorry for it.

Also, any suggestions on how to handle regex in a better and perhaps more elegant way, in C, would be very welcome. Many thanks in advance.

Edit

So, according to responses it was an offset error internal to toSubstring. It's fixed now and the outputs are smooth, as they ought to. I've also cleaned up the code a bit, as suggested in the comments.

A somewhat invasive run using valgrind reveals no errors or undefined behaviours, unlike what was happening previously:

$ valgrind --leak-check=full --track-origins=yes --show-reachable=yes ./regex_drills "insert whatever bananana" "([[:alpha:]]+)[[:blank:]]*([^\0]*)" 3

==7051== Memcheck, a memory error detector
==7051== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==7051== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==7051== Command: ./regex_drills insert\ whatever\ bananana ([[:alpha:]]+)[[:blank:]]*([^\\0]*) 3
==7051== 
Sweeping 'insert whatever bananana' for '([[:alpha:]]+)[[:blank:]]*([^\0]*)'
Match block 0 @bytes 0:24
Match block 1 @bytes 0:6
Match block 2 @bytes 7:24
$0 --> insert whatever bananana
$1 --> insert
$2 --> whatever bananana
==7051== 
==7051== HEAP SUMMARY:
==7051==     in use at exit: 0 bytes in 0 blocks
==7051==   total heap usage: 167 allocs, 167 frees, 18,458 bytes allocated
==7051== 
==7051== All heap blocks were freed -- no leaks are possible
==7051== 
==7051== For counts of detected and suppressed errors, rerun with: -v
==7051== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

I'd like to thank everyone for the quick response. I've learned plenty from this.


Solution

  • Your toSubstring function has an off-by-one error. end is an exclusive bound, so the length of the substring is len = end - start. You must allocate one more than that to store the terminating null character, but you should only copy len chars and, mor importantly, you should write the terminating null to buffer[len], not buffer[len + 1]:

    char *toSubstring(const char *str, int start, int end)
    {
      char *buffer = malloc(end - start + 1);
      memcpy(buffer, str + start, end - start);
      *(buffer + (end - start)) = '\0';
    
      return buffer;
    }
    

    Maybe you didn't intend end to be exclusive, because you adjust it when you call toSubstring. You can keep these semantics and allocate end - start + 2 chars (and keey the rest of your code), but in C, the upper bound is usually exclusive, so I recommend to use the function as above and to call it like this:

    *(*(m_array) + i) = toSubstring(r_text, start, stop);
    

    instead of (..., stop - 1).