
How do you set the output size in pcre2_substitute?


I use pcre2_substitute in C.

PCRE2_SPTR pattern;
PCRE2_SPTR replacement;
PCRE2_SPTR subject;

pcre2_code *re;
int errornumber;
int i;
int rc;

PCRE2_SIZE erroroffset;
PCRE2_SIZE *ovector;

size_t subject_length;
size_t replacement_length = strlen((char *)replacement);

pcre2_match_data *match_data;

subject_length = strlen((char *)subject);

PCRE2_UCHAR output[1024] = "";
PCRE2_SIZE outlen = sizeof(output) / sizeof(PCRE2_UCHAR);

re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, PCRE2_DOTALL, &errornumber, &erroroffset, NULL);
if (re == NULL)
{
    PCRE2_UCHAR buffer[256];
    pcre2_get_error_message(errornumber, buffer, sizeof(buffer));
    printf("PCRE2 compilation failed at offset %d: %s\n", (int)erroroffset, buffer);
    exit(EXIT_FAILURE); /* do not continue with a NULL pattern */
}

match_data = pcre2_match_data_create_from_pattern(re, NULL);
rc = pcre2_substitute(re, subject, subject_length, 0, 
     PCRE2_SUBSTITUTE_GLOBAL | PCRE2_SUBSTITUTE_EXTENDED, 
     match_data, NULL, replacement, replacement_length, output, &outlen);

The output buffer is declared as

PCRE2_UCHAR output[1024] = "";

If the result is longer than the buffer (1024 code units, including the terminating zero), pcre2_substitute returns error -48 (PCRE2_ERROR_NOMEMORY).

Before the substitution, we do not know the required length of the output.

How do you define a sufficiently large output string?


Solution

  • Add the flag PCRE2_SUBSTITUTE_OVERFLOW_LENGTH to the options in your call. When the output buffer turns out to be too small, the scan then continues without adding anything more to the buffer, in order to compute the length the substitution actually needs; that length is stored through the outlengthptr argument. The function still returns PCRE2_ERROR_NOMEMORY, so you can tell that more memory is required. When you get this error return, use the value stored through outlengthptr to malloc() a sufficiently large output buffer, and make the call again.

    It's legal (and not uncommon) to do the first call with a supplied output length of 0, and then unconditionally do the allocation and second call. That's the simplest code. Supplying a buffer which is probably large enough, and handling overflow as indicated above, is a way of avoiding the repeated call, thereby saving a bit of time. How effective that optimisation is depends on your ability to guess a reasonable initial buffer size. If you just use a fixed-length buffer, then the second call will be performed only on large substitutions, which is another way of saying that the optimisation will only be effective on short substitutions (where it is least important). YMMV.

    See the pcre2_substitute section in man pcre2api for a slightly longer discussion of this mechanism.