How to extract the initial/final HTML characters from a string (with a reasonably good performance)?

I have fragments of HTML text (without tags). They can contain numeric character references and named character references. Any such reference counts as one HTML character. A directly encoded character in the text counts as one HTML character too. The problem: given a string, output N initial/final HTML characters.

So I need two functions that solve this problem. I will call these functions extractinitial and extractfinal.

For example, let the input string text be

&quot;ab&#x0022;&excl;cde.

It contains three character references (", ", &excl;) and six directly encoded characters. Then extractinitial(text, 5) must be "ab"&excl; and extractfinal(text, 5) must be &excl;cde. So I wrote these two functions as follows (I assume that a named/numeric character reference matches the /&[0-9a-z#]{1,};/ regex):

function extractinitial(text, num,    o, len, htmllen, k, j, curstr, b, curch, charlen) {
  o = "";
  len = length(text);
  htmllen = 0;
  k = 1;
  for (j = 1; j <= len; j += k) {
    curstr = substr(text, j);
    if (curstr ~ /^&[0-9a-z#]{1,};/) {
      match(curstr, /^(&[0-9a-z#]{1,};)/, b);
      curch = b[1];
      charlen = length(curch);
      o = o curch;
      htmllen = htmllen + 1;
      k = charlen;
      if (htmllen == num) {
        return o
      }
    }
    else {
      match(curstr, /^(.)/, b);
      curch = b[1];
      o = o curch;
      htmllen = htmllen + 1;
      k = 1;
      if (htmllen == num) {
        return o
      }
    }
  };
  return o
};

function extractfinal(text, num,    o, len, l, htmllen, k, j, curstr, b, curch, charlen) {
  o = "";
  len = length(text);
  l = len;
  htmllen = 0;
  k = 1;
  for (j = len; j >= 1; j -= k) {
    curstr = substr(text, 1, l);
    if (curstr ~ /&[0-9a-z#]{1,};$/) {
      match(curstr, /(&[0-9a-z#]{1,};)$/, b);
      curch = b[1];
      charlen = length(curch);
      o = curch o;
      htmllen = htmllen + 1;
      k = charlen;
      l = l - k;
      if (htmllen == num) {
        return o
      }
    }
    else {
      match(curstr, /(.)$/, b);
      curch = b[1];
      o = curch o;
      htmllen = htmllen + 1;
      k = 1;
      l = l - k;
      if (htmllen == num) {
        return o
      }
    }
  };
  return o
};

{
    text = $0
    textstart = extractinitial(text, 5);
    textend = extractfinal(text, 5);
    print "START:", textstart
    print "END:", textend
}

$ awk -f tst.awk file
START: &quot;ab&#x0022;&excl;
END: &excl;cde.

But these functions are absolutely unusable for me because their performance is too low (at least 15–20 times slower than the minimal acceptable level). If my strings are ~200 kilobytes long, these functions output only ~200 substrings in five minutes, and I have no idea why. This is absolutely unacceptable.

So I have two questions: 1. Is there anything wrong in these two functions that makes them so slow? 2. How to rewrite the functions so they have an acceptable performance?

I need to solve the problem using only gawk.

Solution

Try this, using any POSIX awk:

$ cat tst.awk
function makemap(text, map,     head, pos) {
    delete map
    while ( match(text, /&[[:alnum:]#]+;/) ) {
        head = head substr(text,1,RSTART-1) "x"
        map[pos += RSTART] = substr(text,RSTART,RLENGTH)
        text = substr(text,RSTART+RLENGTH)
    }
    return (head text)
}

function extract(text, num, fwd,        map, beg, end, len, str, i) {
    text = makemap(text, map)
    len = length(text)
    num = ( len > num ? num : len )
    if ( fwd ) {
        beg = 1
        end = num
    }
    else {
        beg = len + 1 - num
        end = len
    }
    for ( i=beg; i<=end; i++ ) {
        str = str ( i in map ? map[i] : substr(text,i,1) )
    }
    return str
}

function extractinitial(text, num) {
    return extract(text, num, 1)
}

function extractfinal(text, num) {
    return extract(text, num, 0)
}

{
    text = $0
    textstart = extractinitial(text, 5);
    textend = extractfinal(text, 5);
    print "START:", textstart
    print "END:", textend
}

$ awk -f tst.awk file
START: &quot;ab&#x0022;&excl;
END: &excl;cde.

It works by creating an array map[] where each element in the array is the encoded string from the input (e.g. "), indexed by the character position where that string starts and converting each such string in text to a single character (I used "x" but can be any character). Then the extract() function loops through the desired number of characters from text in the desired order, populating str with the string from map[] if the current character position is an index of map[] or the literal character from text otherwise, then returns str.

If you wanted to try a GNU-awk specific solution you could change makemap() to use patsplit() instead of while(match(...)):

function makemap(text,     i, n, coded, literal) {
    delete map
    n = patsplit(text, coded, /&[[:alnum:]#]+;/, literal)
    text = literal[0]
    for ( i=1; i<=n; i++ ) {
        map[length(text) + 1] = coded[i]
        text = text "x" literal[i]
    }
    return text
}

but I don't know if that'd execute any faster.

You could obviously make it faster if you make map[] global instead of function-local and call makemap() once before calling either extract...() function, e.g.:

$ cat tst.awk
function makemap(text,     head, pos) {
    delete map
    while ( match(text, /&[[:alnum:]#]+;/) ) {
        head = head substr(text,1,RSTART-1) "x"
        map[pos += RSTART] = substr(text,RSTART,RLENGTH)
        text = substr(text,RSTART+RLENGTH)
    }
    return (head text)
}

function extract(text, num, fwd,        beg, end, len, str, i) {
    len = length(text)
    num = ( len > num ? num : len )
    if ( fwd ) {
        beg = 1
        end = num
    }
    else {
        beg = len + 1 - num
        end = len
    }
    for ( i=beg; i<=end; i++ ) {
        str = str ( i in map ? map[i] : substr(text,i,1) )
    }
    return str
}

function extractinitial(text, num) {
    return extract(text, num, 1)
}

function extractfinal(text, num) {
    return extract(text, num, 0)
}

{
    text = makemap($0)
    textstart = extractinitial(text, 5);
    textend = extractfinal(text, 5);
    print "START:", textstart
    print "END:", textend
}

but it sounded like you needed 2 independent functions.

You MIGHT be able to make it more efficient still if you can figure out how to change the loop

for ( i=beg; i<=end; i++ ) {
    str = str ( i in map ? map[i] : substr(text,i,1) )
}

so that instead of calling substr() 1 char at a time it gets called once for however many literal chars exist between the current character position and the next index in map[] - left as an exercise :-). I say "MIGHT" because the extra processing time needed to create additional data and/or perform calculations to do that might outweigh the benefit of doing it on average.

By the way, regarding why the original code is slow - I expect it's because regexp matches are relatively slow and:

Each of the original functions is doing a regexp match num times per input string (and it sounds like num could be a large number) while I'm doing it as many times as there are coded characters in the input which presumably is a smaller number,
The original functions are doing an additional call to match() num times when I'm not doing any additional.
The extractfinal() function is, on every iteration, searching from the start of text for the regexp at the end of text which is a lot of characters for the regexp engine to grind through looking for a match that can only occur at the end of it each time around.

You could in theory make my first script above faster by passing num as an argument to makemap() and only looping forward or backwards on text up to that many times to only find up to that many coded chars instead of all of them as more than num in each direction will be ignored anyway BUT then you'd run into the same problem that the OPs extractfinal() has of trying to match regexps at the end of a long string so the end result would probably actually be slower.

Anyway, if I were to write the OPs code in a similar style to that in the question then here's how I'd write it:

$ cat orig.awk
function extractinitial(text, num,    o, len, htmllen, j, curstr, curch, charbeg, charlen) {
  len = length(text)
  htmllen = 0
  charlen = 1
  for (j = 1; j <= len; j += charlen) {
    curstr = substr(text, j)
    if ( match(curstr, /^&[[:alnum:]#]+;/) ) {
      charbeg = RSTART
      charlen = RLENGTH
    }
    else {
      charbeg = 1
      charlen = 1
    }
    curch = substr(curstr, charbeg, charlen)
    o = o curch
    htmllen++
    if (htmllen == num) {
      break
    }
  }
  return o
}

function extractfinal(text, num,    o, len, n, htmllen, j, curstr, curch, charbeg, charlen) {
  len = length(text)
  n = len
  htmllen = 0
  charlen = 1
  for (j = len; j >= 1; j -= charlen) {
    curstr = substr(text, 1, n)
    if ( match(curstr, /&[[:alnum:]#]+;$/) ) {
      charbeg = RSTART
      charlen = RLENGTH
    }
    else {
      charbeg = j
      charlen = 1
    }
    curch = substr(curstr, charbeg, charlen)
    o = curch o
    htmllen++
    n -= charlen
    if (htmllen == num) {
      break
    }
  }
  return o
}

{
    text = $0
    textstart = extractinitial(text, 5)
    textend = extractfinal(text, 5)
    print "START:", textstart
    print "END:", textend
}

which fixes the first 2 problems I mention above and removes unnecessary calls to length() but I suspect the way extractfinal() is searching the whole of text for a regexp match at the end of it on every iteration is going to still be a performance issue. You can check if extractfinal() is the problem by not calling it and see if the code runs more than twice as fast.

By the way, I renamed the variable l to n as you should never name a variable l as it looks far too much like the number 1 and so obfuscates your code. Also, duplicate code is bad code so any time you see a lot of duplication in your code take the time to think about that and try to refactor it to put the common code in one place (taking performance into account where appropriate of course).