regex shell awk zsh stdin

gawk hangs when using a regex for RS combined with reading a continuous stream from stdin


I'm streaming data using netcat and piping the output to gawk. Here is an example byte sequence that gawk will receive:

=AAAA;=BBBB;;CCCC==DDDD;

The data can include nearly any arbitrary characters, but never contains NULL characters; = and ; are reserved as delimiters. As chunks of arbitrary characters are written, each chunk will always be prefixed by one of the delimiters, and always be suffixed by one of the delimiters, but either delimiter can be used at any time: = is not always the prefix, and ; is not always the suffix. A chunk is never written without an appropriate prefix and suffix. As the data is parsed, I need to distinguish which delimiter was used, so that my downstream code can properly interpret that information.

Since this is a network stream, stdin remains open after this sequence is read, as it waits for future data. I'd want gawk to read until either delimiter is encountered, and then execute the body of my gawk script with whatever data was found, while ensuring that it properly handles the continuous stream of stdin. I explain this in more detail below.

Thus far

Here is what I have attempted thus far (zsh script, using gawk, on macOS). For this post, I simplified the body to just print the data - my full gawk script has a much more complicated body. I also simplified the netcat stream to instead just cat a file (along with cat'ing stdin in order to mimic the stream behavior).

cat example.txt - | gawk '
BEGIN {
    RS = "=|;";
}
{
    if ($0 != "") {
        print $0;
        fflush();
    }
}
'

example.txt

=AAAA;=BBBB;=CCCC;=DDDD;

My attempt successfully handles most of the data, up until the most recent record. It hangs waiting for more data from stdin, and fails to execute the body of my script for the most recent record, despite an appropriate delimiter clearly being available in stdin.

Current output: (fails to process the most-recent record of DDDD)

AAAA
BBBB
CCCC
[hang here, waiting for future data]

Desired output: (successfully processes all records, including the most-recent)

AAAA
BBBB
CCCC
DDDD
[hang here, waiting for future data]

What, exactly, could be the cause of this problem, and how can I potentially address it? I recognize that this seems to be somewhat of an edge-case scenario. Thank you all very much for your help!

Edit: Comment consolidation, misc clarifications, and various observations/realizations

Here are some misc observations I made during debugging, both before and after I originally made this post. These edits also clarify some questions that came up in the comments, and consolidate the info scattered across various comments into a single place. They also include some realizations I made about how gawk works internally, based on the extremely insightful information in the comments. Info in this edit supersedes any potentially conflicting info that may have been discussed in the comments.

  1. I briefly investigated whether this could be a pipe buffering issue imposed by the OS. After messing with the stdbuf tool to disable all pipe buffering, it seems that buffering is not the problem at all, at least not in the traditional sense (see item #3).

  2. I noticed that if stdin is closed and a regex is used for RS, no problems occur. Conversely, if stdin remains open and RS is not a regex (i.e. a plaintext string), no problems occur either. The problem only occurs if both stdin remains open and RS is a regex. Thus, we can reasonably assume that it's something related to how regex handles having a continuous stream of stdin.

  3. I noticed that if my RS with regex (RS = "=|;";) is 3 characters long...and stdin remains open...it stops hanging after exactly 3 additional characters appear in stdin. If I adjust the length of my regex to be 5 chars (RS = "(=|;)"), the amount of additional characters necessary to return from hanging adjusts accordingly. Combined with the extremely insightful discussion with Kaz, this establishes that the hanging is an artifact of the regex engine itself. Like Kaz said, when the regex engine parses RS = "=|;";, it ends up trying to read additional characters from stdin in order to be sure that the regex is a match, despite this additional read not being strictly necessary for the regex in question, which obviously causes a hang waiting on stdin. I also tried adding lazy quantifiers to the regex, which in theory means the regex engine can return immediately, but alas it does not, as this is an implementation detail of the regex engine.

  4. The gawk docs here and here state that when RS is a single character, it is treated as a plaintext string, and RS is matched without invoking the regex engine. Conversely, if RS has 2 or more characters, it is treated as a regex, and the regex engine will be invoked (bringing the problem discussed in item #3 into play). However, this seems to be slightly misleading, apparently because of an implementation detail of gawk. I tried RS = "xy"; (and adjusted my data accordingly), and re-ran my experiment from #3. No hanging occurred and the correct output was printed, which suggests that despite RS being 2 characters, it is still being treated as a plaintext string: the regex engine is never invoked, and the hanging problem never occurs. So, there seems to be some further filtering on whether RS is treated as plaintext or as a regex.

  5. So....now that we've figured out the root cause of the problem....what do we do about it? An obvious idea would be to avoid using regex....but that points toward writing a custom data parser in C or some other language. This hypothetical custom program would parse the input entirely from scratch, and gawk/regex would never be involved anywhere in the lifecycle of my script. Although I could do this, and this would certainly solve the problem, the extent of my full data parsing is somewhat complex, so I'd rather not go down this path of weeds.

  6. This brings us to Ed Morton's workaround, which is probably the best way to go, or some derivative thereof. Summarizing his approach below:

Basically, use other CLI tools to do an ahead-of-time conversion, before data is given to gawk, to add a suffixed NULL character after each potential delimiter. Next, invoke gawk with RS as the NULL character, which would treat RS as a plaintext string and not a regex, which means the hanging problem never comes into play. From there, the real delimiter and data chunk could be decoded and processed in whatever way you want.
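
The conversion summarized above can be sketched as follows. This is a minimal sketch under stated assumptions, not Ed's exact code: it assumes bash for the read loop (read -N is a bash option, so the loop may need adapting for other shells such as zsh) and gawk for the NUL record separator. The function names frame_chunks and decode_chunks and the tab-separated output are illustrative choices.

```shell
#!/usr/bin/env bash

# Copy stdin to stdout one character at a time, appending a NUL byte
# after every "=" or ";" so that each delimiter ends a NUL-terminated record.
frame_chunks() {
    local ch
    while IFS= read -r -d '' -N1 ch; do
        case $ch in
            '='|';') printf '%s\0' "$ch" ;;
            *)       printf '%s'   "$ch" ;;
        esac
    done
}

# Decode the framed records: the last character of each record is the
# delimiter that closed the chunk. RS is a single literal character here,
# which the question's notes say gawk matches without the regex engine.
decode_chunks() {
    gawk -v RS='\0' '
        {
            delim = substr($0, length($0))
            if (delim != "=" && delim != ";") next   # unterminated tail, skip
            chunk = substr($0, 1, length($0) - 1)
            if (chunk != "") { print delim "\t" chunk; fflush() }
        }
    '
}

# Usage, mirroring the pipeline in the question:
#   cat example.txt - | frame_chunks | decode_chunks
```

Reading one character per read call is slow for high-volume streams, so this trades throughput for the guarantee that each chunk is released as soon as its closing delimiter arrives.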

Although I have now marked Ed's answer as the solution, I think that my final solution will be a hybrid of Ed's approach, Kaz's insight, some subsequent realizations I made thanks to them, and some arbitrary approach that I can come up with in order to add those suffixed NULL characters. Wish I could mark two answers as solutions! Thank you everyone for your help, especially Ed Morton and Kaz!


Solution

  • A workaround is to insert a shell read loop into the pipeline that carves the original awk input (the OP's actual netcat output) up into individual characters and then feeds them to awk one at a time:

    cat example.txt - |
    while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
    awk -v RS='\0' '
        /[;=]/ { if (rec != "") { print rec; fflush() }; rec=""; next }
        { rec=rec $0 }
    '
    AAAA
    BBBB
    CCCC
    DDDD
    

    That requires GNU awk, or some other awk that can handle a NUL character as the RS, since that is non-POSIX behavior. It also assumes your input can't contain NUL bytes, i.e. it's a valid POSIX text "file".

    Read on for how we got there if interested...

    I thought there was at least 1 bug here as I found multiple oddities (see below) so I opened a gawk bug report at https://lists.gnu.org/archive/html/bug-gawk/2024-07/msg00006.html but per the gawk provider, Arnold, the differences in behavior in this case are just implementation details of having to read ahead to ensure the regexp matches the right string.

    It seems there are 3 issues at play here, e.g. using GNU awk 5.3.0 on cygwin:

    1. Different supposedly equivalent regexps produce different behavior:
    $ printf 'A;B;C;\n' > file
    
    $ cat file - | awk -v RS='(;|=)' '{print NR, $0}'
    1 A
    
    $ cat file - | awk -v RS=';|=' '{print NR, $0}'
    1 A
    2 B
    
    $ cat file - | awk -v RS='[;=]' '{print NR, $0}'
    1 A
    2 B
    3 C
    

    (;|=), ;|= and [;=] should be equivalent but clearly they aren't in this case.

    The good news is you can apparently work around that problem using a bracket expression as in the 3rd case above instead of an "or".

    2. The output record trails the input record when the record separator character is the last one in the input, e.g. with no newline after the last ;:
    $ printf 'A;B;C;' > file
    
    $ cat file - | awk -v RS='(;|=)' '{print $0; fflush()}'
    
    $ cat file - | awk -v RS=';|=' '{print $0; fflush()}'
    A
    
    $ cat file - | awk -v RS='[;=]' '{print $0; fflush()}'
    A
    B
    

    The bad news is that this impacts the OP's example:

    $ printf ';AAAA;BBBB;CCCC;DDDD;' > file
    

    With a literal character RS:

    $ cat file - | awk -v RS=';' '{print $0; fflush()}'
    
    AAAA
    BBBB
    CCCC
    DDDD
    

    With a regexp RS that should also make that char literal:

    $ cat file - | awk -v RS='[;]' '{print $0; fflush()}'
    
    AAAA
    BBBB
    CCCC
    
    $ printf ';AAAA;BBBB;CCCC;DDDD;x' > file
    
    $ cat file - | awk -v RS='[;]' '{print $0; fflush()}'
    
    AAAA
    BBBB
    CCCC
    DDDD
    
    3. Adding different characters to the RS bracket expression produces inconsistent behavior (I stumbled across this by accident):
    $ printf 'A;B;C;\n' > file
    
    $ cat file - | awk -v RS='[;|=]' '{print $0; fflush()}'
    A
    
    $ cat file - | awk -v RS='[;a=]' '{print $0; fflush()}'
    A
    B
    C
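
For contrast with the three issues above, the patterns only diverge while the input stays open. A minimal sketch, assuming a finite input on a pipe that closes and an awk that accepts a regexp RS (such as gawk or mawk); the helper name split_closed is made up for illustration:

```shell
# Split a finite input on ";" or "=" using the RS pattern given as $1.
# The pipe closes after the last byte, so awk sees end-of-file and has
# no reason to wait for read-ahead.
split_closed() {
    printf 'A;B=C;' | awk -v RS="$1" '{ print NR, $0 }'
}

# Run the same input through the three supposedly equivalent patterns.
for rs in '(;|=)' ';|=' '[;=]'; do
    printf 'RS=%s\n' "$rs"
    split_closed "$rs"
done
```

With the input closed, each pattern should report the same three records (A, B and C), which matches the question's observation that the problem needs both a regexp RS and a stdin that stays open.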
    

    FWIW I tried setting a timeout:

    $ cat file - | awk -v RS='[;]' 'BEGIN{PROCINFO["-", "READ_TIMEOUT"]=100} {print $0; fflush()}'
    A
    B
    awk: cmd. line:1: (FILENAME=- FNR=3) fatal: error reading input file `-': Connection timed out
    
    $ cat file - | awk -v RS='[;]' -v GAWK_READ_TIMEOUT=1 '{print $0; fflush()}'
    A
    B
    

    and stdbuf to disable buffering:

    $ cat file - | stdbuf -i0 -o0 -e0 awk -v RS='[;]' '{print $0; fflush()}'
    A
    B
    

    and matching every character (thinking I could then use RT ~ /[=;]/ to find the separator):

    $ cat file - | awk -v RS='(.)' '{print RT; fflush()}'
    A
    ;
    B
    ;
    C
    

    but none of them would let me read the last record separator, so at this point I don't know what the OP could do to successfully read the last record of continuing input using a regexp, other than something like this:

    $ printf 'A;B;C;' > file
    
    $ cat file - |
        while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
        awk -v RS='\0' '/[;=]/ { print rec; fflush(); rec=""; next } { rec=rec $0 }'
    A
    B
    C
    

    and using the OP's sample input but with different text per record to make the mapping of input to output records clearer:

    $ printf '=AAAA=BBBB;CCCC;DDDD=' > example.txt
    
    $ cat example.txt - |
        while IFS= read -r -d '' -N1 char; do printf '%s\0' "$char"; done |
        awk -v RS='\0' '/[;=]/ { print rec; fflush(); rec=""; next } { rec=rec $0 }'
    
    AAAA
    BBBB
    CCCC
    DDDD
    

    We're using NUL chars as the delimiters and various options above to make the shell read loop robust enough to handle blank lines and other white space in the input, see https://unix.stackexchange.com/a/49585/133219 and https://unix.stackexchange.com/a/169765/133219 for details on those issues. We're additionally using a NUL char for the awk RS so it can distinguish between newlines coming from the original input vs a newline as a terminating character being added by the shell printf, otherwise rec in the awk script could never contain a newline as they'd ALL be consumed by matching the default RS.

    We're using a pipe to/from the while-read loop instead of process substitution just to ease clarity since the OP is already using pipes.
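
The question also needs to know which delimiter opened and closed each chunk, which the awk stage above discards. A minimal sketch of one way to keep it, assuming bash for the read loop (read -N is a bash option) and an awk that accepts NUL as RS, such as gawk. The function names split_chars and assemble_chunks, the tab-separated output, and the optional RS argument are illustrative choices, not part of the original posts.

```shell
#!/usr/bin/env bash

# Emit every input character as its own NUL-terminated record
# (the same read loop as in the pipeline above).
split_chars() {
    local char
    while IFS= read -r -d '' -N1 char; do
        printf '%s\0' "$char"
    done
}

# Reassemble one-character records into chunks and print
# "<prefix> TAB <chunk> TAB <suffix>" as soon as a chunk is closed.
# $1 optionally overrides the record separator (default: NUL), which
# makes it possible to try the awk stage with newline-separated input.
assemble_chunks() {
    awk -v RS="${1:-\\0}" '
        /[;=]/ {
            # A delimiter closes the pending chunk (if any) and becomes
            # the prefix of whatever chunk follows.
            if (rec != "") { print prefix "\t" rec "\t" $0; fflush() }
            prefix = $0; rec = ""; next
        }
        { rec = rec $0 }
    '
}

# Usage, mirroring the pipeline above:
#   cat example.txt - | split_chars | assemble_chunks
```

With the sample input =AAAA=BBBB;CCCC;DDDD= this should report AAAA between = and =, BBBB between = and ;, CCCC between ; and ;, and DDDD between ; and =, one tab-separated line per chunk.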