getline is specified to read "from the current input file" and to return 0 at the end-of-file. Both gawk and POSIX docs use this verbiage. It makes sense: Data may be divided between files for a reason. The language is more expressive if getline can distinguish files. Information that is structured enough to warrant getline usually doesn't cross file boundaries.
But both the GNU and macOS/BSD implementations hide the EOF and immediately open the next file. In doing so, they update FILENAME, which is not among the variables that either the GNU or the POSIX docs list as affected.
The only workaround I see is to make sure that each file starts with a throwaway line, and detect when FNR resets to 1. Yuck.
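A minimal sketch of that FNR-based detection, assuming every file is non-empty (an empty file never produces a record, so it goes unnoticed). The demo files and the boundary message are made up for illustration, and the sketch skips the throwaway line and simply reports the boundary when the first record of a later file arrives:

```shell
#!/bin/sh
# Hypothetical demo files in a scratch directory.
dir=$(mktemp -d)
printf 'foo\nbar\n' > "$dir/file1"
printf 'some\nother\nstuff\n' > "$dir/file3"

out=$(cd "$dir" && awk '
BEGIN {
    # Plain getline updates $0, NF, NR and FNR (and FILENAME when a new file is opened).
    while ((getline) > 0) {
        # FNR is back at 1 while NR has moved on: a new file has started.
        if (FNR == 1 && NR > 1)
            print "--- new file: " FILENAME
        print FILENAME ": " $0
    }
}' file1 file3)

printf '%s\n' "$out"
rm -rf "$dir"
```

Note that the boundary is only seen after the first record of the next file has already been consumed, which is exactly why the throwaway line is needed when that record must not be mixed into the previous file's data.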
It's a strange coincidence for both implementations to have this bug. Looking at the source, neither behavior is an oversight: both take specific steps to advance to the next file, in contrast to the code branch for getline from a named I/O handle. It's especially weird for the verbose GNU docs to contradict the behavior.
Am I missing something? Have I stumbled across an uncommon case or is this known to Awk lore?
It sounds like you just want a way to know when a getline loop reaches end of file, so here is a way to do that.
Using these input files, which include one empty file (not impossible to handle in any awk, but significantly harder without ENDFILE, because getline won't return between attempting to read that empty file and reading the first line of the next file, and because ARGV[] could contain variable assignments intermingled with file names and could contain multiple occurrences of the same file name):
$ head file{1..3}
==> file1 <==
foo
bar
==> file2 <==
==> file3 <==
some
other
stuff
With GNU awk you could do:
$ awk '
    BEGIN {
        while ( getline > 0 ) {
            print FILENAME, $0
        }
    }
    ENDFILE {
        print "Finished reading", FILENAME
    }
' file{1..3}
file1 foo
file1 bar
Finished reading file1
Finished reading file2
file3 some
file3 other
file3 stuff
Finished reading file3
but without a concrete example of the problem you're trying to solve, I don't know if that would be a solution to it or not.
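If gawk isn't available, one possible alternative is to walk ARGV yourself and read each file with a redirected getline, which does return 0 at the end of each individual file, empty ones included. This is only a sketch: the demo files are the same as above, and the regexp used to skip `name=value` operands is my own rough assumption about what counts as an assignment (the skipped assignments are not applied).

```shell
#!/bin/sh
# Recreate the three demo files (file2 is empty) in a scratch directory.
dir=$(mktemp -d)
printf 'foo\nbar\n' > "$dir/file1"
: > "$dir/file2"
printf 'some\nother\nstuff\n' > "$dir/file3"

out=$(cd "$dir" && awk '
BEGIN {
    for (i = 1; i < ARGC; i++) {
        file = ARGV[i]
        # Assumption: skip empty operands and anything shaped like var=value.
        if (file == "" || file ~ /^[A-Za-z_][A-Za-z0-9_]*=/)
            continue
        # Redirected getline returns 0 at the end of this file (or -1 on error).
        while ((getline line < file) > 0)
            print file, line
        close(file)   # lets the same name be read again if it is listed twice
        print "Finished reading", file
    }
}' file1 file2 file3)

printf '%s\n' "$out"
rm -rf "$dir"
```

Bear in mind that this form of getline sets only the target variable, so $0, NF, NR, FNR and FILENAME are left untouched and you have to track the file name yourself, as `file` does here.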
By the way...
The docs should say "end of input" instead of "end of file". You could open a bug report with the gawk providers for that if you like, see https://www.gnu.org/software/gawk/manual/gawk.html#Bug-address, and/or the POSIX people, see http://www.opengroup.org/austin/.
You say in the question:
In doing so, they update FILENAME, which is not among the variables that either the GNU or the POSIX docs list as affected.
but the GNU awk manual says the following in the Points to Remember About getline section:
It is worth noting that those variants that do not use redirection can cause FILENAME to be updated if they cause awk to start reading a new input file.
That is made explicit in the table of impacted variables in the AllAboutGetline* article though.
Make sure to read http://awk.freeshell.org/AllAboutGetline (if that site is down see the archive at https://web.archive.org/web/20221109201352/http://awk.freeshell.org/AllAboutGetline) if you're ever considering using getline.