I am looking at the source code of cat from the GNU coreutils, in particular the circle detection. They compare device and inode, and that works fine. There is, however, an extra case where they allow the output to be an input: if the input is empty. Looking at the code, this must be the lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size part. I read the man pages and a discussion that I found via git blame, but I still cannot quite understand why this call to lseek is needed.
This is the gist of how cat detects whether it would infinitely exhaust the disk (note that some error checks have been removed for brevity; the full source code is linked above):
struct stat stat_buf;
fstat(STDOUT_FILENO, &stat_buf);
out_dev = stat_buf.st_dev;
out_ino = stat_buf.st_ino;
out_isreg = S_ISREG (stat_buf.st_mode) != 0;
// ...
// for <infile> in inputs {
    input_desc = open (infile, file_open_mode); // or STDIN_FILENO
    fstat(input_desc, &stat_buf);
    /* Don't copy a nonempty regular file to itself, as that would
       merely exhaust the output device. It's better to catch this
       error earlier rather than later. */
    if (out_isreg
        && stat_buf.st_dev == out_dev && stat_buf.st_ino == out_ino
        && lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size) // <--- This is the important line
    {
        // ...
    }
// } (end of for)
I have two possible explanations, but both seem kind of weird.
1. [...] st_size) and lseek or open respects that by offsetting by some default. I wouldn't know why this would be the case, because empty means empty, right?
2. If input_desc were STDIN_FILENO and no file were piped to stdin, lseek would fail with ESPIPE (according to the man page) and return -1. Then this whole statement would be lseek(...) == -1 || stat_buf.st_size > 0. But this cannot be true, because this check only happens if device and inode are the same, and that can only happen if a) stdin and stdout point to the same pty, but then out_isreg would be false, or b) stdin and stdout point to the same file, but then lseek cannot return -1, right?

I have also put together a small program that prints the return values and errno for the important parts, but nothing stood out to me:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    struct stat out_stat;
    struct stat in_stat;
    if (fstat(STDOUT_FILENO, &out_stat) < 0)
        exit(1);
    printf("this is written to stdout / into the file\n");
    /* Read from the named file if one is given, otherwise from stdin. */
    int fd = argc > 1 ? open(argv[1], O_RDONLY) : STDIN_FILENO;
    if (fd < 0 || fstat(fd, &in_stat) < 0)
        exit(1);
    errno = 0; /* make sure the errno printed below comes from lseek */
    off_t res = lseek(fd, 0, SEEK_CUR); /* off_t, not int, to avoid truncation */
    fprintf(stderr,
            "errno after lseek = %d, EBADF = %d, EINVAL = %d, EOVERFLOW = %d, "
            "ESPIPE = %d\n",
            errno, EBADF, EINVAL, EOVERFLOW, ESPIPE);
    fprintf(stderr, "input:\n\tlseek(...) = %lld\n\tst_size = %lld\n",
            (long long)res, (long long)in_stat.st_size);
    printf("outsize is %lld\n", (long long)out_stat.st_size);
}
$ touch empty
$ ./a.out < empty > empty
errno after lseek = 0, EBADF = 9, EINVAL = 22, EOVERFLOW = 75, ESPIPE = 29
input:
lseek(...) = 0
st_size = 0
$ echo x > empty
$ ./a.out < empty > empty
errno after lseek = 0, EBADF = 9, EINVAL = 22, EOVERFLOW = 75, ESPIPE = 29
input:
lseek(...) = 0
st_size = 0
So my research has left my ultimate question unanswered: How does lseek help determine whether a file is empty in this example from the cat source code?
This is my attempt at reverse-engineering this. I could not find any public discussion that explains why lseek() was put there (no other place in GNU coreutils does that).
The guiding question is: when is the condition lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size false?
Test case:
#!/bin/bash
# (edited based on comments)
set -x
# arrange for cat to start off past the end of a non-empty file
echo abcdefghi > /tmp/so/catseek/input
# get the shell to open the input file for reading & writing as file descriptor 7
exec 7<>/tmp/so/catseek/input
# read the whole file via that descriptor (but leave it open)
dd <&7
# ask linux what the current file position of file descriptor 7 is
# should be everything dd read, namely 10 bytes, the size of the file
grep ^pos: /proc/self/fdinfo/7
# run cat, with pre and post content so that we know how to locate the interesting part
# "-" will cause cat to reuse its file descriptor 0 rather than creating a new file descriptor
# the redirections tell the shell to redirect file descriptors 1 and 0 to/from our open file descriptor 7
# which, as you'll remember, already has a file position of 10 bytes
strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post <&7 >&7
# now let's see what's in the file
cat /tmp/so/catseek/input
With:
$ cat /tmp/so/catseek/pre
pre
$ cat /tmp/so/catseek/post
post
cat with lseek (input_desc, 0, SEEK_CUR) < stat_buf.st_size:
+ test.sh:8:echo abcdefghi
+ test.sh:10:exec
+ test.sh:12:dd
abcdefghi
0+1 records in
0+1 records out
10 bytes copied, 2.0641e-05 s, 484 kB/s
+ test.sh:15:grep '^pos:' /proc/self/fdinfo/7
pos: 10
+ test.sh:20:strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post
lseek(0, 0, SEEK_CUR) = 14
+++ exited with 0 +++
+ test.sh:22:cat /tmp/so/catseek/input
abcdefghi
pre
post
cat with 0 < stat_buf.st_size:
+ test.sh:8:echo abcdefghi
+ test.sh:10:exec
+ test.sh:12:dd
abcdefghi
0+1 records in
0+1 records out
10 bytes copied, 3.6415e-05 s, 275 kB/s
+ test.sh:15:grep '^pos:' /proc/self/fdinfo/7
pos: 10
+ test.sh:20:strace -e lseek ./src/cat /tmp/so/catseek/pre - /tmp/so/catseek/post
./src/cat: -: input file is output file
+++ exited with 1 +++
+ test.sh:22:cat /tmp/so/catseek/input
abcdefghi
pre
post
As you can see, when cat starts, the file position may already be at or past end-of-file. Checking just the file size still makes cat skip the file, but it also reports a failure, because the code inside the if statement is:
error (0, 0, _("%s: input file is output file"), infile);
ok = false;
goto contin;
Using lseek() allows cat to say "Oh, the file is the same, and is not empty, BUT our reads will still turn up empty, because that's how reading past EOF works, so we can allow this case".