bash wc

How does wc know in advance how far to indent output?


When running wc -l to get line counts for multiple files in bash, the output is formatted with a column of line counts and a column of file names. The line counts always seem to be appropriately space-indented so that they are right-aligned. For example:

> seq 1 10 > tmp1
> seq 1 1000 > tmp2
> seq 1 1000000 > tmp3
> wc -l tmp*

     10 tmp1
   1000 tmp2
1000000 tmp3
1001010 total

So my question is: how does wc know in advance how far to indent the line count for the first file without fully reading all of the files? In many years of relying on wc for data management, I don't think I have ever seen the line counts misaligned, yet when working with very large files the output appears one file at a time (i.e. the output is not being post-processed).


Solution

  • So my question is, how does wc know in advance how far to indent the line count for the first file without fully reading all files?

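    It doesn't know exactly, for certain, under all circumstances. @LéaGris provided a link to the relevant source code for GNU's implementation. Before reading any data, it calls stat on each of the named files, adds up the sizes reported for the regular files, and picks a field width with enough digits for that total.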

    Looking up file sizes is much faster than reading the file data, and its cost is independent of file size. For regular files it yields a width appropriate for the total byte count, which cannot be smaller than any of the other counts (lines, words, characters) that wc outputs. It might not suffice for proper alignment if any of the specified files is a device file, pipe, or other special file, because the size of such a file is not known up front. In that case the chosen width may be more or less than needed, but if it is more, the output is still aligned, and the data read from such files needs to get fairly big before wc's guess is too small.

    There are also simpler alternatives. For example, an implementation could just choose a fixed width: perhaps, but not necessarily, one large enough for the maximum representable combined size.
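As a rough illustration of sizing the column from file metadata instead of file contents, here is a minimal bash sketch. It is not GNU wc's code: `estimate_width` and `print_row` are hypothetical helper names, the `stat -c %s` syntax assumes GNU stat, and the floor of 7 for special files is an assumption that should be checked against the wc source.

```shell
# Sketch only: estimate_width and print_row are hypothetical helpers,
# not functions taken from GNU wc.

# Estimate a column width from file sizes alone, without reading any data.
estimate_width() {
    local total=0 min_width=1 width f size
    for f in "$@"; do
        if [ -f "$f" ]; then
            # GNU stat syntax (assumption); BSD/macOS would need: stat -L -f %z
            size=$(stat -L -c %s -- "$f") || continue
            total=$((total + size))
        elif [ -e "$f" ]; then
            # Pipes and devices have no useful size up front, so use a floor.
            # The value 7 is an assumption made for this sketch.
            min_width=7
        fi
    done
    width=${#total}   # number of decimal digits in the summed byte count
    if [ "$width" -lt "$min_width" ]; then
        width=$min_width
    fi
    printf '%s\n' "$width"
}

# Print one right-aligned "count name" row using the given width.
print_row() {
    printf '%*s %s\n' "$1" "$2" "$3"
}
```

For the three files in the question, the sizes should add up to 6,892,810 bytes, so `estimate_width tmp1 tmp2 tmp3` would print 7, which matches the seven-character column in the output shown there. Passing a constant to `print_row` instead (for example 20, the number of digits in the largest unsigned 64-bit value) gives the fixed-width alternative.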