linux awk count wc

Counting Records in Linux Files Excluding Some Files


I need to count the total number of records across 6 files, each containing 4 million records, and the count should be as fast as possible. However, there is another file with a similar name that must be left out of the count.

fileSales_1.txt (4 million records)

fileSales_2.txt (4 million records)

fileSales_3.txt (4 million records)

fileSales_4.txt (4 million records)

fileSales_5.txt (4 million records)

fileSales_6.txt (4 million records)

fileSales_unique.txt (24 million records)

I'm counting the records with the following command: awk 'END {print NR}' fileSales_*.txt

However, this also counts the fileSales_unique.txt file, giving a total of 48 million records.

Could you help me with a command that only counts the records in files 1 to 6? The result should be 24 million records, something like: awk 'END {print NR}' fileSales_(1 to 6).txt


Solution

  • Suppose you have these files (using wc -l to show both file names and line counts):

      4000000 fileSales_1.txt
      4000000 fileSales_2.txt
      4000000 fileSales_3.txt
      4000000 fileSales_4.txt
      4000000 fileSales_5.txt
      4000000 fileSales_6.txt
     24000000 fileSales_unique.txt
     24000000 fileSales_unique_also.txt
     72000000 total
    

    There are many ways to achieve your goal, but two main approaches:

    1. Use a glob that only includes the desired files;
    2. Use an exclusion list or pattern that excludes the undesired files.

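    Inclusion patterns:

    1. wc -l fileSales_{1..6}.txt (brace expansion: lists exactly the names 1 through 6)
    2. wc -l fileSales_?.txt (? matches any single character, so it would also pick up a fileSales_7.txt if one existed)
    3. wc -l fileSales_[1-6].txt (bracket expression: matches one character in the range 1 to 6)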

    With the files above, any of those gives:

    $ wc -l fileSales_[1-6].txt
      4000000 fileSales_1.txt
      4000000 fileSales_2.txt
      4000000 fileSales_3.txt
      4000000 fileSales_4.txt
      4000000 fileSales_5.txt
      4000000 fileSales_6.txt
     24000000 total
    

    (The same patterns work as file arguments to awk.)
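
    As a minimal sketch of that, here is the awk command from the question with an inclusion pattern. The temporary directory and the tiny 4-line files are stand-ins invented for the demo, not the real data:

```shell
# Scaled-down stand-ins for the real files (assumption: 4 records each
# instead of 4 million), created in a throwaway directory.
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'a\nb\nc\nd\n' > "$tmpdir/fileSales_$i.txt"
done
seq 24 > "$tmpdir/fileSales_unique.txt"   # similar name, must not be counted

# NR keeps counting across all input files, so END sees the grand total.
total=$(awk 'END { print NR }' "$tmpdir"/fileSales_[1-6].txt)
echo "$total"   # 24
```

    Because [1-6] cannot match "unique", the extra file never reaches awk.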

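    Or, maintain a skip array in Bash:

    skip=( *_unique* )      # files to leave out of the count
    to_cnt_files=()         # files that will be counted
    for fn in fileSales*.txt; do 
        # If deleting $fn from the skip list changes the list, $fn is in it: skip.
        [[ "${skip[@]/$fn/}" != "${skip[@]}" ]] && continue
        to_cnt_files+=( "$fn" )
    done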
    

    Then your method works:

    awk 'END{print NR}' "${to_cnt_files[@]}"
    # 24000000
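
    An exclusion pattern can also be written without the array. The following is one possible sketch using find, which is a swapped-in technique rather than the loop above; the temporary directory and small sample files are assumptions made for the demo:

```shell
# Scaled-down stand-ins for the real files, in a throwaway directory.
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'a\nb\nc\nd\n' > "$tmpdir/fileSales_$i.txt"
done
seq 24 > "$tmpdir/fileSales_unique.txt"
seq 24 > "$tmpdir/fileSales_unique_also.txt"

# Match the broad pattern, then drop every name containing "unique".
# The surviving files are concatenated and their lines counted once.
total=$(find "$tmpdir" -maxdepth 1 -type f -name 'fileSales_*.txt' \
            ! -name '*unique*' -exec cat {} + | wc -l)
echo "$total"   # 24
```

    Quoting the -name patterns keeps the shell from expanding them before find sees them.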
    

    Note that wc will likely be much faster than awk here, since wc only has to count newline characters while awk reads and processes every record.
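
    If only the single total is wanted rather than the per-file listing, one option is to feed the files to wc through a pipe. This is a small sketch under the same assumptions as before (made-up 4-line sample files in a temporary directory):

```shell
# Scaled-down stand-ins for the real files, in a throwaway directory.
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'a\nb\nc\nd\n' > "$tmpdir/fileSales_$i.txt"
done
seq 24 > "$tmpdir/fileSales_unique.txt"   # similar name, must not be counted

# Reading from a pipe, wc has no file names to report, so it prints one number.
total=$(cat "$tmpdir"/fileSales_[1-6].txt | wc -l)
echo "$total"   # 24
```

    To compare the two tools on the real data, each command can be prefixed with time.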