Tags: bash, loops, wc, vcf-variant-call-format

Append the output of each iteration of a loop to the same file in bash


I have 44 files (2 for each chromosome) of two types: .vcf and .filtered.vcf. I would like to run wc -l on each of them in a loop and always append the output to the same file. This file should have 3 columns: the chromosome (chr1 to chr22), the wc -l of the .vcf file, and the wc -l of the .filtered.vcf file.

I've been running wc -l on each file independently and pasting the 2 outputs together column-wise for each chromosome, but this is not very efficient because it generates a lot of unnecessary files. This is the code I'm running for each of the 22 pairs of files:

wc -l file1.vcf | cut -f 1 > out1.vcf
wc -l file1.filtered.vcf | cut -f 1 > out1.filtered.vcf
paste -d "\t" out1.vcf out1.filtered.vcf

I would like to have just one output file containing three columns:

Chromosome    VCFCount    FilteredVCFCount
chr1          out1        out1.filtered
chr2          out2        out2.filtered

Any help will be appreciated, thank you very much in advance :)


Solution

  • printf "%s\n" *.filtered.vcf |
    cut -d. -f1 |
    sort |
    xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' -- 
    
    1. Output a newline-separated list of the .filtered.vcf files in the directory.
    2. Remove the extension with cut (something along the lines of xargs -I{} basename {} .filtered.vcf would probably be safer, since cut -d. -f1 truncates at the first dot in the name).
    3. Sort the list for nicely ordered output (something like sort -tr -k2 -n would sort numerically by chromosome number, which would be even better).
    4. xargs -n1 runs the sh -c script once for each name:
      1. printf "%s\t%s\t%s\n" - output with a custom format string ...
      2. "$1" - the file name without the extension and...
      3. "$(wc -l <"${1}.vcf")" - the count of the lines in the .vcf file and...
      4. "$(wc -l <"${1}.filtered.vcf")" - the count of the lines in the .filtered.vcf file.
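
The same table can also be produced with a plain shell loop whose output is redirected once, which is closer to the loop the question asks about. This is only a sketch: the function name count_table and the output name counts.tsv are made up for illustration, and it assumes the files are named chrN.vcf and chrN.filtered.vcf.

```shell
# count_table DIR: hypothetical helper, not part of any tool.
# Assumes DIR holds pairs named chrN.vcf / chrN.filtered.vcf.
# The body runs in a subshell "( ... )", so the cd and the variables stay local.
count_table() (
    cd "${1:-.}" || exit 1
    printf 'Chromosome\tVCFCount\tFilteredVCFCount\n'
    for f in *.filtered.vcf; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        base=${f%.filtered.vcf}          # chr1.filtered.vcf -> chr1
        # %d also drops the padding that some wc implementations print
        printf '%s\t%d\t%d\n' "$base" "$(wc -l < "$base.vcf")" "$(wc -l < "$f")"
    done | sort -t r -k 2 -n             # split on "r" so chr2 sorts before chr10
)

# Usage (counts.tsv is an arbitrary name):
#   count_table . > counts.tsv
```

Because the redirection is applied to the whole function call, every iteration writes to the same file and no intermediate files are created.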

    Example:

    > touch chr{1..3}{,.filtered}.vcf 
    > echo > chr1.filtered.vcf ; echo  > chr2.vcf ; 
    >     printf "%s\n" *.filtered.vcf |
    >    cut -d. -f1 |
    >    sort |
    >    xargs -n1 sh -c 'printf "%s\t%s\t%s\n" "$1" "$(wc -l <"${1}.vcf")" "$(wc -l <"${1}.filtered.vcf")"' --
    chr1    0   1
    chr2    1   0
    chr3    0   0
    

    To get a nice-looking table with headers, use column (the -N option is a util-linux extension):

    > .... | column -N Chromosome,VCFCount,FilteredVCFCount -t -o '    '
    Chromosome    VCFCount    FilteredVCFCount
    chr1          0           1
    chr2          1           0
    chr3          0           0
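
If column on your system does not support -N (older or BSD versions may lack it), a similar layout can be approximated with awk. A minimal sketch, where the function name format_table and the column widths 14 and 12 are arbitrary choices made to match the output above:

```shell
# format_table: hypothetical helper that reads tab-separated rows on stdin
# and prints them left-aligned under a header line.
format_table() {
    awk -F '\t' '
        BEGIN { printf "%-14s%-12s%s\n", "Chromosome", "VCFCount", "FilteredVCFCount" }
              { printf "%-14s%-12s%s\n", $1, $2, $3 }
    '
}

# Usage: ... | format_table
```

Unlike column -t, the widths here are fixed, so names longer than the chosen widths would break the alignment.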