unix awk sed programming-pearls

Remove characters with pattern from a tab-delimited file


I have several files with lines such as

NODE_1_length_59711_cov_84.026979_g0_i0_1 12.8
NODE_1_length_59711_cov_84.026979_g0_i0_2 18.9
NODE_2_length_59711_cov_84.026979_g0_i0_1 14.3
NODE_2_length_59711_cov_84.026979_g0_i0_2 16.1
NODE_165433_length_59711_cov_84.026979_g0_i0_1 29

I want to remove the leading 'NODE_' and everything from '_length' up to the last '_', so that I get output like this from multiple files:

1_1 12.8
1_2 18.9
2_1 14.3
2_2 16.1
165433_1 29

Solution

  • Using GNU awk:

    awk -F "\t" '{ fld1 = gensub(/(^NODE_)([[:digit:]]+)(.*_)([[:digit:]]+$)/, "\\2_\\4", "g", $1); print fld1 "\t" $2 }' file

    Explanation:

    awk -F "\t" '{    # -F sets the input field separator to tab
        # Match the first field as four parenthesised groups: "NODE_", the node number,
        # everything up to and including the last "_", and the trailing number.
        # gensub returns the field rewritten as group 2, a "_" and group 4; store it in fld1.
        fld1 = gensub(/(^NODE_)([[:digit:]]+)(.*_)([[:digit:]]+$)/, "\\2_\\4", "g", $1)
        print fld1 "\t" $2    # Print fld1, followed by a tab and then the second field
    }' file
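
gensub is a GNU awk extension, and the question mentions multiple files. As a minimal sketch, the same rewrite can be done with plain sub() calls, which any POSIX awk should accept, inside a shell loop. The scratch directory, the sample file names (a.tsv, b.tsv) and the .short output suffix are made up for illustration, and the sketch assumes every first field starts with NODE_ and contains at least one further "_".

```shell
# Hypothetical sample input: two tab-delimited files in a scratch directory.
tmpdir=$(mktemp -d)
printf 'NODE_1_length_59711_cov_84.026979_g0_i0_1\t12.8\nNODE_1_length_59711_cov_84.026979_g0_i0_2\t18.9\n' > "$tmpdir/a.tsv"
printf 'NODE_165433_length_59711_cov_84.026979_g0_i0_12\t29\n' > "$tmpdir/b.tsv"

# Process every .tsv file, writing the result next to it as NAME.short.
for f in "$tmpdir"/*.tsv; do
    awk 'BEGIN { FS = OFS = "\t" }             # tab-separated in and out
    {
        sub(/^NODE_/, "", $1)                  # drop the leading "NODE_"
        id = $1;  sub(/_.*/, "", id)           # keep the text before the first "_"
        idx = $1; sub(/.*_/, "", idx)          # keep the text after the last "_"
        $1 = id "_" idx                        # rebuild the first field
        print
    }' "$f" > "$f.short"
done

cat "$tmpdir"/*.short
```

Because only $1 is reassigned and OFS is a tab, any further columns pass through unchanged.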