Goal: To print fields from an input file to an output file, but with special regard to specific fields split by semicolons (;
— see field nine of example input below).
Example Input (input.txt):
NC_051336.1 Gnomon gene 40042 56215 . - . gene_id "LOC102552844"; transcript_id ""; db_xref "GeneID:102552844"; db_xref "RGD:7551986"; gbkey "Gene"; gene "LOC102552844"; gene_biotype "pseudogene"; pseudo "true";
NC_051336.1 BestRefSeq%2CGnomon gene 76909 85762 . + . gene_id "Vom2r3"; transcript_id ""; db_xref "GeneID:502213"; db_xref "RGD:1565892"; description "vomeronasal 2 receptor, 3"; gbkey "Gene"; gene "Vom2r3"; gene_biotype "protein_coding"; gene_synonym "RGD1565892";
NC_051336.1 Gnomon gene 162525 192445 . - . gene_id "LOC103693496"; transcript_id ""; db_xref "GeneID:103693496"; db_xref "RGD:9090555"; gbkey "Gene"; gene "LOC103693496"; gene_biotype "protein_coding";
NC_051336.1 Gnomon transcript 162525 167758 . - . gene_id "LOC103693496"; transcript_id "XM_039098304.1"; db_xref "GeneID:103693496"; gbkey "mRNA"; gene "LOC103693496"; model_evidence "Supporting evidence includes similarity to: 4 Proteins, and 90% coverage of the annotated genomic feature by RNAseq alignments"; product "sperm motility kinase X-like, transcript variant X9"; transcript_biotype "mRNA";
NC_051336.1 BestRefSeq gene 226465 241532 . - . gene_id "Vom2r4"; transcript_id ""; db_xref "GeneID:308248"; db_xref "RGD:1564110"; description "vomeronasal 2 receptor, 4"; gbkey "Gene"; gene "Vom2r4"; gene_biotype "protein_coding"; gene_synonym "RGD1564110";
NC_051336.1 BestRefSeq transcript 226465 241532 . - . gene_id "Vom2r4"; transcript_id "NM_001099458.1"; db_xref "GeneID:308248"; gbkey "mRNA"; gene "Vom2r4"; product "vomeronasal 2 receptor, 4"; transcript_biotype "mRNA";
NC_051336.1 Curated Genomic gene 267100 275769 . - . gene_id "Vom2r-ps8"; transcript_id ""; db_xref "GeneID:502214"; db_xref "RGD:1563053"; description "vomeronasal 2 receptor, pseudogene 8"; gbkey "Gene"; gene "Vom2r-ps8"; gene_biotype "pseudogene"; gene_synonym "RGD1563053"; pseudo "true";
NC_051336.1 Gnomon gene 284195 301267 . - . gene_id "LOC102556157"; transcript_id ""; db_xref "GeneID:102556157"; db_xref "RGD:7626961"; gbkey "Gene"; gene "LOC102556157"; gene_biotype "protein_coding";
NC_051336.1 Gnomon gene 307758 313908 . - . gene_id "LOC108352169"; transcript_id ""; db_xref "GeneID:108352169"; db_xref "RGD:11507047"; gbkey "Gene"; gene "LOC108352169"; gene_biotype "lncRNA";
NC_051336.1 Gnomon transcript 307758 313908 . - . gene_id "LOC108352169"; transcript_id "XR_005497081.1"; db_xref "GeneID:108352169"; gbkey "ncRNA"; gene "LOC108352169"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 98% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC108352169, transcript variant X3"; transcript_biotype "lnc_RNA";
Desired output (output.txt):
LOC102552844 LOC102552844 NC_051336.1:40042-56215 NC_051336.1 40042 56215 - 16173 pseudogene
Vom2r3 Vom2r3 NC_051336.1:76909-85762 NC_051336.1 76909 85762 + 8853 protein_coding
LOC103693496 LOC103693496 NC_051336.1:162525-192445 NC_051336.1 162525 192445 - 29920 protein_coding
Vom2r4 Vom2r4 NC_051336.1:226465-241532 NC_051336.1 226465 241532 - 15067 protein_coding
Vom2r-ps8 Vom2r-ps8 NC_051336.1:267100-275769 NC_051336.1 267100 275769 - 8669 pseudogene
LOC102556157 LOC102556157 NC_051336.1:284195-301267 NC_051336.1 284195 301267 - 17072 protein_coding
LOC108352169 LOC108352169 NC_051336.1:307758-313908 NC_051336.1 307758 313908 - 6150 lncRNA
For only those lines of the input where $3 is "gene"...
Column 1: "gene_id" substring of $9
Column 2: "gene" substring of $9
Column 3: $1 ":" $4 "-" $5
Column 4: $1
Column 5: $4
Column 6: $5
Column 7: $7
Column 8: $5-$4
Column 9: "gene_biotype" substring of $9
Attempts:
$ cat input.txt | awk ' # provide the input to awk
BEGIN{FS="\t"} # use tab as the field separator
{split($9,a,";"); # split field 9 into individual fields using semicolon as the field separator
if($3~"gene") # if the third field is "gene"
print a[1]"\t"a[6]"\t"$1":"$4"-"$5"\t"$1"\t"$4"\t"$5"\t"$7"\t"$5-$4"\t"a[7]}' | # print the indicated fields of each line in a tab-delimited fashion
sed 's/gene_id "//' | # remove any instance of the string gene_id "
sed 's/gene "//' | # remove any instance of the string gene "
sed 's/gene_biotype "//' | # remove any instance of the string gene_biotype "
sed 's/[" ]//g' > output.txt # remove any instance of the character " globally and send output to file
$ cat output.txt
LOC102552844 LOC102552844 NC_051336.1:40042-56215 NC_051336.1 40042 56215 - 16173 pseudogene
Vom2r3 gbkeyGene NC_051336.1:76909-85762 NC_051336.1 76909 85762 + 8853 Vom2r3
LOC103693496 LOC103693496 NC_051336.1:162525-192445 NC_051336.1 162525 192445 - 29920 protein_coding
Vom2r4 gbkeyGene NC_051336.1:226465-241532 NC_051336.1 226465 241532 - 15067 Vom2r4
Vom2r-ps8 gbkeyGene NC_051336.1:267100-275769 NC_051336.1 267100 275769 - 8669 Vom2r-ps8
LOC102556157 LOC102556157 NC_051336.1:284195-301267 NC_051336.1 284195 301267 - 17072 protein_coding
LOC108352169 LOC108352169 NC_051336.1:307758-313908 NC_051336.1 307758 313908 - 6150 lncRNA
This comes really close to what I want... except that output lines 2,4,5 are incorrect. I want the second output field to be a substring of the "gene "
" input field and the ninth output field to be a substring of the "gene_biotype "
" input field (both desired input fields have variable positions in the semicolon separated series of the ninth field).
While printing a[1]
to extract the "gene_id "
" field does work since it is ALWAYS in the first field of the ninth semicolon-delimited input field, printing a[6]
and a[7]
to extract the other fields mentioned above will NOT work because their index position changes for some lines (see 2,4,5 of the output). The obvious solution is to print a field only when it contains the string of interest rather than a specific index position...
$ cat $ANNOTATION | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}"\t"{for(i=1;i<=NF;i++){if($i~/gene "/){print a[$i]}}}"\t"$1":"$4"-"$5"\t"$1"\t"$4"\t"$5"\t"$7"\t"$5-$4"\t"{for(i=1;i<=NF;i++){if($i~/gene_biotype "/){print a[$i]}}}}' | sed 's/gene_id "//' | sed 's/gene "//' | sed 's/gene_biotype "//' | sed 's/[" ]//g' > output.txt
$ cat output.txt
This awk command is riddled with syntax errors since using {
or }
or ;
after print
is not acceptable, thus the command is not executed and there is no output.
The general idea was to replace the index positions with awk-friendly regex code to extract fields containing specific substrings: {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}
. However, this will not work if the syntax is improper, like above.
Note: I am using...
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
GNU Awk 4.0.2
sed (GNU sed) 4.2.2
Assumptions:
gene
, gene_id
and gene_biotype
(otherwise this answer is going to print an empty string where the missing entry would show up in the output)One awk
idea:
awk '
BEGIN { FS=OFS="\t" }
$3 == "gene" { delete attr # reset attr[] array
split($NF,a,";") # split last field on ";"
for (i in a) { # loop through array
split(a[i],b,"\"") # split array entry on double quote; result: b[1]==attribute name, b[2]==attribute value; we do not care about b[3]
gsub(/ /,"",b[1]) # strip spaces from the attribute name
attr[b[1]]=b[2] # populate attr[] array
}
print attr["gene_id"],attr["gene"],$1":"$4"-"$5,$1,$4,$5,$7,($5-$4),attr["gene_biotype"]
}
' input.tsv
This generates:
LOC102552844 LOC102552844 NC_051336.1:40042-56215 NC_051336.1 40042 56215 - 16173 pseudogene
Vom2r3 Vom2r3 NC_051336.1:76909-85762 NC_051336.1 76909 85762 + 8853 protein_coding
LOC103693496 LOC103693496 NC_051336.1:162525-192445 NC_051336.1 162525 192445 - 29920 protein_coding
Vom2r4 Vom2r4 NC_051336.1:226465-241532 NC_051336.1 226465 241532 - 15067 protein_coding
Vom2r-ps8 Vom2r-ps8 NC_051336.1:267100-275769 NC_051336.1 267100 275769 - 8669 pseudogene
LOC102556157 LOC102556157 NC_051336.1:284195-301267 NC_051336.1 284195 301267 - 17072 protein_coding
LOC108352169 LOC108352169 NC_051336.1:307758-313908 NC_051336.1 307758 313908 - 6150 lncRNA