How I transform column to row in Terminal Linux? But more complex... The following is an example of my data:
SNP_Name ID_Animal Allele Chr Position
rs01 215 AB 1 100
rs02 215 AA 2 200
rs03 215 BA 3 300
rs04 215 AA 4 400
rs01 300 AB 1 100
rs02 300 BB 2 200
rs03 300 AA 3 300
rs04 300 AB 4 400
rs01 666 BB 1 100
rs02 666 AA 2 200
rs03 666 AB 3 300
rs04 666 AB 4 400
I want to transform this into the following:
SNP_Name Chr Position 215(ID_animal) 300(ID_Animal) 666(ID_Animal)
rs01 1 100 AB AB BB
rs02 2 200 AA BB AA
rs03 3 300 BA AA AB
rs04 4 400 AA AB AB
The row ID_animal
change in column with respective allele. How I do this?
But I will work with 55,000 repetition per ID_animal
. So, I want to be only 55,000 row and (animal number
+SNP_Name
+Chr
+Position
) of column.
Thank you.
The issue here is the amount of data, and I don't want to give a solution that reads everything into memory and then outputs it.
To do this, I'd like to parse and output the data for each SNP (rs
number) in turn rather than for each animal in turn. But the data is given to us in the wrong order (it's sorted by animal).
So the first thing we need to do is to sort the data by SNP (the first column). I will also remove the header row at the same time as it is not needed for the data transformation.
I'm assuming that the data is stored in the file data.in
:
$ sed '1d' data.in | sort -o data.tmp
We now have:
$ cat data.tmp
rs01 215 AB 1 100
rs01 300 AB 1 100
rs01 666 BB 1 100
rs02 215 AA 2 200
rs02 300 BB 2 200
rs02 666 AA 2 200
rs03 215 BA 3 300
rs03 300 AA 3 300
rs03 666 AB 3 300
rs04 215 AA 4 400
rs04 300 AB 4 400
rs04 666 AB 4 400
Then I run the following to produce the result:
$ awk -f script.awk data.tmp >data.new
The awk
script is quite long, so it makes sense to have it in it's own script file rather than as a "one-liner":
FNR == 1 {
# at first line of input
rsid = $1;
chr = $4;
pos = $5;
c = 0;
aid[c] = $2; # animal ID
all[c++] = $3; # allele
do_header = 1; # output header when done parsing this SNP
next;
}
rsid == $1 {
# still reading animal ID/allele for this SNP
aid[c] = $2;
all[c++] = $3;
next;
}
{
if (do_header) {
# output header
printf("SNP_name\tChr\tPosition\t");
for (c in aid) {
printf("%d\t", aid[c]);
}
printf("\n");
do_header = 0;
}
# output line with data from previous SNP
printf("%s\t%d\t%d\t", rsid, chr, pos);
for (c in all) {
printf("%s\t", all[c]);
}
printf("\n");
# store data for this SNP
rsid = $1;
chr = $4;
pos = $5;
c = 0;
aid[c] = $2;
all[c++] = $3;
}
END {
# output line for last SNP
printf("%s\t%d\t%d\t", rsid, chr, pos);
for (c in all) {
printf("%s\t", all[c]);
}
printf("\n");
}
This produces the tab-delimited file data.new
with the following contents, for the given input:
SNP_name Chr Position 215 300 666
rs01 1 100 AB AB BB
rs02 2 200 AA BB AA
rs03 3 300 BA AA AB
rs04 4 400 AA AB AB
NOTE: This requires that all animals were genotyped for exactly the same SNPs. The same animal IDs needs to occur for every SNP. No exceptions.