linuxfedoragoogle-genomicsbigdata

How to transform column to row in Terminal Linux? But more complex


How I transform column to row in Terminal Linux? But more complex... The following is an example of my data:

SNP_Name        ID_Animal        Allele        Chr        Position
rs01            215               AB            1            100
rs02            215               AA            2            200
rs03            215               BA            3            300
rs04            215               AA            4            400
rs01            300               AB            1            100
rs02            300               BB            2            200
rs03            300               AA            3            300
rs04            300               AB            4            400
rs01            666               BB            1            100
rs02            666               AA            2            200
rs03            666               AB            3            300
rs04            666               AB            4            400

I want to transform this into the following:

SNP_Name     Chr     Position   215(ID_animal)  300(ID_Animal) 666(ID_Animal)
rs01         1       100        AB              AB            BB
rs02         2       200        AA              BB            AA
rs03         3       300        BA              AA            AB
rs04         4       400        AA              AB            AB

The row ID_animal change in column with respective allele. How I do this? But I will work with 55,000 repetition per ID_animal. So, I want to be only 55,000 row and (animal number+SNP_Name+Chr+Position) of column.

Thank you.


Solution

  • The issue here is the amount of data, and I don't want to give a solution that reads everything into memory and then outputs it.

    To do this, I'd like to parse and output the data for each SNP (rs number) in turn rather than for each animal in turn. But the data is given to us in the wrong order (it's sorted by animal).

    So the first thing we need to do is to sort the data by SNP (the first column). I will also remove the header row at the same time as it is not needed for the data transformation.

    I'm assuming that the data is stored in the file data.in:

    $ sed '1d' data.in | sort -o data.tmp
    

    We now have:

    $ cat data.tmp
    rs01            215               AB            1            100
    rs01            300               AB            1            100
    rs01            666               BB            1            100
    rs02            215               AA            2            200
    rs02            300               BB            2            200
    rs02            666               AA            2            200
    rs03            215               BA            3            300
    rs03            300               AA            3            300
    rs03            666               AB            3            300
    rs04            215               AA            4            400
    rs04            300               AB            4            400
    rs04            666               AB            4            400
    

    Then I run the following to produce the result:

    $ awk -f script.awk data.tmp >data.new
    

    The awk script is quite long, so it makes sense to have it in it's own script file rather than as a "one-liner":

    FNR == 1    {
        # at first line of input
    
        rsid        = $1;
        chr         = $4;
        pos         = $5;
    
        c           = 0;
        aid[c]      = $2; # animal ID
        all[c++]    = $3; # allele
    
        do_header   = 1;  # output header when done parsing this SNP
    
        next;
    }
    
    rsid == $1 {
        # still reading animal ID/allele for this SNP
        aid[c]      = $2;
        all[c++]    = $3;
        next;
    }
    
    {
        if (do_header) {
            # output header
    
            printf("SNP_name\tChr\tPosition\t");
            for (c in aid) {
                printf("%d\t", aid[c]);
            }
            printf("\n");
    
            do_header = 0;
        }
    
        # output line with data from previous SNP    
        printf("%s\t%d\t%d\t", rsid, chr, pos);
        for (c in all) {
            printf("%s\t", all[c]);
        }
        printf("\n");
    
        # store data for this SNP
        rsid        = $1;
        chr         = $4;
        pos         = $5;
    
        c           = 0;
        aid[c]      = $2;
        all[c++]    = $3;
    }
    
    END {
        # output line for last SNP
    
        printf("%s\t%d\t%d\t", rsid, chr, pos);
        for (c in all) {
            printf("%s\t", all[c]);
        }
        printf("\n");
    }
    

    This produces the tab-delimited file data.new with the following contents, for the given input:

    SNP_name    Chr Position    215 300 666
    rs01    1   100 AB  AB  BB
    rs02    2   200 AA  BB  AA
    rs03    3   300 BA  AA  AB
    rs04    4   400 AA  AB  AB
    

    NOTE: This requires that all animals were genotyped for exactly the same SNPs. The same animal IDs needs to occur for every SNP. No exceptions.