linuxbashpivotgenomegwas

Pivotting the SNP Table , Converting CSV file to JSON using Bash


I'm working with GWAS data. Need help.

My data looks like this:

IID,rs098083,kgp794789,rs09848309,kgp8300747,.....
63,CC,AG,GA,AA,.....
54,AT,CT,TT,AG,.....
12,TT,GA,AG,AA,.....
.
.
.

As above I have a total of 512 rows and 2 Million columns.

Desired output:

SNP,Genotyping
rs098083,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 54, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.
}     
kgp794789,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.

}
rs09848309,{
"CC" : [ 1, 63, 6, 18, 3, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 24 13, ...],
"AA" : [86, 134, 4, 19, ...],
"AT" : [8, 48, 34, 44, ....],
.
.
.

As above after pivoting, I should have a JSON file of 2 million rows & 2 Columns. The SNP column of the row contains the ID of the SNP. The genotyping column will contain a JSON BLOB. This BLOB will be a set of key-value pairs. The key is a particular genotype (e.g., CC, CT, TT, ....) and the value is a list of the IIDs with a genotype matching the key.

Output Format would be " a CSV with embedded JSON"


Solution

  • Here's an approach using stedolan/jq:

    jq -Rrn '
      [ inputs / "," ] | transpose | .[0][1:] as $h | .[1:][]
      | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
      | join(", ")
    '
    
    rs098083, {"CC":["63"],"AT":["54"],"TT":["12"]}
    kgp794789, {"AG":["63"],"CT":["54"],"GA":["12"]}
    rs09848309, {"GA":["63"],"TT":["54"],"AG":["12"]}
    kgp8300747, {"AA":["63","12"],"AG":["54"]}
    

    Demo

    Add tonumber if the IDs should be encoded as JSON numbers

    jq -Rrn '
      [ inputs / "," ] | transpose | (.[0][1:] | map(tonumber)) as $h | .[1:][]
      | .[1:] |= [reduce ([.,$h] | transpose[]) as $t ({}; .[$t[0]] += [$t[1]]) | @text]
      | join(", ")
    '
    
    rs098083, {"CC":[63],"AT":[54],"TT":[12]}
    kgp794789, {"AG":[63],"CT":[54],"GA":[12]}
    rs09848309, {"GA":[63],"TT":[54],"AG":[12]}
    kgp8300747, {"AA":[63,12],"AG":[54]}
    

    Demo


    If your ultimate goal is to have a JSON representation anyways, omit formatting the raw output, and something like this might do:

    jq -Rn '
      [ inputs / "," ] | transpose | .[0][1:] as $h | reduce .[1:][] as $t (
        {}; .[$t[0]] = reduce ([$t[1:],$h] | transpose[]) as $i (
          {}; .[$i[0]] += [$i[1]]
        )
      )
    '
    
    {
      "rs098083": { "CC": ["63"], "AT": ["54"], "TT": ["12"] },
      "kgp794789": { "AG": ["63"], "CT": ["54"], "GA": ["12"] },
      "rs09848309": { "GA": ["63"], "TT": ["54"], "AG": ["12"] },
      "kgp8300747": { "AA": ["63", "12"], "AG": ["54"] }
    }
    

    Demo (formatted manually for easier comaprison with previous solutions)