pythonsqlperformancesqliteapsw

Matching row values obtained from first table to the column names of second table without loop


It will be really helpful for me if someone can solve the following problem. I have two tables and a requirement-

Required SQL script:

Without using loops, an efficient SQL script to match row values (sampleID) obtained from first table to the column names (X2, X4, X8...) Of second table and obtain the values for given Gene_ID (e.g. NFYA).

Expected result:

X1  15.2856
X10 18.2201
X14 13.3406
. .
. .

Tableone (rows X columns: 135 X 32)

Table description (partial rows and columns)

+-------+--------------------+--------+----------+-------+---------+
| Batch | filename_generate  | pcode  | SampleID | check | Diagnos |
+-------+--------------------+--------+----------+-------+---------+
| B     | cufflinks_out_2_B  | 01-111 | X2       | TRUE  | RH      |
| D     | cufflinks_out_4D   | 01-163 | X4       | TRUE  | RH      |
| B     | cufflinks_out_5_B  | 01-166 | X5       | TRUE  | RH      |
| D     | cufflinks_out_6D   | 02-007 | X6       | TRUE  | RH      |
| C     | cufflinks_out_8C   | 02-012 | X8       | TRUE  | RH      |
| C     | cufflinks_out_9C   | 02-014 | X9       | TRUE  | RH      |
| B     | cufflinks_out_10_B | 02-017 | X10      | TRUE  | RH      |
| B     | cufflinks_out_13_B | 02-030 | X13      | TRUE  | ON     |
| D     | cufflinks_out_14D  | 02-031 | X14      | TRUE  | RH      |
| B     | cufflinks_out_15B  | 02-037 | X15      | TRUE  | RH      |
| C     | cufflinks_out_16C  | 02-038 | X16      | TRUE  | IS     |
| B     | cufflinks_out_17_B | 02-041 | X17      | TRUE  | ON     |
| B     | cufflinks_out_19_B | 02-050 | X19      | TRUE  | ON     |
| B     | cufflinks_out_20_B | 02-056 | X20      | TRUE  | RH      |
+-------+--------------------+--------+----------+-------+---------+

Tabletwo (rows X columns: 56000 X 137)

Table description (partial rows and columns)

+-----------------+----------+---------+---------+----------+----------+----------+----------+---------+
|   Ensembl_ID    | Gene_ID  |   X1    |   X10   |   X13    |   X14    |   X15    |   X16    |   X17   |
+-----------------+----------+---------+---------+----------+----------+----------+----------+---------+
| ENSG00000000003 | TSPAN6   | 1.388   | 0.443   | 0.563    | 0.350    | 0.390    | 0.220    | 0.528   | 
| ENSG00000000005 | TNMD     | 0       | 0       | 0        | 0        | 0        | 0        | 0       |
| ENSG00000000419 | DPM1     | 34.309  | 40.2635 | 28.8669  | 21.0556  | 18.1733  | 22.0223  | 25.4352 |
| ENSG00000000457 | SCYL3    | 7.84987 | 9.35551 | 7.45483  | 7.1601   | 6.53686  | 7.26445  | 6.30419 |
| ENSG00000000460 | C1orf112 | 2.36851 | 3.76825 | 3.10324  | 2.0262   | 1.84606  | 3.01185  | 3.02763 |
| ENSG00000000938 | FGR      | 227.024 | 222.578 | 247.124  | 234.995  | 255.226  | 265.288  | 323.6   |
| ENSG00000000971 | CFH      | 3.17952 | 3.60279 | 1.68429  | 3.74301  | 2.10637  | 0.763278 | 1.78278 |
| ENSG00000001036 | FUCA2    | 16.5566 | 19.1703 | 25.5005  | 18.5244  | 20.7771  | 18.353   | 25.2364 |
| ENSG00000001084 | GCLC     | 9.45121 | 16.1362 | 12.6239  | 13.1074  | 10.6472  | 18.9938  | 12.8249 |
| ENSG00000001167 | NFYA     | 15.2856 | 18.2201 | 12.4789  | 13.3406  | 15.0146  | 13.2608  | 11.5385 |
| ENSG00000001460 | C1orf201 | 1.64558 | 1.93322 | 0.7927   | 1.71796  | 2.27997  | 0.938738 | 1.3911  |
| ENSG00000001461 | NIPAL3   | 14.6073 | 14.1772 | 11.1503  | 12.5077  | 15.1269  | 13.6     | 11.227  |
| ENSG00000001497 | LAS1L    | 14.4519 | 15.3965 | 11.8901  | 16.8572  | 16.7174  | 14.6004  | 15.7266 |
+-----------------+----------+---------+---------+----------+----------+----------+----------+---------+

Script written in Python, module used SQLite3

I am adding the full code.

import os, sys, time
import sqlite3   
import apsw  
disk_db = apsw.Connection('sampleinfogenotype.db')  
memcon=apsw.Connection(":memory:")  
with memcon.backup("main",disk_db, "main") as backup:  
    backup.step() # copy whole database in one go  

mdata=memcon.cursor()  
for row in memcon.cursor().execute("SELECT tableone.SampleID from tableone WHERE   tableone.Diagnos=='RH'"):
    sampleID_row=str(row[0])  
    sqlscript="SELECT "+sampleID_row+ " FROM tabletwo WHERE tabletwo.Gene_ID=='NFYA'"  
    data=memdata.execute(sqlscript).fetchall()[0]  
    print sampleID_row,data[0]  
memcon.close()  
disk_db.close()

It gives the expected result but it is time consuming, since there is a loop! Is there an efficient SQL script or method... any information will be helpful.


Solution

  • You don't need to re-query the second table over and over again, you could just retrieve the one row once and use that as a lookup-table. If you use the sqlite3.Row row factory that is as easy as retrieving one row:

    memcon.row_factory = sqlite3.Row
    memdata.execute("select * from tabletwo where tabletwo.Gene_ID == 'NFYA'")
    nfya_row = memdata.fetchone()
    
    memdata.execute("SELECT tableone.SampleID from tableone WHERE tableone.Diagnos=='RH'")
    for row in memdata:
        sampleID = row['SampleID']
        print sampleID, nfya_row[sampleID]
    

    The nfya_row value now holds one sqlite3.Row object, whose keys are the column names. When now looping over the tabelone results, you can look up the value of any given column directly.

    If you are using the apsw module instead, then there is no easy ready-made row factory that returns a dictionary, but you can easily build the dictionary yourself for one row:

    memdata.execute("select * from tabletwo where tabletwo.Gene_ID == 'NFYA'")
    nfya_row = {desc[0]: column for desc, column in zip(memdata.getdescription(), memdata.next())}
    
    memdata.execute("SELECT tableone.SampleID from tableone WHERE tableone.Diagnos=='RH'")
    for row in memdata:
        sampleID = row[0]
        print sampleID, nfya_row[sampleID]