I'm trying to create random pairs from a column using DuckDB.
I have a column of protein accession numbers which looks like this:
┌──────────────┐
│ protein_upkb │
│ varchar │
├──────────────┤
│ G1XNZ0 │
│ G1XP19 │
│ G1XP66 │
│ G1XP70 │
│ G1XPL1 │
│ G1XPQ7 │
│ G1XQ23 │
│ G1XQ44 │
│ G1XQ89 │
│ G1XQH2 │
├──────────────┤
│ 10 rows │
└──────────────┘
I'm trying to create random pairs of these protein ids such that they look like this:
┌────────────┬────────────┐
│ p1 │ p2 │
│ varchar │ varchar │
├────────────┼────────────┤
│ G1XNZ0 │ G1XP19 │
│ G1XP19 │ G1XP66 │
│ G1XP66 │ G1XP70 │
│ G1XP70 │ G1XPL1 │
│ G1XPL1 │ G1XPQ7 │
│ G1XPQ7 │ G1XQ23 │
│ G1XQ23 │ G1XQ44 │
│ G1XQ44 │ G1XQ89 │
│ G1XQ89 │ G1XQH2 │
│ G1XQH2 │ G1XNZ0 │
├────────────┴────────────┤
│ 10 rows 2 columns │
└─────────────────────────┘
N.B: This is just an example, I've thousands of IDs in the table in question.
I began by scrambling the order of the proteins by assigning some random number to each row and sorting by it.
CREATE VIEW proteins AS
SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet');
SELECT * FROM proteins ORDER BY x DESC LIMIT 10;
Which results in
┌──────────────┬────────────────────┐
│ protein_upkb │ x │
│ varchar │ double │
├──────────────┼────────────────────┤
│ A0A1H6HM63 │ 0.9999986232724041 │
│ A0A1G6CK58 │ 0.9999978158157319 │
│ A0A2C5XBA3 │ 0.9999923389405012 │
│ A0A1H9T955 │ 0.9999855090864003 │
│ Q05Q16 │ 0.9999655580613762 │
│ R5PE70 │ 0.999956940067932 │
│ R5GUN0 │ 0.9999453630298376 │
│ A0A0L0UJ42 │ 0.9999357375781983 │
│ W6ZJY1 │ 0.9999311361461878 │
│ F6D0F2 │ 0.9999301459174603 │
├──────────────┴────────────────────┤
│ 10 rows 2 columns │
└───────────────────────────────────┘
I then tried to create random pairs using subqueries. One column would be sorted by x
descending, the other by x
ascending.
Confusingly (to me), this only creates one random pair rather than the 255,622 I both expected and need.
cursor = duckdb.sql("""
SELECT
(SELECT protein_upkb FROM proteins ORDER BY x DESC) as p1,
(SELECT protein_upkb FROM proteins ORDER BY x ASC) as p2,
LIMIT 10;
""").show()
┌─────────┬─────────┐
│ p1 │ p2 │
│ varchar │ varchar │
├─────────┼─────────┤
│ Q28RH7 │ D8LJ06 │
└─────────┴─────────┘
I figured that I can create two VIEWs, proteins1
and proteins2
. I can then independently randomly sort them using random()
as I've done before.
Finally, I can create pairs by selecting the protein_upkb
column from each table.
Once more, I'm a bit surprised by the outcome.
p2
is a sequence of random proteins, while p1
is just one of the proteins.
cursor = duckdb.sql("""
CREATE VIEW proteins1 AS
SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet')
ORDER BY x ASC;
CREATE VIEW proteins2 AS
SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet')
ORDER BY x ASC;
SELECT ps1.protein_upkb as p1, ps2.protein_upkb as p2,
FROM proteins1 as ps1, proteins2 as ps2
LIMIT 10;
""").show()
┌────────────┬────────────┐
│ p1 │ p2 │
│ varchar │ varchar │
├────────────┼────────────┤
│ A0A394DPL7 │ A0A1I3L166 │
│ A0A394DPL7 │ A0A0Q3WJP1 │
│ A0A394DPL7 │ A0A093SP34 │
│ A0A394DPL7 │ A0A127EQY9 │
│ A0A394DPL7 │ K6UP11 │
│ A0A394DPL7 │ A0A1I6M9F9 │
│ A0A394DPL7 │ A0A0Q3SWF8 │
│ A0A394DPL7 │ A0A069RD68 │
│ A0A394DPL7 │ S9ZHA8 │
│ A0A394DPL7 │ Q5P5L0 │
├────────────┴────────────┤
│ 10 rows 2 columns │
└─────────────────────────┘
You can test this out in this Colab notebook.
There are multiple ways of doing it. The exact way would depend on your additional requirements.
This one is very simple, get random sample twice, concatenate them together. You can get same protein twice, but you can exclude it with additional where
clause though.
duckdb.sql("""
with cte as (
select protein_upkb from proteins using sample(10)
)
select *
from cte as c1
positional join cte as c2
""")
┌──────────────┬──────────────┐
│ protein_upkb │ protein_upkb │
│ varchar │ varchar │
├──────────────┼──────────────┤
│ A0A0F6TCU1 │ A0A4C1ULV9 │
│ D4YJT4 │ A0A3Q3FTK5 │
│ A0A319DTU8 │ C6LIN2 │
│ A0A1Q3D9X9 │ A0A1B3BCY8 │
│ M5F4R3 │ M1NUZ3 │
│ A0A553PJQ2 │ A0A165P0W9 │
│ G7M9F2 │ A0A182JZX3 │
│ A0A0Q1CIG2 │ G3HMK9 │
│ C7YU85 │ A0A3Q2E7T6 │
│ A0A199VI77 │ A0A0R1JQR6 │
├──────────────┴──────────────┤
│ 10 rows 2 columns │
└─────────────────────────────┘
Or assign row_number at random, and then pivot in a way that even an odd rows create pairs:
duckdb.sql("""
with cte1 as (
select ps1.protein_upkb, row_number() over(order by random()) as rn
from proteins as ps1
), cte2 as (
select
protein_upkb,
rn % 2 as col,
rn // 2 as r
from cte1
)
pivot cte2
on col
using any_value(protein_upkb)
limit 10
""")
───────┬────────────┬────────────┐
│ r │ 0 │ 1 │
│ int64 │ varchar │ varchar │
├───────┼────────────┼────────────┤
│ 66322 │ A0A1N7AVA0 │ A0A175R4H7 │
│ 66325 │ K9FKM7 │ D8QP02 │
│ 66327 │ A0A1I5KRT3 │ W0V524 │
│ 66328 │ A0A4U2YU79 │ A0A452RP46 │
│ 66334 │ A8RCK1 │ A0A165U1L8 │
│ 66335 │ A0A3Q3QVI9 │ C7MCJ1 │
│ 66336 │ Q3SLR9 │ A0A3B4B0Q2 │
│ 66338 │ A0A1W1XBB2 │ A0A0B7J5C1 │
│ 66339 │ A0A1I4KH70 │ A0A3S4SEU1 │
│ 66340 │ A0A1W0D573 │ Q4ZR49 │
├───────┴────────────┴────────────┤
│ 10 rows 3 columns │
└─────────────────────────────────┘