python, python-polars, duckdb

Create a Polars DataFrame containing unique values from a set of CSVs


I have 3000+ CSVs, each with 10+ columns. What I need is to get all unique values from just two of those columns. I am able to read the unique values in Polars:

import polars as pl

df1 = pl.read_csv(
    "test1.biobank.tsv.gz",
    separator="\t",
    schema_overrides={"#chrom": pl.String},
    n_threads=8,
    columns=["#chrom", "pos"],
    new_columns=["chr", "pos"],
).unique()

I can read the remaining files one by one, i.e.:

df2 = pl.read_csv(
    "test2.biobank.tsv.gz",
    separator="\t",
    schema_overrides={"#chrom": pl.String},
    n_threads=8,
    columns=["#chrom", "pos"],
    new_columns=["chr", "pos"],
).unique()

check whether the two frames differ:

if not df1.equals(df2):
    df = df1.vstack(df2)
    del df1, df2

then call .unique() on the result. But since all the input files are already sorted on the two columns (chr, pos), and the differences amount to only thousands of rows out of 16M input rows, I hope there is a better way to do it.
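Because each file is already sorted on (chr, pos), the unique union can be built in a single streaming merge instead of stacking everything and deduplicating afterwards. A minimal pure-Python sketch of that idea, with made-up sample rows standing in for the lazily-read files:

```python
import heapq
from itertools import groupby

# Two already-sorted (chr, pos) streams, as they would come out of two files.
file1 = [("1", 100), ("1", 250), ("2", 90)]
file2 = [("1", 100), ("1", 300), ("2", 90)]

# heapq.merge keeps the combined stream sorted; on sorted input, dropping
# consecutive duplicates (groupby) is all deduplication requires, so nothing
# ever has to be held in memory at once.
unique_rows = [key for key, _ in groupby(heapq.merge(file1, file2))]
print(unique_rows)  # [('1', 100), ('1', 250), ('1', 300), ('2', 90)]
```

The same pattern generalizes to any number of sorted inputs, since `heapq.merge` accepts arbitrarily many iterables.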

Thank you for your help in advance

DK

Edit:

There is another way to do it, using Polars and DuckDB together: first reduce each TSV to a small Parquet file of unique (chr, pos) pairs, then deduplicate across all of them in DuckDB.

import glob

import polars as pl

tsv_pattern = "gwas_*.gz"

for fn in glob.glob(tsv_pattern):
    print(fn)
    parquet_fn = fn.replace(".gz", ".chr_pos.parquet")
    df = pl.read_csv(
        fn,
        separator="\t",
        schema_overrides={"#chrom": pl.Utf8},
        n_threads=8,
        columns=["#chrom", "pos"],
        new_columns=["chr", "pos"],
    ).unique()
    df.write_parquet(parquet_fn, compression="zstd")  # Polars uses write_parquet, not to_parquet
    del df

Then deduplicate across all the Parquet files in DuckDB:

CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet'

Credit goes to Mark Mytherin from DuckDB.


Solution

  • You can use glob patterns to read the CSVs and then call unique:

    (pl.scan_csv("**/*.csv")
     .unique()
     .collect())