I have 3000+ CSV files with 10+ columns each. What I need is to get all the unique values from just two of those columns. I am able to read the unique values in Polars:
import polars as pl
df1 = pl.read_csv("test1.biobank.tsv.gz", separator='\t', schema_overrides={"#chrom": pl.String}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).unique()
I can read the remaining files one by one, e.g.:
df2 = pl.read_csv("test2.biobank.tsv.gz", separator='\t', schema_overrides={"#chrom": pl.String}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).unique()
check whether the two frames differ:
if not df1.equals(df2):
    df = df1.vstack(df2)
    del df1
    del df2
and then call .unique() on the stacked frame.
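Spelled out over all the files, that per-file pass would look roughly like this (just a sketch of the current approach; the test*.biobank.tsv.gz glob pattern is an assumption):

import glob

import polars as pl

merged = None
for fn in sorted(glob.glob("test*.biobank.tsv.gz")):
    # unique (chr, pos) pairs from the current file
    df = pl.read_csv(fn, separator='\t', schema_overrides={"#chrom": pl.String},
                     n_threads=8, columns=["#chrom", "pos"],
                     new_columns=["chr", "pos"]).unique()
    if merged is None:
        merged = df
    elif not merged.equals(df):
        # stack and deduplicate only when the frames differ
        merged = merged.vstack(df).unique()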
But since all the input files are already sorted on the two columns (chr, pos), and the differences are only in the thousands out of 16M input rows, I hope there is a better way to do this.
Thank you in advance for your help.
DK
Edit:
There is another way to do it, using Polars and DuckDB. First, write the unique (chr, pos) pairs from each file to Parquet:
import glob

import polars as pl

tsv_pattern = "gwas_*.gz"
for fn in glob.glob(tsv_pattern):
    print(fn)
    parquet_fn = fn.replace(".gz", ".chr_pos.parquet")
    # keep only the two columns of interest and deduplicate within the file
    df = pl.read_csv(fn, separator='\t', schema_overrides={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).unique()
    df.write_parquet(parquet_fn, compression='zstd')
    del df
Then deduplicate across all the Parquet files in DuckDB:
CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet';
Credit goes to Mark Mytherin from DuckDB.
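For completeness, that statement can also be run from Python through the duckdb package (a sketch; the COPY step and the chr_pos.parquet output file are additions of mine):

import duckdb

con = duckdb.connect()
# deduplicate (chr, pos) across every per-file Parquet in one pass
con.execute("CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet'")
# optionally export the distinct pairs back to a single Parquet file
con.execute("COPY my_table TO 'chr_pos.parquet' (FORMAT PARQUET)")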
You can use glob patterns to read the CSVs and then call unique:
(pl.scan_csv("**/*.csv")
   .unique()
   .collect())
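For the gzip-compressed TSVs from the question, the same glob idea can be combined with the read_csv arguments used above (a sketch; the *.biobank.tsv.gz pattern is an assumption, and the eager read_csv is used here since scan_csv may not handle gzipped input):

import glob

import polars as pl

frames = [
    pl.read_csv(fn, separator='\t', schema_overrides={"#chrom": pl.String},
                columns=["#chrom", "pos"], new_columns=["chr", "pos"]).unique()
    for fn in glob.glob("*.biobank.tsv.gz")
]
# concatenate the per-file unique pairs and deduplicate across files
unique_pairs = pl.concat(frames).unique()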