
Why can I read a file with pandas but not with polars?


I have a CSV (or rather TSV) I got from stripping the header off a gVCF with

bcftools view foo.g.vcf -H > foo.g.vcf.csv

Running head on the result gives me this, so everything looks as expected so far:

chr1H   1       .       T       <*>     0       .       END=1000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   1001    .       T       <*>     0       .       END=1707        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   1708    .       C       <*>     0       .       END=1763        GT:GQ:MIN_DP:PL 0/0:6:2:0,6,59
chr1H   1764    .       T       <*>     0       .       END=2000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   2001    .       A       <*>     0       .       END=3000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   3001    .       G       <*>     0       .       END=4000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   4001    .       T       <*>     0       .       END=5000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   5001    .       T       <*>     0       .       END=6000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   6001    .       A       <*>     0       .       END=7000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H   7001    .       G       <*>     0       .       END=8000        GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0

When I now try to read the file as a dataframe in a Jupyter Notebook like this

import polars as pl

df = pl.read_csv("foo.g.vcf.csv", has_header=False,
                 new_columns=["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE"],
                 separator="\t")

I get a compute error "Original error: remaining bytes non-empty". However, when I do

import pandas as pd
import polars as pl

df = pd.read_csv("foo.g.vcf.csv", header=None, sep="\t",
                 names=["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE"])
df = pl.DataFrame(df)

everything works as intended.

Why can I read with pandas without problems and convert to polars, but not read with polars directly?

The other VCF I want to compare with, which I stripped the same way, works with polars.


Solution

  • It looks like the file might contain trailing whitespace or blank lines, hence the error:

    Original error: remaining bytes non-empty
    

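    Polars is stricter than pandas about file contents. By default, pl.read_csv infers each column's dtype from only the first infer_schema_length rows (100 by default) and raises an error when a later value cannot be parsed as that dtype. Pandas is more lenient and will typically fall back to a wider dtype instead of failing.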

    You can use this command to delete lines that are empty or contain only whitespace (note that -i edits the file in place):

    sed -i '/^\s*$/d' foo.g.vcf.csv
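
Before editing the file in place, it may be worth checking whether blank lines or trailing whitespace are actually present. The following is a minimal sketch using only the standard library; the helper name find_suspicious_lines is hypothetical, and the sample rows are adapted from the question with defects added on purpose.

```python
import io


def find_suspicious_lines(lines, expected_fields=10, separator="\t"):
    """Return (line_number, reason) pairs for lines that may trip a strict CSV reader."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")  # drop only the line terminator
        if line.strip() == "":
            problems.append((number, "blank or whitespace-only line"))
        elif line != line.rstrip():
            problems.append((number, "trailing whitespace"))
        elif len(line.split(separator)) != expected_fields:
            problems.append((number, "unexpected number of fields"))
    return problems


# In-memory stand-in for foo.g.vcf.csv: row 2 ends with a space, row 3 is blank.
sample = io.StringIO(
    "chr1H\t1\t.\tT\t<*>\t0\t.\tEND=1000\tGT:GQ:MIN_DP:PL\t0/0:1:0:0,0,0\n"
    "chr1H\t1001\t.\tT\t<*>\t0\t.\tEND=1707\tGT:GQ:MIN_DP:PL\t0/0:1:0:0,0,0 \n"
    "\n"
)
report = find_suspicious_lines(sample)
print(report)
```

For the real file, the same helper can be called on an open file handle, for example with open("foo.g.vcf.csv") as handle: find_suspicious_lines(handle). An empty report would suggest that whitespace is not the cause and that the dtype inference is the more likely culprit.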
    

    However, I recommend telling Polars to infer the schema from the whole file instead, by passing this argument to pl.read_csv:

    infer_schema_length=None
    

    Or you can tell Polars to ignore parsing errors (I do not recommend this, because values that fail to parse are replaced with nulls, but it is an option) with:

    ignore_errors=True