I have a CSV (or rather a TSV) that I got by stripping the header off a gVCF with
bcftools view foo.g.vcf -H > foo.g.vcf.csv
Running head on it gives me this, so everything looks as expected so far:
chr1H 1 . T <*> 0 . END=1000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 1001 . T <*> 0 . END=1707 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 1708 . C <*> 0 . END=1763 GT:GQ:MIN_DP:PL 0/0:6:2:0,6,59
chr1H 1764 . T <*> 0 . END=2000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 2001 . A <*> 0 . END=3000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 3001 . G <*> 0 . END=4000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 4001 . T <*> 0 . END=5000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 5001 . T <*> 0 . END=6000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 6001 . A <*> 0 . END=7000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
chr1H 7001 . G <*> 0 . END=8000 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
When I now try to read the file as a dataframe in a Jupyter Notebook like this
import polars as pl
df = pl.read_csv("foo.g.vcf.csv", has_header=False,
                 new_columns=["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE"],
                 separator="\t")
I get a compute error: "Original error: remaining bytes non-empty". However, when I do
import pandas as pd
import polars as pl
df = pd.read_csv("foo.g.vcf.csv", header=None, sep="\t",
                 names=["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE"])
df = pl.DataFrame(df)
everything works as intended.
Why can I read with pandas without problems and convert to polars, but not read with polars directly?
The other VCF I want to compare with, which I stripped the same way, works with polars.
It looks like the file might contain trailing whitespace or empty lines, hence the error:
Original error: remaining bytes non-empty
Polars is stricter than pandas about file formatting: pandas will quietly work around irregular input, whereas Polars raises an error.
You can use this command to remove empty lines and lines that contain only whitespace:
sed -i '/^\s*$/d' foo.g.vcf.csv
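Before editing the file in place, it may help to check whether such lines exist at all. The following is only a rough sketch using the Python standard library; the function name scan_tsv and the expected_fields parameter are made up for illustration, and the default of 10 fields assumes the single-sample column layout shown in the question.

```python
from collections import Counter


def scan_tsv(path, expected_fields=10):
    """Count lines that a strict parser might reject (hypothetical helper).

    Returns (counts per issue kind, first line number seen per issue kind).
    """
    issues = Counter()
    first_seen = {}
    # newline="" keeps the original line endings so we can strip them ourselves
    with open(path, newline="") as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            if line.strip() == "":
                kind = "blank"  # empty or whitespace-only line
            elif line != line.rstrip():
                kind = "trailing_whitespace"  # spaces or tabs at the end
            elif len(line.split("\t")) != expected_fields:
                kind = "field_count"  # unexpected number of columns
            else:
                continue
            issues[kind] += 1
            first_seen.setdefault(kind, lineno)
    return dict(issues), first_seen


# Example call (assumes the file from the question exists):
# counts, first_seen = scan_tsv("foo.g.vcf.csv")
# print(counts, first_seen)
```

If both dictionaries come back empty, stray whitespace is probably not the cause, and the schema inference option is the more likely fix.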
However, I recommend telling Polars to infer the schema from the whole file, rather than from only the first rows, by passing this to read_csv:
infer_schema_length=None
Alternatively, you can tell Polars to ignore parsing errors (I do not recommend this, but it is an option) with:
ignore_errors=True