python, python-polars

Issue with reading a CSV file with all columns as string using polars


I have the Python code below, which uses polars. I do not want polars to automatically parse values as dates or integers unless I explicitly ask for it. Passing schema_overrides does not prevent the automatic conversion either.

import polars as pl

# Read the CSV file with all columns as strings using schema_overrides
file_path = "./xyz.csv"
df = pl.read_csv(file_path, schema_overrides={'*': pl.Utf8})

# Display the DataFrame
print(df)

I get the error below:

polars.exceptions.ComputeError: could not parse p35038 as dtype i64 at column 'Employee ID' (column number 3)


Solution

  • This is what infer_schema=False is for.

    From the read_csv documentation: "When False, the schema is not inferred and will be pl.String if not specified in schema or schema_overrides."

    import polars as pl

    # Default behaviour: the schema is inferred, so the columns become i64.
    pl.read_csv(b"""a,b,c
    1,2,3""")
    
    # shape: (1, 3)
    # ┌─────┬─────┬─────┐
    # │ a   ┆ b   ┆ c   │
    # │ --- ┆ --- ┆ --- │
    # │ i64 ┆ i64 ┆ i64 │
    # ╞═════╪═════╪═════╡
    # │ 1   ┆ 2   ┆ 3   │
    # └─────┴─────┴─────┘
    
    # With infer_schema=False, every column is read as a string.
    pl.read_csv(b"""a,b,c
    1,2,3""", infer_schema=False)
    
    # shape: (1, 3)
    # ┌─────┬─────┬─────┐
    # │ a   ┆ b   ┆ c   │
    # │ --- ┆ --- ┆ --- │
    # │ str ┆ str ┆ str │
    # ╞═════╪═════╪═════╡
    # │ 1   ┆ 2   ┆ 3   │
    # └─────┴─────┴─────┘

    Note that the "*" key in your schema_overrides is taken literally as a column name; it is not treated as a wildcard.