python python-polars read-csv

Polars read_csv ignore_errors: what to do if you can't ignore them?


Using polars.read_csv on a large data set fails because of a field delimiter issue. Setting ignore_errors=True skips the erroneous records, but I have no idea whether one or thousands of records were ignored. Is there a way to pipe the bad records to a separate "bad records" file, or to report the number of ignored rows?

I wish the world was simple enough for data to support single character column delimiters, but that hasn't happened yet - why doesn't pandas/pyarrow/polars support multi character field delimiters?


Solution

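  • As far as I know, Polars doesn't provide a mechanism to pipe the bad records to a separate file or to report the number of ignored rows when ignore_errors is used. You could do it manually before handing the file to Polars, although I don't know if it's what you want. The sketch below treats a record as bad when its field count differs from the header's, writes good and bad records to separate files, and counts the bad ones:

    import csv
    import polars as pl
    
    csv_file = "path/to/your/file.csv"
    good_file = "path/to/good_records.csv"
    bad_file = "path/to/bad_records.csv"
    delimiter = ","
    
    ignored_rows = 0
    with open(csv_file, newline="") as src, \
            open(good_file, "w", newline="") as good, \
            open(bad_file, "w", newline="") as bad:
        reader = csv.reader(src, delimiter=delimiter)
        good_writer = csv.writer(good, delimiter=delimiter)
        bad_writer = csv.writer(bad, delimiter=delimiter)
    
        header = next(reader)  # assumes the first row is a well-formed header
        good_writer.writerow(header)
    
        for row in reader:
            if len(row) == len(header):
                good_writer.writerow(row)
            else:
                # Wrong number of fields: divert the record to the bad file
                bad_writer.writerow(row)
                ignored_rows += 1
    
    print(f"Number of ignored rows: {ignored_rows}")
    
    # The cleaned file should now load without ignore_errors
    df = pl.read_csv(good_file, separator=delimiter)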
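
    To get a quick feel for whether one or thousands of records are malformed, it can help to count how many records have each number of fields before loading anything. The following is only a rough sketch using the standard library: field_count_histogram is a hypothetical helper name, and the sample data is made up to stand in for the real file.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines, delimiter=","):
    """Return a Counter mapping 'number of fields' -> 'number of records'."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(len(row) for row in reader)


# Tiny in-memory sample standing in for the real file (made-up data).
sample = io.StringIO("id,name,city\n1,Ann,Oslo\n2,Bob,Paris,extra\n3,Cy,Rome\n")
histogram = field_count_histogram(sample)
print(histogram.most_common())  # [(3, 3), (1 record with 4 fields)] -> [(3, 3), (4, 1)]
```

    For a real file, pass a handle opened with open(csv_file, newline="") instead of the StringIO object. The most common field count is usually the expected one, and every other entry shows how many records deviate from it.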
    

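    Regarding your second question: pandas does support multi-character delimiters through the "sep" parameter of pandas.read_csv(). A separator longer than one character (other than '\s+') is interpreted as a regular expression and forces the slower Python parsing engine, so regex metacharacters have to be escaped. Polars' read_csv, as far as I know, only accepts a single-byte separator. For example, for a file delimited by "||":

    import pandas as pd

    df = pd.read_csv(csv_file, sep=r"\|\|", engine="python")  # "||" escaped because sep is treated as a regex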
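
    For readers that only take a single-character delimiter, one possible workaround is to rewrite the file with a standard delimiter first. This is a minimal sketch, not a library feature: convert_delimiter is a hypothetical helper, "||" is an assumed delimiter, and it assumes one record per line with no quoted fields that contain the old delimiter.

```python
import csv
import io


def convert_delimiter(src, dst, old_delimiter, new_delimiter=","):
    """Copy records from src to dst, replacing a multi-character delimiter.

    Returns the number of records written. Assumes one record per line and
    no quoted fields that contain old_delimiter.
    """
    writer = csv.writer(dst, delimiter=new_delimiter)
    count = 0
    for line in src:
        fields = line.rstrip("\r\n").split(old_delimiter)
        # csv.writer quotes any field that contains new_delimiter
        writer.writerow(fields)
        count += 1
    return count


# Made-up sample using "||" as the delimiter, with a comma inside one field.
raw = io.StringIO("id||name||note\n1||Ann||likes a, b\n")
converted = io.StringIO()
written = convert_delimiter(raw, converted, "||")
print(converted.getvalue())
```

    With real files, open both handles with newline="" so the csv module controls the line endings. The converted file can then be passed to a single-character-delimiter reader such as polars.read_csv.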