csv, rust, python-polars, rust-polars, oscilloscope

Selecting with indexing is an anti-pattern in Polars: how do I parse and transform (select/filter?) a CSV that seems to require it?


I would like to read the following CSV (quite broken, IMO) with Polars; it comes from a Rigol MSO5000 oscilloscope:

D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
+2.470000E+02,2.000000E+00,,
+1.590000E+02,1.600000E+01,,
+2.400000E+02,2.000000E+00,,
+2.470000E+02,+1.300000E+02,,
+1.590000E+02,1.800000E+01,,
+2.470000E+02,+1.300000E+02,,
9.500000E+01,1.800000E+01,,
9.500000E+01,1.800000E+01,,
+2.400000E+02,0.000000E+00,,
(...)

Here's my current Jupyter Notebook iteration/attempt, from before I found out that selecting with indexing is discouraged in Polars:

import polars as pl

df = pl.read_csv("normal0.csv")

# Grab initial condition and increments
t0 = df.columns[2]
assert "t0" in t0
t0 = float(t0.split('=')[1].replace('s', '').strip())

tinc = df.columns[3]
assert "tInc" in tinc
tinc = float(tinc.split('=')[1].strip())

# TODO: Generate Series() from t0+tinc and include it in the final DataFrame

# Reshape and cleanup
probes = (
    df.with_columns(
        df["D7-D0"].cast(pl.Float32).alias("D0-D7 floats"),
        df["D15-D8"].cast(pl.Float32).alias("D15-D8 floats"),
    )
    .drop(df.columns[2], df.columns[3])
)
(...)

def split_probes(probes: pl.DataFrame):
    '''Split bundles of probe cables such as D7-D0 or D15-D8 into
    individual dataframe columns.
    '''
    out = pl.DataFrame(schema=["D" + str(line) for line in range(16)])
#    for row in range(probes.height):
#        for probe in range(8):
#            out["D" + str(probe)].with_columns(probes["D0-D7 floats"][row % (probe + 1)])
#        for probe in reversed(range(8, 16)):  # TODO: Fix future captures or parametrise this
#            out["D" + str(probe)] = probes["D15-D8 floats"][row % probe]

And here's my lower-level CSV-parsing Rust pseudocode approach, from when I was told on the Polars Discord that this problem might not be optimally solvable with dataframe libraries:

use csv;
use std::error::Error;
use std::io;
use std::process;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct OrigOscilloscope {
    #[serde(rename = "D7-D0")]
    d7_d0: String, // TODO: Unfortunately those fields are "user-flippable" in order from the scope, i.e: d0_d7 vs d7_d0
    #[serde(rename = "D15-D8")]
    d15_d8: String,
    // Do not even register those on Serde as they are empty rows anyway
    // t0: Option<String>,
    // t_inc: Option<String>
}

#[derive(Debug, Deserialize)]
struct LAProbesOscilloscopeState {
    //Vec<d0...d15>: Vec<Vec<16*f32>>, // TODO: More appropriate struct representation for the target dataframe
    d0: f32,
    d1: f32,
    d2: f32,
    d3: f32,
    d4: f32,
    d5: f32,
    d6: f32,
    d7: f32,
    d8: f32,
    d9: f32,
    d10: f32,
    d11: f32,
    d12: f32,
    d13: f32,
    d14: f32,
    d15: f32,
    timestamp: f32
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true) // ignore broken header
        .from_reader(io::stdin());

    // Get timeseries information from header... :facepalm: rigol
    // Initial timestamp...
    let header = rdr.headers()?.clone();
    let t0_header: Vec<&str> = header[2].split('=').collect();
    let t0 = t0_header[1].trim_start().replace('s', "").parse::<f32>()?;
    // ...and increments
    let tinc_header: Vec<&str> = header[3].split('=').collect();
    let tinc = tinc_header[1].trim_start().parse::<f32>()?;
    println!("Initial timestamp {t0} with increments of {tinc} seconds");

    // Now do the splitting of wires from its Dx-Dy "bundles"
    let mut timestamp = t0;
    for result in rdr.deserialize().skip(1) {
        let row: OrigOscilloscope = result?;

        // if rdr.position().line().rem_euclid(8) == 0 {
        // // Read the D0-D15 field(s), expanding them into the current row, matching its column
        // }
        // println!("{:#?}", row.d7_d0.parse::<f32>()?);

        // Update the timestamp for this row
        timestamp += tinc;
    }
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

I hope that it's clear that the target dataframe desired would look like this:

d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15, timestamp
95, 95, 247, 159, 247, 240, 159, 247 (...)                            -25.01 + n*0.000002
(...)

And the resulting code should (ideally?) use Polars efficiently, while avoiding "selecting with indexing" since, as mentioned above, it might be deprecated entirely in a future Polars release.


Solution

  • Perhaps you are looking for unstack (which is incredibly performant).

    Let's start with this data (which simply replicates values that you provided). I've also changed the column names, just to make things easier to inspect:

    import polars as pl
    from io import StringIO
    
    normal_csv = """D7-D0,D15-D8,t0 = -25.01s, tInc = 2e-06,
    +2.470000E+02,2.000000E+00,,
    +1.590000E+02,1.600000E+01,,
    +2.400000E+02,2.000000E+00,,
    +2.470000E+02,+1.300000E+02,,
    +1.590000E+02,1.800000E+01,,
    +2.470000E+02,+1.300000E+02,,
    9.500000E+01,1.800000E+01,,
    9.500000E+01,1.800000E+01,,
    +2.470000E+02,2.000000E+00,,
    +1.590000E+02,1.600000E+01,,
    +2.400000E+02,2.000000E+00,,
    +2.470000E+02,+1.300000E+02,,
    +1.590000E+02,1.800000E+01,,
    +2.470000E+02,+1.300000E+02,,
    9.500000E+01,1.800000E+01,,
    9.500000E+01,1.800000E+01,,"""
    
    
    df = pl.read_csv(source=StringIO(normal_csv),
                     new_columns=('D', 'E', 't0', 'tInc'))
    df
    
    shape: (16, 5)
    ┌───────┬───────┬──────┬──────┬──────┐
    │ D     ┆ E     ┆ t0   ┆ tInc ┆      │
    │ ---   ┆ ---   ┆ ---  ┆ ---  ┆ ---  │
    │ f64   ┆ f64   ┆ str  ┆ str  ┆ str  │
    ╞═══════╪═══════╪══════╪══════╪══════╡
    │ 247.0 ┆ 2.0   ┆ null ┆ null ┆ null │
    │ 159.0 ┆ 16.0  ┆ null ┆ null ┆ null │
    │ 240.0 ┆ 2.0   ┆ null ┆ null ┆ null │
    │ 247.0 ┆ 130.0 ┆ null ┆ null ┆ null │
    │ 159.0 ┆ 18.0  ┆ null ┆ null ┆ null │
    │ 247.0 ┆ 130.0 ┆ null ┆ null ┆ null │
    │ 95.0  ┆ 18.0  ┆ null ┆ null ┆ null │
    │ 95.0  ┆ 18.0  ┆ null ┆ null ┆ null │
    │ 247.0 ┆ 2.0   ┆ null ┆ null ┆ null │
    │ 159.0 ┆ 16.0  ┆ null ┆ null ┆ null │
    │ 240.0 ┆ 2.0   ┆ null ┆ null ┆ null │
    │ 247.0 ┆ 130.0 ┆ null ┆ null ┆ null │
    │ 159.0 ┆ 18.0  ┆ null ┆ null ┆ null │
    │ 247.0 ┆ 130.0 ┆ null ┆ null ┆ null │
    │ 95.0  ┆ 18.0  ┆ null ┆ null ┆ null │
    │ 95.0  ┆ 18.0  ┆ null ┆ null ┆ null │
    └───────┴───────┴──────┴──────┴──────┘
    

    Since I'm not familiar with your data source, I'm going to assume that your input data has a regularized pattern: specifically, that D7-D0 values are always provided (i.e., no skipped lines in the CSV file).
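    Since the reshape trusts that assumption, it may be worth a sanity check first: `unstack(step=8)` consumes exactly eight input rows per output row, so a truncated capture would silently misalign samples. A minimal sketch (the 16-row frame here is a stand-in for the one read above):

    ```python
    import polars as pl

    # Stand-in for the frame read above: 16 sample rows, 2 bundle columns
    df = pl.DataFrame({"D": [247.0] * 16, "E": [2.0] * 16})

    step = 8
    # A remainder here would mean dropped or misaligned samples,
    # so fail early rather than reshape bad data
    assert df.height % step == 0, f"{df.height} rows is not a multiple of {step}"

    wide = df.reverse().unstack(step=step, how="horizontal", columns=("D", "E"))
    ```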

    If so, here's some highly performant code that should get the ball rolling...

    (
        df
        .reverse()
        .unstack(step=8,
                 how='horizontal',
                 columns=('D', 'E'),
                 )
    )
    
    shape: (2, 16)
    ┌──────┬──────┬───────┬───────┬───────┬───────┬───────┬───────┬──────┬──────┬───────┬──────┬───────┬─────┬──────┬─────┐
    │ D_0  ┆ D_1  ┆ D_2   ┆ D_3   ┆ D_4   ┆ D_5   ┆ D_6   ┆ D_7   ┆ E_0  ┆ E_1  ┆ E_2   ┆ E_3  ┆ E_4   ┆ E_5 ┆ E_6  ┆ E_7 │
    │ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---  ┆ ---  ┆ ---   ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ --- │
    │ f64  ┆ f64  ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64  ┆ f64  ┆ f64   ┆ f64  ┆ f64   ┆ f64 ┆ f64  ┆ f64 │
    ╞══════╪══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪══════╪══════╪═══════╪══════╪═══════╪═════╪══════╪═════╡
    │ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
    │ 95.0 ┆ 95.0 ┆ 247.0 ┆ 159.0 ┆ 247.0 ┆ 240.0 ┆ 159.0 ┆ 247.0 ┆ 18.0 ┆ 18.0 ┆ 130.0 ┆ 18.0 ┆ 130.0 ┆ 2.0 ┆ 16.0 ┆ 2.0 │
    └──────┴──────┴───────┴───────┴───────┴───────┴───────┴───────┴──────┴──────┴───────┴──────┴───────┴─────┴──────┴─────┘
    

    This doesn't have your desired column names (which you can change). Nor does it contain the timing information. But it may help you get started in the right direction.
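    For completeness, here is one hedged sketch of those last two steps: renaming to the desired `d0`..`d15` and attaching a timestamp column derived from the header values. The mapping `D_0`→`d0` … `E_7`→`d15`, and the choice of giving each output row its first sample's timestamp, are my assumptions based on the desired output shown in the question:

    ```python
    import polars as pl

    # Hypothetical header values and the unstack step size
    t0, tinc, step = -25.01, 2e-06, 8

    # Stand-in for the unstacked frame produced above (2 output rows)
    wide = pl.DataFrame({
        **{f"D_{i}": [95.0, 95.0] for i in range(8)},
        **{f"E_{i}": [18.0, 18.0] for i in range(8)},
    })

    # D bundle becomes d0..d7, E bundle becomes d8..d15 (assumed mapping)
    renamed = wide.rename(
        {f"D_{i}": f"d{i}" for i in range(8)}
        | {f"E_{i}": f"d{i + 8}" for i in range(8)}
    )

    # Each output row spans `step` input samples, so its (first) timestamp
    # advances by step * tinc per row
    result = renamed.with_columns(
        (t0 + pl.int_range(pl.len()) * step * tinc).alias("timestamp")
    )
    ```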