
One-hot-encoding while loading data with arrow-rs


In my Rust project I load documents from MongoDB and deserialize them into serde_json Values:

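match cursor.deserialize_current() {
    Ok(d) => {
        // Convert the deserialized document into a serde_json::Value
        let doc = serde_json::to_value(&d).unwrap();
        doc_vec.push(doc);
    }
    // Error arm added only to make the snippet complete
    Err(e) => eprintln!("failed to deserialize document: {e}"),
}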

After that I create an arrow RecordBatch using the decoder:

let mut decoder = ReaderBuilder::new(schema.clone()).build_decoder().unwrap();
if !doc_vec.is_empty() {
    // Buffer the JSON values, then flush them into a single RecordBatch
    decoder.serialize(&doc_vec).unwrap();
    let batch = decoder.flush().unwrap().unwrap();
}

My schema is:

let schema = Schema::new(vec![
    Field::new("Amount", DataType::Float32, false),
    Field::new(
        "Country",
        DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
        false,
    ),
]);

The code fails with:

called `Result::unwrap()` on an `Err` value: NotYetImplemented("Support for Dictionary(UInt16, Utf8) in JSON reader")

I want the country to be one-hot encoded when I send it to a pyarrow client via Arrow Flight, where it is then converted to a pandas DataFrame.

How should I continue from here? I'm quite new to all of these technologies.


Solution

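  • A workaround is to declare the column as Utf8 in the schema, read it as such, and then use the cast kernel (arrow::compute::cast) to convert it to the dictionary-encoded type.

    As I understand it, though, one-hot encoding is different from dictionary encoding. Dictionary encoding stores one integer key per row plus a list of the distinct values, whereas one-hot encoding produces a separate boolean column for each distinct value. You could get one-hot encoded boolean columns by using the comparison kernels, comparing the column against each distinct "Country" value.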
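To make the difference between the two encodings concrete, here is a minimal sketch in plain Rust. It deliberately does not use the arrow-rs API. The function names `dictionary_encode` and `one_hot_encode` and the sample country codes are made up for illustration. With arrow-rs, the per-value equality check inside `one_hot_encode` is the step the comparison kernels would perform on the Utf8 column.

```rust
use std::collections::BTreeMap;

/// Dictionary encoding: one integer key per row plus the list of distinct values.
/// (Hypothetical helper, not part of arrow-rs.)
fn dictionary_encode(column: &[&str]) -> (Vec<u16>, Vec<String>) {
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<u16> = Vec::with_capacity(column.len());
    for &item in column {
        // Reuse the key of an already-seen value, otherwise append a new one.
        let key = match values.iter().position(|v| v.as_str() == item) {
            Some(i) => i,
            None => {
                values.push(item.to_string());
                values.len() - 1
            }
        };
        keys.push(key as u16);
    }
    (keys, values)
}

/// One-hot encoding: one boolean column per distinct value, built by comparing
/// every row against that value. (Hypothetical helper, not part of arrow-rs.)
fn one_hot_encode(column: &[&str]) -> BTreeMap<String, Vec<bool>> {
    let mut out: BTreeMap<String, Vec<bool>> = BTreeMap::new();
    for &distinct in column {
        // Compute the boolean column only the first time a value is seen.
        out.entry(distinct.to_string())
            .or_insert_with(|| column.iter().map(|&v| v == distinct).collect());
    }
    out
}

fn main() {
    // Sample data, assumed for illustration only.
    let country = ["DE", "US", "DE", "FR"];

    let (keys, values) = dictionary_encode(&country);
    println!("dictionary keys: {:?}, values: {:?}", keys, values);

    for (name, flags) in one_hot_encode(&country) {
        println!("Country_{}: {:?}", name, flags);
    }
}
```

The dictionary form keeps a single column, while the one-hot form yields as many boolean columns as there are distinct countries. In the Arrow case each of those would become its own Boolean field in the RecordBatch schema sent to the client.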