Tags: sorting, rust, apache-arrow, rust-arrow2

How can I order an arrow2 Chunk by a given column in Rust?


I have loaded a large arrow2 Chunk with many columns/arrays, and before writing it to a parquet file I would like to order it by a given column. Consider this code:

fn main(){
    use arrow2::{array::*, compute::sort};
    use arrow2::chunk::Chunk;

    let mut col1: Int64Vec = Int64Vec::new();
    col1.push(Some(0));
    col1.push(Some(5));
    col1.push(Some(3));
    col1.push(Some(2));

    let mut col2: Int64Vec = Int64Vec::new();
    col2.push(Some(1));
    col2.push(Some(2));
    col2.push(Some(3));
    col2.push(Some(4));

    let mut chu = Chunk::new(vec![col1.into_arc(), col2.into_arc()]);

    chu.sort_by_key();

}

Obviously this fails, since it doesn't know which column to sort by, but I have been unable to use any of the .sort_* functions either. I would like to sort 'chu' by the first column.

I have tried to write the key-extraction function for '.sort_by_key', but no luck. I have also searched Google and asked Gemini about it, without success.


Solution

  • TL;DR: use the 'lexsort' function. It is a full-fledged version of the simpler 'sort' function.

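    At first glance, the name suggests text ordering (upper vs. lower case, special characters, and such), but that is not what it does. "lexsort" is short for lexicographic sort: rows are ordered by the first sort column, with ties broken by the next one, and so on.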

    On another note, if you just want to save your columns to a parquet file, as I did, consider using parquet's own column sorting option in 'WriterProperties'.
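
    To reorder every column of the chunk, and not only the key column, the usual pattern is to compute the sorting permutation from the key column and then apply it to each column. As far as I recall, arrow2 exposes this as 'lexsort_to_indices' (in compute::sort) followed by 'take' (in compute::take), but check the signatures for your version. The sketch below only illustrates the mechanism with plain std types: the 'Column' alias and the helper names are hypothetical stand-ins, not arrow2 API.

```rust
use std::cmp::Ordering;

/// A column of nullable 64-bit integers.
/// Hypothetical stand-in for an arrow2 Int64 array.
type Column = Vec<Option<i64>>;

/// Returns the permutation that sorts `key` in ascending order, nulls last.
/// `sort_by` is stable, so rows with equal keys keep their original order.
fn sort_to_indices(key: &[Option<i64>]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..key.len()).collect();
    indices.sort_by(|&a, &b| match (key[a], key[b]) {
        (Some(x), Some(y)) => x.cmp(&y),
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    });
    indices
}

/// Builds a new column by picking the rows of `column` in the order given by `indices`.
fn take(column: &[Option<i64>], indices: &[usize]) -> Column {
    indices.iter().map(|&i| column[i]).collect()
}

/// Reorders all columns by the values of the column at `key_index`.
/// Panics if `key_index` is out of range or the columns differ in length.
fn sort_columns_by(columns: &[Column], key_index: usize) -> Vec<Column> {
    let indices = sort_to_indices(&columns[key_index]);
    columns.iter().map(|c| take(c, &indices)).collect()
}
```

    With the data from the question, sorting by column 0 yields the permutation [0, 3, 2, 1], so the first column becomes 0, 2, 3, 5 and the second becomes 1, 4, 3, 2.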