stringsortingruststructbam

Sort vec of structs by custom (natural?) order of String field and numeric field


I am looking for suggestions on how to sort a vector of structs based on a custom order for a String field (unknown at compile time). If the strings are equal, then a numeric value is used for further sorting. (The application is sorting by chromosome, then by position, in genetics.)

The order of the strings will be based on the input provided by the user (the order of chromosomes in the bam file).

The code below works to sort on String and position, omitting parts I believe to be unnecessary. I also implement PartialOrd, PartialEq and Eq. In the actual code I then call genotypes_vec.sort_unstable() to get a sorted vector.

struct Genotype {
    chrom: String,
    position: u32,

}

impl Ord for Genotype {
    fn cmp(&self, other: &Self) -> Ordering {
        (self.chrom.clone(), &self.position, ).cmp(&(
            other.chrom.clone(),
            &other.position,
        ))

Now this will sort strings/chromosomes as chr1, chr10, chr2 (alphabetical sort), while the desired order would be chr1, chr2, chr10. Therefore note that a satisfying solution would also be a natural sort of strings, but I haven't been able to implement that either. I found https://docs.rs/human-sort/latest/human_sort/, but I am not sure how to apply that sort for the strings, while still sorting for the position if self.chrom == other.chrom

The code below does not entirely match the requirements, as I cannot find a way to pass the desired order of strings to the cmp() function - it only accepts two arguments

impl Ord for Genotype {
    fn cmp(&self, other: &Self, chrom_order: Vec<String>) -> Ordering {
        let chrom_order = vec!["chr1".to_string(), "chr10".to_string()];
        let chrom_test = chrom_order
            .iter()
            .position(|x| x == &self.chrom)
            .unwrap()
            .cmp(&chrom_order.iter().position(|x| x == &other.chrom).unwrap());
        if chrom_test == Ordering::Equal {
            // If same chromosome
            return self.start.cmp(&other.start);
        }
        chrom_test
    }
}
method `cmp` has 3 parameters but the declaration in trait `std::cmp::Ord::cmp` has 2
`cmp` from trait: `fn(&Self, &Self) -> std::cmp::Ordering`rustcE0050

Generating the chrom_order on the fly every time cmp() is called seems undesirable too, and probably also wouldn't work - as that would also require passing arguments.


Solution

  • cmp and human_sort::compare both return an Ordering. The standard library provides composition methods for Ordering.

    So you should be able to write something like:

    human_sort::compare(&self.chrom, &other.chrom)
        .then(self.position.cmp(&other.position))
    

    That will order the entries based on their chrom and break tie based on position.

    Though you also need to derive PartialEq and Eq, and implement PartialOrd (a trivial implementation works fine but a derivation I don't think does).

    Full example: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5a3d648b9b15b351b6e9646a58e6da60