How can I calculate the numeric distance of two sequences under the requirement that sequence elements may only be swapped with adjacent elements? Insertion, and deletion are not allowed. The distance function must determine the number of swap operations necessary to reach one string from the other. Furthermore, the distance function is not defined for sequences of different lengths, or if the sequences don't share the same symbols. Lastly, you can assume that there are no duplicates in a sequence.
Based on my research, I would need something like the Damerau–Levenshtein distance, but I don't know how to modify the distance function correctly.
For the sequence ABC
, the only two valid transformations with distance 1
are BAC
or ACB
.
use std::collections::HashSet;
use std::hash::Hash;
fn dist<T: PartialEq + Eq + Hash>(lhs: &[T], rhs: &[T]) -> Option<usize> {
// Distance function is not defined for differing lengths
if lhs.len() != rhs.len() {
return None;
}
let lhs_c: HashSet<_> = lhs.iter().collect();
let rhs_c: HashSet<_> = rhs.iter().collect();
// Distance function is not defined if an element occurs more than once in each sequence
if lhs.len() != lhs_c.len() || rhs.len() != rhs_c.len() {
return None;
}
// Distance function is not defined if the sequences don't share all elements
if lhs_c != rhs_c {
return None;
}
// Calculate the edit distance of `lhs` and `rhs` with only adjacent transpositions.
todo!()
}
#[test]
fn dist_is_none_for_differing_lengths() {
assert_eq!(None, dist(&["b", "a", "c"], &["a", "b", "c", "d"]));
}
#[test]
fn dist_is_none_with_duplicates() {
assert_eq!(None, dist(&["b", "a", "c", "a"], &["a", "b", "c", "d"]));
assert_eq!(None, dist(&["a", "b", "c", "d"], &["a", "b", "c", "c"]));
}
#[test]
fn dist_is_none_with_non_empty_difference() {
assert_eq!(None, dist(&["a", "b", "c"], &["d", "e", "f"]));
}
#[test]
fn dist_bac_to_abc_is_one() {
assert_eq!(Some(1), dist(&["b", "a", "c"], &["a", "b", "c"]));
}
#[test]
fn dist_bca_to_bac_is_one() {
assert_eq!(Some(1), dist(&["b", "c", "a"], &["b", "a", "c"]));
}
#[test]
fn dist_cab_to_abc_is_two() {
assert_eq!(Some(2), dist(&["c", "a", "b"], &["a", "b", "c"]));
}
#[test]
fn dist_cdab_to_abcd_is_four() {
assert_eq!(Some(4), dist(&["c", "d", "a", "b"], &["a", "b", "c", "d"]));
}
You can create a mapping from the elements to numbers in such a way that when it is applied to one of the sequences, it will be sorted. Then you can apply this mapping to the other sequence. After this, you can calculate the number of swaps required to bubble sort the second sequence: (plaground)
use std::collections::HashMap;
use core::hash::Hash;
use std::collections::HashSet;
fn create_mapping<T: Eq + Hash>(seq: &[T]) -> HashMap<&T, usize> {
let mut ret = HashMap::new();
for (i, elem) in seq.iter().enumerate() {
ret.insert(elem, i);
}
ret
}
fn dist<T: Eq + Hash>(lhs: &[T], rhs: &[T]) -> Option<usize> {
// Distance function is not defined for differing lengths
if lhs.len() != rhs.len() {
return None;
}
let lhs_c: HashSet<_> = lhs.iter().collect();
let rhs_c: HashSet<_> = rhs.iter().collect();
// Distance function is not defined if an element occurs more than once in each sequence
if lhs.len() != lhs_c.len() || rhs.len() != rhs_c.len() {
return None;
}
// Distance function is not defined if the sequences don't share all elements
if lhs_c != rhs_c {
return None;
}
// Calculate the edit distance of `lhs` and `rhs` with only adjacent transpositions.
let mapping = create_mapping(lhs);
let rhs_mapped: Vec<usize> = rhs.iter().map(|e|mapping[e]).collect();
Some(bubble_sort_length(&rhs_mapped))
}
fn bubble_sort_length(seq: &[usize]) -> usize {
let mut ret = 0;
for i in 0..seq.len() {
for j in (i+1)..seq.len() {
if seq[i]>seq[j] {
ret += 1;
}
}
}
ret
}
#[test]
fn dist_is_none_for_differing_lengths() {
assert_eq!(None, dist(&["b", "a", "c"], &["a", "b", "c", "d"]));
}
#[test]
fn dist_is_none_with_duplicates() {
assert_eq!(None, dist(&["b", "a", "c", "a"], &["a", "b", "c", "d"]));
assert_eq!(None, dist(&["a", "b", "c", "d"], &["a", "b", "c", "c"]));
}
#[test]
fn dist_is_none_with_non_empty_difference() {
assert_eq!(None, dist(&["a", "b", "c"], &["d", "e", "f"]));
}
#[test]
fn dist_bac_to_abc_is_one() {
assert_eq!(Some(1), dist(&["b", "a", "c"], &["a", "b", "c"]));
}
#[test]
fn dist_bca_to_bac_is_one() {
assert_eq!(Some(1), dist(&["b", "c", "a"], &["b", "a", "c"]));
}
#[test]
fn dist_cab_to_abc_is_two() {
assert_eq!(Some(2), dist(&["c", "a", "b"], &["a", "b", "c"]));
}
#[test]
fn dist_cdab_to_abcd_is_four() {
assert_eq!(Some(4), dist(&["c", "d", "a", "b"], &["a", "b", "c", "d"]));
}
This algorithm will have the time complexity of O(n²)
where n
is the length of the sequences. There is probably some better way to implement the bubble_sort_length
function (see https://www.geeksforgeeks.org/inversion-count-in-array-using-merge-sort/ which claims to have time complexity O(n*log(n))
)