As an educational exercise, I'm looking at porting cvs-fast-export to Rust.
Its basic mode of operation is to parse a number of CVS master files into a intermediate form, and then to analyse the intermediate form with the goal of transforming it into a git fast-export stream.
One of the things that is done when parsing is to convert common parts of the intermediate form into a canonical representation. A motivating example is commit authors. A CVS repository may have hundreds of thousands of individual file commits, but probably less than a thousand authors. So an interning table is used when parsing where you input the author as you parse it from the file and it will give you a pointer to a canonical version, creating a new one if it hasn't seen it before. (I've heard this called atomizing or interning too). This pointer then gets stored on the intermediate object.
My first attempt to do something similar in Rust attempted to use a HashSet
as the interning table. Note this uses CVS version numbers rather than authors, this is just a sequence of digits such as 1.2.3.4, represented as a Vec
.
use std::collections::HashSet;
use std::hash::Hash;
#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Vec<u16>);
fn intern<T:Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> &T {
let dupe = item.clone();
if !set.contains(&item) {
set.insert(item);
}
set.get(&dupe).unwrap()
}
fn main() {
let mut set: HashSet<CvsNumber> = HashSet::new();
let c1 = CvsNumber(vec![1, 2]);
let c2 = intern(&mut set, c1);
let c3 = CvsNumber(vec![1, 2]);
let c4 = intern(&mut set, c3);
}
This fails with error[E0499]: cannot borrow 'set' as mutable more than once at a time
. This is fair enough, HashSet
doesn't guarantee references to its keys will be valid if you add more items after you have obtained a reference. The C version is careful to guarantee this. To get this guarantee, I think the HashSet
should be over Box<T>
. However I can't explain the lifetimes for this to the borrow checker.
The ownership model I am going for here is that the interning table owns the canonical versions of the data, and hands out references. The references should be valid as long the interning table exists. We should be able to add new things to the interning table without invalidating the old references. I think the root of my problem is that I'm confused how to write the interface for this contract in a way consistent with the Rust ownership model.
Solutions I see with my limited Rust knowledge are:
HashSet
on the first pass, then freeze it and use references on the second pass. This means additional temporary storage (sometimes substantial).Does anyone have a better idea?
I somewhat disagree with @Shepmaster on the use of unsafe
here.
While right now it does not cause issue, should someone decide in the future to change the use of HashSet
to include some pruning (for example, to only ever keep a hundred authors in there), then unsafe
will bite you sternly.
In the absence of a strong performance reason, I would simply use a Rc<XXX>
. You can alias it easily enough: type InternedXXX = Rc<XXX>;
.
use std::collections::HashSet;
use std::hash::Hash;
use std::rc::Rc;
#[derive(PartialEq, Eq, Debug, Hash, Clone)]
struct CvsNumber(Rc<Vec<u16>>);
fn intern<T:Eq + Hash + Clone>(set: &mut HashSet<T>, item: T) -> T {
if !set.contains(&item) {
let dupe = item.clone();
set.insert(dupe);
item
} else {
set.get(&item).unwrap().clone()
}
}
fn main() {
let mut set: HashSet<CvsNumber> = HashSet::new();
let c1 = CvsNumber(Rc::new(vec![1, 2]));
let c2 = intern(&mut set, c1);
let c3 = CvsNumber(Rc::new(vec![1, 2]));
let c4 = intern(&mut set, c3);
}