Rust lifetime scoping in structs


I'm porting a string tokenizer that I wrote in Python to Rust, and I've run into a problem with lifetimes and structs that I can't get past.

The process is basically:

  1. Get an array of files
  2. Convert each file to a Vec<String> of tokens
  3. Use a Counter and UniCase to count the occurrences of each token in each Vec
  4. Save that count in a struct, along with some other data
  5. (Future) Do some processing on the set of structs to accumulate the total data alongside the per-file data

struct Corpus<'a> {
    words: Counter<UniCase<&'a String>>,
    parts: Vec<CorpusPart<'a>>
}

pub struct CorpusPart<'a> {
    percent_of_total: f32,
    word_count: usize,
    words: Counter<UniCase<&'a String>>
}

fn process_file(entry: &DirEntry) -> CorpusPart {
    let mut contents = read_to_string(entry.path())
        .expect("Could not load contents.");

    let tokens = tokenize(&mut contents);
    let counted_words = collect(&tokens);

    CorpusPart {
        percent_of_total: 0.0,
        word_count: tokens.len(),
        words: counted_words
    }
}

pub fn tokenize(normalized: &mut String) -> Vec<String> {
    // snip ...
}

pub fn collect(results: &Vec<String>) -> Counter<UniCase<&'_ String>> {
    results.iter()
        .map(|w| UniCase::new(w))
        .collect::<Counter<_>>()
}

However, when I try to return the CorpusPart, the compiler complains that it references the local variable tokens. How can/should I deal with this? I tried adding lifetime annotations, but couldn't figure it out...

Essentially, I no longer need the Vec<String> itself, but I do need some of the Strings that were in it for the counter.

Any help is appreciated, thank you!


Solution

  • The issue here is that you are throwing away the Vec<String> while still referencing the elements inside it. If you no longer need the Vec<String> but still need some of its contents, you have to transfer ownership of those Strings to something else.

    I assume you want Corpus and CorpusPart to both point to the same Strings, so you are not duplicating Strings needlessly. If that is the case, either Corpus or CorpusPart must own the Strings, and the one that doesn't own them holds references to the Strings owned by the other. (It sounds more complicated than it actually is.)

    I will assume CorpusPart owns the Strings, and Corpus just points to them:

    use std::fs::DirEntry;
    use std::fs::read_to_string;
    
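    // Minimal stand-in for the real UniCase, so the example compiles on its own
    pub struct UniCase<T> {
        test: T
    }
    
    impl<T> UniCase<T> {
        fn new(item: T) -> UniCase<T> {
            UniCase {
                test: item
            }
        }
    }
    
    // Stand-in for the real Counter type
    type Counter<T> = Vec<T>;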
    
    struct Corpus<'a> {
        words: Counter<UniCase<&'a String>>, // Will reference the strings in CorpusPart (I assume you implemented this elsewhere)
        parts: Vec<CorpusPart>
    }
    
    pub struct CorpusPart {
        percent_of_total: f32,
        word_count: usize,
        words: Counter<UniCase<String>> // Has ownership of the strings
    }
    
    fn process_file(entry: &DirEntry) -> CorpusPart {
        let mut contents = read_to_string(entry.path())
            .expect("Could not load contents.");
    
        let tokens = tokenize(&mut contents);
        let length = tokens.len(); // Cache the length, as tokens will no longer be valid once passed to collect
        let counted_words = collect(tokens);
    
        CorpusPart {
            percent_of_total: 0.0,
            word_count: length,
            words: counted_words
        }
    }
    
    pub fn tokenize(_normalized: &mut String) -> Vec<String> {
        Vec::new() // Placeholder body; the real tokenizer is elided in the question
    }
    
    pub fn collect(results: Vec<String>) -> Counter<UniCase<String>> {
        results.into_iter() // Use into_iter() to consume the Vec that is passed in, and take ownership of the internal items
            .map(|w| UniCase::new(w))
            .collect::<Counter<_>>()
    }
    
    

    I aliased Counter to Vec (and stubbed out UniCase) because I don't know which Counter you are using.
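
To see the same ownership transfer with counting that actually works, here is a minimal sketch. It assumes a plain `HashMap<String, usize>` as a stand-in for `Counter<UniCase<String>>`, and it approximates case-insensitive keys by ASCII-lowercasing each token in place; `WordCounts` and `make_part` are hypothetical names, not part of either crate.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for Counter<UniCase<String>>: the keys are owned
// Strings, so the map never borrows from the token Vec.
type WordCounts = HashMap<String, usize>;

pub struct CorpusPart {
    pub percent_of_total: f32,
    pub word_count: usize,
    pub words: WordCounts,
}

// Takes the Vec by value, so each String is moved out of it and into the map.
pub fn collect(tokens: Vec<String>) -> WordCounts {
    let mut counts = WordCounts::new();
    for mut token in tokens {
        // Assumption: ASCII lowercasing is a rough substitute for UniCase.
        token.make_ascii_lowercase();
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}

// Hypothetical helper: builds a part from already-tokenized input.
pub fn make_part(tokens: Vec<String>) -> CorpusPart {
    let word_count = tokens.len(); // read the length before `tokens` is moved
    CorpusPart {
        percent_of_total: 0.0,
        word_count,
        words: collect(tokens),
    }
}

fn main() {
    let tokens = vec!["The".to_string(), "cat".to_string(), "the".to_string()];
    let part = make_part(tokens);
    // prints "3 tokens, 2 distinct"
    println!("{} tokens, {} distinct", part.word_count, part.words.len());
}
```

Because `collect` takes the Vec by value, nothing in the returned CorpusPart refers to a local variable, and the struct needs no lifetime parameter.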

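
For the future step of accumulating totals, one possible shape is for the totals to borrow their keys from parts that are owned elsewhere. The sketch below assumes the parts live in a separate Vec and that Corpus only holds a slice of them, which differs from the `parts: Vec<CorpusPart>` field above; a struct that both owns the parts and references their Strings would be self-referential, which is hard to build in safe Rust. `from_parts` is a hypothetical constructor and the `HashMap` again stands in for the real Counter.

```rust
use std::collections::HashMap;

// Hypothetical, trimmed-down CorpusPart: it owns its word counts.
pub struct CorpusPart {
    pub words: HashMap<String, usize>,
}

// The totals borrow their keys from the parts, so 'a ties a Corpus to the
// slice of parts it was built from.
pub struct Corpus<'a> {
    pub words: HashMap<&'a str, usize>,
    pub parts: &'a [CorpusPart],
}

impl<'a> Corpus<'a> {
    // Sums the per-file counts without cloning any String.
    pub fn from_parts(parts: &'a [CorpusPart]) -> Self {
        let mut words: HashMap<&'a str, usize> = HashMap::new();
        for part in parts {
            for (word, count) in &part.words {
                *words.entry(word.as_str()).or_insert(0) += *count;
            }
        }
        Corpus { words, parts }
    }
}

fn main() {
    let parts = vec![
        CorpusPart { words: HashMap::from([("the".to_string(), 2), ("cat".to_string(), 1)]) },
        CorpusPart { words: HashMap::from([("the".to_string(), 1)]) },
    ];
    let corpus = Corpus::from_parts(&parts);
    // prints "the = 3"
    println!("the = {}", corpus.words["the"]);
}
```

The borrow checker then guarantees that `parts` outlives `corpus`, which is the relationship the `'a` on Corpus is meant to express.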