
What is the most efficient way to read the first line of a file separately from the rest of the file?


I am trying to figure out the best way to read the contents of a file. The problem is that I need to read the first line separately, because it has to be parsed as a usize that gives the dimension of an ndarray Array2.

I tried the following:

use ndarray::prelude::*;
use std::io::{BufRead, BufReader};
use std::fs;


fn read_inputfile(geom_filename: &str) -> (Vec<i32>, Array2<f64>, usize) {
    //* Step 1: Read the coord data from input
    println!("Inputfile: {geom_filename}");

    let geom_file = fs::File::open(geom_filename).expect("Geometry file not found!");
    let geom_file_reader = BufReader::new(geom_file);
    let geom_file_lines: Vec<String> = geom_file_reader
        .lines()
        .map(|line| line.expect("Failed to read line!"))
        .collect();

    //* Read no of atoms first for array size
    let no_atoms: usize = geom_file_lines[0].parse().unwrap();

    let mut Z_vals: Vec<i32> = Vec::new();
    let mut geom_matr: Array2<f64> = Array2::zeros((no_atoms, 3));

    for (atom_idx, line) in geom_file_lines[1..].iter().enumerate() {
        //* into_iter would do the same
        let line_split: Vec<&str> = line.split_whitespace().collect();

        Z_vals.push(line_split[0].parse().unwrap());

        (0..3).for_each(|cart_coord| {
            geom_matr[(atom_idx, cart_coord)] = line_split[cart_coord + 1].parse().unwrap();
        });
    }

    (Z_vals, geom_matr, no_atoms)
}

Does this not kind of defeat the purpose of the BufReader? I am still relatively new to Rust, so I might have misunderstood something, but I thought that one uses a BufReader so that the whole file does not need to be read into memory.

With the Vec<String> for geom_file_lines I am most likely loading the whole file into memory again, right?


Solution

  • Does this not kind of defeat the purpose of the BufReader?

    It very much does, yes. lines() gives you an iterator, so you can read the lines one at a time without holding all of them in memory at once. Calling collect() forces them all into memory, though.

    Simply don't do that. Use the iterator as an iterator, especially since you convert the Vec back into an iterator later anyway, via geom_file_lines[1..].iter().

    Like this:

    use ndarray::prelude::*;
    use std::fs;
    use std::io::{BufRead, BufReader};
    
    pub fn read_inputfile(geom_filename: &str) -> (Vec<i32>, Array2<f64>, usize) {
        //* Step 1: Read the coord data from input
        println!("Inputfile: {geom_filename}");
    
        let geom_file = fs::File::open(geom_filename).expect("Geometry file not found!");
        let geom_file_reader = BufReader::new(geom_file);
        let mut geom_file_lines = geom_file_reader
            .lines()
            .map(|line| line.expect("Failed to read line!"));
    
        //* Read no of atoms first for array size
        let no_atoms: usize = geom_file_lines.next().unwrap().parse().unwrap();
    
        let mut z_vals: Vec<i32> = Vec::new();
        let mut geom_matr: Array2<f64> = Array2::zeros((no_atoms, 3));
    
        for (atom_idx, line) in geom_file_lines.enumerate() {
            let line_split: Vec<&str> = line.split_whitespace().collect();
    
            z_vals.push(line_split[0].parse().unwrap());
    
            (0..3).for_each(|cart_coord| {
                geom_matr[(atom_idx, cart_coord)] = line_split[cart_coord + 1].parse().unwrap();
            });
        }
    
        (z_vals, geom_matr, no_atoms)
    }
    

    You can apply the same logic inside your for loop and consume split_whitespace() as an iterator instead of collecting it into a Vec<&str>:

        for (atom_idx, line) in geom_file_lines.enumerate() {
            let mut line_split = line.split_whitespace();
    
            z_vals.push(line_split.next().unwrap().parse().unwrap());
    
            (0..3).for_each(|cart_coord| {
                geom_matr[(atom_idx, cart_coord)] = line_split.next().unwrap().parse().unwrap();
            });
        }
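
The snippets above still panic on malformed input, and indexing geom_matr by atom_idx would go out of bounds if the file held more lines than the header declares (a trailing blank line, for instance). A minimal sketch of one way to handle both, under a few assumptions: parse_geometry is a hypothetical helper name, it accepts any BufRead so it can be tried without a file, it returns a Vec<[f64; 3]> in place of the Array2 to stay free of external crates, and it reports errors through Box<dyn Error> rather than unwrap().

```rust
use std::error::Error;
use std::io::BufRead;

/// Hypothetical helper: parses "atom count, then one atom per line" input
/// from any buffered reader, one line at a time.
/// Returns the atomic numbers and one [x, y, z] row per atom.
fn parse_geometry<R: BufRead>(
    reader: R,
) -> Result<(Vec<i32>, Vec<[f64; 3]>), Box<dyn Error>> {
    let mut lines = reader.lines();

    // Pull only the first line out of the iterator; it holds the atom count.
    let header = lines.next().ok_or("empty input")??;
    let no_atoms: usize = header.trim().parse()?;

    let mut z_vals: Vec<i32> = Vec::with_capacity(no_atoms);
    let mut coords: Vec<[f64; 3]> = Vec::with_capacity(no_atoms);

    // take(no_atoms) stops after the declared number of atoms, so extra
    // trailing lines are never parsed or indexed.
    for line in lines.take(no_atoms) {
        let line = line?;
        let mut fields = line.split_whitespace();

        // First field: atomic number.
        let z = fields.next().ok_or("missing atomic number")?;
        z_vals.push(z.parse::<i32>()?);

        // Next three fields: Cartesian coordinates.
        let mut row = [0.0_f64; 3];
        for value in row.iter_mut() {
            let field = fields.next().ok_or("missing coordinate")?;
            *value = field.parse::<f64>()?;
        }
        coords.push(row);
    }

    // Report a file that ends before all declared atoms were read.
    if z_vals.len() != no_atoms {
        return Err("fewer atom lines than declared".into());
    }
    Ok((z_vals, coords))
}
```

For a real file, the reader would be BufReader::new(fs::File::open(geom_filename)?), and the rows could be written straight into Array2::zeros((no_atoms, 3)) as in the code above. Whether to return a Result or keep the unwrap() calls is a design choice; the streaming behaviour is the same either way.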