I am trying to speed up my calculations, and I am not sure how to best use `.into_par_iter()` or some of the `Zip::` options from the `ndarray` crate to "correctly" parallelize the computation. I have tried two approaches, but I think that, due to my misunderstanding of `Mutex`, the whole array gets locked on every access, which only adds overhead and makes the calculations slower than before.
```rust
//* Step 2.1: Calculate the overlap matrix S
let S_matr_tmp = Array2::<f64>::zeros((n, n));
let S_matr_mutex = Mutex::new(S_matr_tmp);

(0..n).into_par_iter().for_each(|i| {
    (0..=i).into_par_iter().for_each(|j| {
        let mut s = S_matr_mutex.lock().unwrap();
        if i == j {
            s[(i, j)] = 1.0;
        } else {
            s[(i, j)] = calc_overlap_int_cgto(
                &self.mol.wfn_total.basis_set_total.basis_set_cgtos[i],
                &self.mol.wfn_total.basis_set_total.basis_set_cgtos[j],
            );
            s[(j, i)] = s[(i, j)];
        }
    })
});

self.mol.wfn_total.HF_Matrices.S_matr = S_matr_mutex.into_inner().unwrap();
```
```rust
//* D^0 matrix */
//* Trying to parallelize it */
let D_matr_mutex = Mutex::new(Array2::<f64>::zeros((no_cgtos, no_cgtos)));

Zip::indexed(C_matr_AO_basis.axis_iter(Axis(0))).par_for_each(|mu, row1| {
    Zip::indexed(C_matr_AO_basis.outer_iter()).par_for_each(|nu, row2| {
        let mut d = D_matr_mutex.lock().unwrap();
        let slice1 = row1.slice(s![..no_occ_orb]);
        let slice2 = row2.slice(s![..no_occ_orb]);
        d[(mu, nu)] = slice1.dot(&slice2);
    });
});

let mut D_matr = D_matr_mutex.into_inner().unwrap();
```
```rust
let F_matr_mutex = Mutex::new(F_matr);

(0..no_cgtos).into_par_iter().for_each(|mu| {
    (0..no_cgtos).into_par_iter().for_each(|nu| {
        (0..no_cgtos).into_par_iter().for_each(|lambda| {
            (0..no_cgtos).into_par_iter().for_each(|sigma| {
                let mu_nu_lambda_sigma =
                    calc_ijkl_idx(mu + 1, nu + 1, lambda + 1, sigma + 1);
                let mu_lambda_nu_sigma =
                    calc_ijkl_idx(mu + 1, lambda + 1, nu + 1, sigma + 1);
                let mut f = F_matr_mutex.lock().unwrap();
                f[(mu, nu)] += D_matr[(lambda, sigma)]
                    * (2.0 * self.mol.wfn_total.HF_Matrices.ERI_arr1[mu_nu_lambda_sigma]
                        - self.mol.wfn_total.HF_Matrices.ERI_arr1[mu_lambda_nu_sigma]);
            });
        });
    });
});

let F_matr = F_matr_mutex.into_inner().unwrap();
```
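The problem shared by all three attempts can be distilled into a minimal std-only sketch (hypothetical names; a flat `Vec` stands in for `Array2`, `std::thread` for Rayon): every closure takes the same `Mutex` before it does any work, so the "parallel" threads run one at a time and pay lock overhead on top.

```rust
use std::sync::Mutex;
use std::thread;

fn main() {
    let n = 4;
    // One Mutex guarding the whole "matrix" (flattened, row-major).
    let matr = Mutex::new(vec![0.0_f64; n * n]);

    thread::scope(|s| {
        for i in 0..n {
            let matr = &matr;
            s.spawn(move || {
                for j in 0..n {
                    // Every write contends on the same lock, so the
                    // threads effectively execute serially.
                    let mut m = matr.lock().unwrap();
                    m[i * n + j] = (i * n + j) as f64;
                }
            });
        }
    });

    let m = matr.into_inner().unwrap();
    assert_eq!(m[1 * n + 1], 5.0); // element (1, 1) of the 4x4 matrix
}
```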
Any help on how to parallelize this code "correctly", and more information on how to use `Mutex` properly, would be appreciated.
Thanks to @adamreichold I was able to understand my problem better and then "correctly" parallelized my code, which led to a noticeable speed-up of my calculations.
Now to the explanation:
- Rayon's parallel iterators cannot simply replace nested `for` loops "as is".
- `ndarray` can be used in the "classical" way with `for` loops, but it seems to be more efficient in the functional-programming style, i.e. applying e.g. a `for_each` directly on an object instead of looping over indices and then accessing fields with those indices.
- Use the `Zip::` methods.

"Better" version for example 1:
```rust
//* New version: par_for_each and indexed */
self.mol.wfn_total.HF_Matrices.S_matr = Array2::<f64>::zeros((n, n));

Zip::indexed(&mut self.mol.wfn_total.HF_Matrices.S_matr).par_for_each(|(i, j), s_val| {
    if i == j {
        *s_val = 1.0;
    } else if i >= j {
        //* Calculate only the lower triangle of the matrix */
        *s_val = calc_overlap_int_cgto(
            &self.mol.wfn_total.basis_set_total.basis_set_cgtos[i],
            &self.mol.wfn_total.basis_set_total.basis_set_cgtos[j],
        );
    } else {
        *s_val = 0.0;
    }
});
```
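The same lock-free idea can be sketched with only the standard library (hypothetical, simplified: `Vec` rows instead of `Array2`, `thread::scope` instead of Rayon): each thread gets exclusive `&mut` access to a disjoint row, so no `Mutex` is needed. This is essentially what `Zip::par_for_each` arranges per element.

```rust
use std::thread;

fn main() {
    let n = 4;
    let mut matr = vec![0.0_f64; n * n];

    thread::scope(|s| {
        // chunks_mut hands out disjoint &mut row slices, so each
        // thread can write its own row without any synchronization.
        for (i, row) in matr.chunks_mut(n).enumerate() {
            s.spawn(move || {
                for (j, val) in row.iter_mut().enumerate() {
                    *val = if i == j { 1.0 } else { 0.0 };
                }
            });
        }
    });

    assert_eq!(matr[0], 1.0); // (0, 0): on the diagonal
    assert_eq!(matr[1], 0.0); // (0, 1): off the diagonal
}
```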
```rust
//* Assign the lower triangle to the upper triangle with slices -> larger chunks */
for i in 0..n - 1 {
    let slice = self
        .mol
        .wfn_total
        .HF_Matrices
        .S_matr
        .slice(s![i + 1..n, i])
        .to_shared();
    self.mol
        .wfn_total
        .HF_Matrices
        .S_matr
        .slice_mut(s![i, i + 1..n])
        .assign(&slice);
}
```
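The mirroring step can be checked in isolation with a flat `Vec` standing in for `Array2` (hypothetical, simplified): for each column `i`, copy the sub-column below the diagonal into the matching sub-row to the right of the diagonal.

```rust
fn main() {
    let n = 3;
    // Lower triangle filled, upper triangle still zero (row-major):
    // 1 0 0
    // 2 4 0
    // 3 5 6
    let mut m = vec![1.0, 0.0, 0.0, 2.0, 4.0, 0.0, 3.0, 5.0, 6.0];

    // Mirror the lower triangle into the upper triangle,
    // column i -> row i, as the slice-based loop above does.
    for i in 0..n - 1 {
        for j in i + 1..n {
            m[i * n + j] = m[j * n + i];
        }
    }

    // The matrix is now symmetric.
    assert_eq!(m, vec![1.0, 2.0, 3.0, 2.0, 4.0, 5.0, 3.0, 5.0, 6.0]);
}
```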
Small explanation of the code:

- `Zip::indexed` needs a `&mut arr` to index the array, and `par_for_each` goes through all of its elements with a parallel iterator.
- `(i, j)` are the two indices that are needed to index into `basis_set_cgtos` (which is a `Vec`).

The rest of the examples can be parallelized analogously (code in my GitHub repo: SCF_MRJD_RS).
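As a sketch of what "analogously" means for the density-matrix example (hypothetical names, std-only, a flat `Vec` instead of `Array2`): `D[(mu, nu)]` is the dot product of rows `mu` and `nu` of `C` restricted to the occupied columns, and parallelizing over rows of `D` gives each thread a disjoint `&mut` row, so no lock is required.

```rust
use std::thread;

fn main() {
    let n = 2; // number of basis functions
    let no_occ = 1; // number of occupied orbitals
    // Hypothetical 2x2 coefficient matrix C, row-major.
    let c = vec![1.0_f64, 0.0, 2.0, 3.0];
    let mut d = vec![0.0_f64; n * n];

    thread::scope(|s| {
        // Each thread owns one row of D exclusively.
        for (mu, d_row) in d.chunks_mut(n).enumerate() {
            let c = &c;
            s.spawn(move || {
                for nu in 0..n {
                    // Dot product over the occupied columns only.
                    d_row[nu] = (0..no_occ)
                        .map(|k| c[mu * n + k] * c[nu * n + k])
                        .sum();
                }
            });
        }
    });

    // D = C_occ · C_occᵀ with C_occ = [[1], [2]]:
    assert_eq!(d, vec![1.0, 2.0, 2.0, 4.0]);
}
```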