I'm running a parallel Matlab job on a single node of a remote cluster. Each node of the cluster has 2 processors with 24 cores each, for a total of 48 cores per node. The job contains some sequential code followed by a single parfor loop. I run it using a slurm bash script.
The bash script test.sh is:
#!/bin/bash
#
########## Begin Slurm header ##########
#
# Give job a reasonable name
#SBATCH -J test_1
#
# Request number of nodes and CPU cores per node for job
#SBATCH --nodes=1
# Request number of tasks/processes per node
# (determines number of workers in process-based parpool)
#SBATCH --tasks-per-node=48
# Estimated wallclock time for job
#SBATCH -t 1-00
#
# Send mail when job begins, aborts and ends
#SBATCH --mail-type=ALL
#
########### End Slurm header ##########
echo "Submit Directory: $SLURM_SUBMIT_DIR"
echo "Working Directory: $PWD"
echo "Running on host $HOSTNAME"
echo "Job id: $SLURM_JOB_ID"
echo "Job name: $SLURM_JOB_NAME"
echo "Number of nodes allocated to job: $SLURM_JOB_NUM_NODES"
echo "Number of cores allocated to job: $SLURM_NPROCS"
echo "Number of requested tasks per node: $SLURM_NTASKS_PER_NODE"
# Load module
module load math/matlab/R2020a
# Create a local working directory on scratch
mkdir -p $SCRATCH/$SLURM_JOB_ID
# Start a Matlab program
matlab -nodisplay -batch test_1 > test_1.out 2>&1
# Cleanup local working directory
rm -rf $SCRATCH/$SLURM_JOB_ID
exit
The Matlab script is
% Create parallel pool
pc = parcluster('local');
pc.JobStorageLocation = strcat(getenv('SCRATCH'),'/',getenv('SLURM_JOB_ID'));
num_workers = str2double(getenv('SLURM_NPROCS'));
parpool(pc,num_workers);
% Body of the script
% Choose deterministic parameters
free_points = 845000;
pulse_points = 1300000;
dt = 2e-11;
num_freqs = 200;
freqs = linspace(-1,1,num_freqs);
rhoi = rand(72);
rhoi = rhoi + rhoi';
rhoi = rhoi/trace(rhoi);
% Iterate over random parameters
num_pars = 5;
res = zeros(num_pars,num_freqs);
for n=1:num_pars
    disp('=====');
    disp(['N = ',num2str(n)]);
    disp('=====');
    timer = tic;
    % Random parameters
    H = rand(size(rhoi));
    H = (H + H')/2;
    L1 = rand(size(rhoi));
    L2 = rand(size(rhoi));
    L3 = rand(size(rhoi));
    L4 = rand(size(rhoi));
    L5 = rand(size(rhoi));
    % Equation to solve
    ME = @(rhot, t, w) -1i*w*(H*rhot - rhot*H) + (L1*rhot*L1' - (1/2)*rhot*L1'*L1 - (1/2)*L1'*L1*rhot) ...
        + (L2*rhot*L2' - (1/2)*rhot*L2'*L2 - (1/2)*L2'*L2*rhot) ...
        + (L3*rhot*L3' - (1/2)*rhot*L3'*L3 - (1/2)*L3'*L3*rhot) ...
        + (L4*rhot*L4' - (1/2)*rhot*L4'*L4 - (1/2)*L4'*L4*rhot) ...
        + (L5*rhot*L5' - (1/2)*rhot*L5'*L5 - (1/2)*L5'*L5*rhot);
    % Solve equation
    % IF I CHANGE TO 'for j = 1:1', ALL WORKERS ARE USED!!! MEMORY?
    for j = 1:free_points
        rhoi = RK4(@(rho, t) ME(rho, t, 0), rhoi, j, dt);
    end
    t = toc(timer);
    disp(['Mid duration ',num2str(t),'s']);
    parfor k=1:num_freqs
        w = freqs(k);
        rhop = rhoi;
        for j=1:pulse_points
            rhop = RK4(@(rho, t) ME(rho, t, w), rhop, j, dt);
        end
        for j=1:free_points
            rhop = RK4(@(rho, t) ME(rho, t, 0), rhop, j, dt);
        end
        occ(k) = rhop(1,1);
    end
    % Store result
    res(n,:) = occ;
end
save('res','res');
% Delete the parallel pool
delete(gcp('nocreate'));
% Local functions
function [rho] = RK4(F, rho, k, h)
    k1 = F(rho, k*h);
    k2 = F(rho+h*k1/2, (k+1/2)*h);
    k3 = F(rho+h*k2/2, (k+1/2)*h);
    k4 = F(rho+h*k3, (k+1)*h);
    rho = rho+(1/6)*h*(k1+2*k2+2*k3+k4);
end
The slurm output is
#
# SOME PERSONAL INFO HERE...
#
Number of nodes allocated to job: 1
Number of cores allocated to job: 48
Number of requested tasks per node: 48
IMPORTANT: The MATLAB Academic site license is available to employees and
enrolled students of the the universities of (CENSORED).
The license is available for teaching or research only.
Commercial applications are not permitted.
and the Matlab output is
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 48).
=====
N = 1
=====
Mid duration 3608.9535s
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 12).
#
# REST OF OUTPUT HERE...
#
You see that when the Matlab script starts, a pool of 48 workers is created. But then as the parfor loop finally starts, parpool restarts and the number of workers gets downgraded to 12.
I noticed that this only happens if the size of the loops is sufficiently large, even the non-parfor loops. For instance, if I change the size of the first for loop to 1, then parpool does not restart. So I think it may have to do with memory usage somehow...?
Any idea what is happening and how I can get Matlab to use all 48 cores that were allocated?
EDIT: Another thing I've tried is to remove the parpool command and specify the cluster in the parfor loop as parfor (k=1:num_freqs,pc). When I do this Matlab uses one fourth of the workers no matter the size of my loop. I'll just try to contact the admins directly...
I bet your parallel pool is timing out during the sequential code that runs before each parfor loop. By default a pool shuts down after 30 minutes of idle time, and your log shows about 3609 s of sequential work before the parfor loop starts. The pool then gets auto-created with size 12, as that is the default preference for "preferred number of workers in a parallel pool" (doc). (Personally, I don't much care for that preference, and always set the value to 99999 and let other things control the size of the pool, but in your case you might not be able to if your SLURM workers don't share a MATLAB preferences directory (prefdir) with your client.)
I suggest you create your pool of size 48 with an IdleTimeout of Inf, like this:
num_workers = str2double(getenv('SLURM_NPROCS'));
parpool(pc,num_workers,'IdleTimeout',Inf);
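If you want to be defensive about it, you could also check the pool right before the parfor loop and recreate it when it is missing or has the wrong size. This is only a minimal sketch: ensure_pool is a hypothetical helper name (not a MATLAB built-in), and it assumes pc and num_workers are defined as in your script.

```matlab
% Hypothetical helper (not part of MATLAB): place it with the other
% local functions at the end of the script.
function pool = ensure_pool(pc, num_workers)
    % gcp('nocreate') returns an empty pool object if no pool is running
    pool = gcp('nocreate');
    if isempty(pool) || pool.NumWorkers ~= num_workers
        % Shut down a wrongly sized pool; same idiom as the cleanup
        % line at the end of the script, harmless when pool is empty
        delete(pool);
        % Recreate the pool with the requested size and no idle timeout
        pool = parpool(pc, num_workers, 'IdleTimeout', Inf);
    end
end
```

You would then call ensure_pool(pc,num_workers); on the line just before parfor k=1:num_freqs. With IdleTimeout set to Inf at creation the check should never trigger, but it makes a silent downgrade to 12 workers impossible.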