Tags: bash, matlab, parallel-processing, slurm, parfor

Matlab parfor uses fewer cores than the allocated number of cores


I'm running a parallel Matlab job on a single node of a remote cluster. Each node of the cluster has 2 processors with 24 cores each, for a total of 48 cores per node. The job contains some sequential code followed by a single parfor loop. I run it using a slurm bash script.

The bash script test.sh is:

#!/bin/bash
#
########## Begin Slurm header ##########
#
# Give job a reasonable name
#SBATCH -J test_1
#
# Request number of nodes and CPU cores per node for job
#SBATCH --nodes=1
# Request number of tasks/processes per node
# (determines number of workers in a process-based parpool)
#SBATCH --tasks-per-node=48
# Estimated wallclock time for job
#SBATCH -t 1-00
#
# Send mail when job begins, aborts and ends
#SBATCH --mail-type=ALL
#
########### End Slurm header ##########

echo "Submit Directory:                     $SLURM_SUBMIT_DIR"
echo "Working Directory:                    $PWD"
echo "Running on host                       $HOSTNAME"
echo "Job id:                               $SLURM_JOB_ID"
echo "Job name:                             $SLURM_JOB_NAME"
echo "Number of nodes allocated to job:     $SLURM_JOB_NUM_NODES"
echo "Number of cores allocated to job:     $SLURM_NPROCS"
echo "Number of requested tasks per node:   $SLURM_NTASKS_PER_NODE"

# Load module
module load math/matlab/R2020a

#   Create a local working directory on scratch
mkdir -p $SCRATCH/$SLURM_JOB_ID

# Start a Matlab program
matlab -nodisplay -batch test_1 > test_1.out 2>&1

# Cleanup local working directory
rm -rf $SCRATCH/$SLURM_JOB_ID

exit

The Matlab script is

% Create parallel pool

pc = parcluster('local');

pc.JobStorageLocation = strcat(getenv('SCRATCH'),'/',getenv('SLURM_JOB_ID'));
 
num_workers = str2double(getenv('SLURM_NPROCS'));
parpool(pc,num_workers);

% Body of the script

% Choose deterministic parameters

free_points = 845000;
pulse_points = 1300000;
dt = 2e-11;

num_freqs = 200;
freqs = linspace(-1,1,num_freqs);

rhoi = rand(72);
rhoi = rhoi + rhoi';
rhoi = rhoi/trace(rhoi);

% Iterate over random parameters

num_pars = 5;
res = zeros(num_pars,num_freqs);
for n=1:num_pars

    disp('=====');
    disp(['N = ',num2str(n)]);
    disp('=====');

    timer = tic;

    % Random parameters

    H = rand(size(rhoi));
    H = (H + H')/2;

    L1 = rand(size(rhoi));
    L2 = rand(size(rhoi));
    L3 = rand(size(rhoi));
    L4 = rand(size(rhoi));
    L5 = rand(size(rhoi));

    % Equation to solve

    ME = @(rhot, t, w)  -1i*w*(H*rhot - rhot*H) + (L1*rhot*L1' - (1/2)*rhot*L1'*L1 - (1/2)*L1'*L1*rhot) ...
                                                + (L2*rhot*L2' - (1/2)*rhot*L2'*L2 - (1/2)*L2'*L2*rhot) ...
                                                + (L3*rhot*L3' - (1/2)*rhot*L3'*L3 - (1/2)*L3'*L3*rhot) ...
                                                + (L4*rhot*L4' - (1/2)*rhot*L4'*L4 - (1/2)*L4'*L4*rhot) ...
                                                + (L5*rhot*L5' - (1/2)*rhot*L5'*L5 - (1/2)*L5'*L5*rhot);

    % Solve equation

    % IF I CHANGE TO 'for j = 1:1', ALL WORKERS ARE USED!!! MEMORY?
    for j = 1:free_points
       rhoi =  RK4(@(rho, t) ME(rho, t, 0), rhoi, j, dt);
    end

    t = toc(timer);
    disp(['Mid duration ',num2str(t),'s']);

    parfor k=1:num_freqs
        w = freqs(k);
        
        rhop = rhoi;
        
        for j=1:pulse_points
            rhop = RK4(@(rho, t) ME(rho, t, w), rhop, j, dt);
        end
        
        for j=1:free_points
            rhop = RK4(@(rho, t) ME(rho, t, 0), rhop, j, dt);
        end

        occ(k) = rhop(1,1);
    end

    % Store result

    res(n,:) = occ;

end

save('res','res');

% Delete the parallel pool

delete(gcp('nocreate'));


% Local functions

function [rho] = RK4(F, rho, k, h)

k1 = F(rho, k*h);
k2 = F(rho+h*k1/2, (k+1/2)*h);
k3 = F(rho+h*k2/2, (k+1/2)*h);
k4 = F(rho+h*k3, (k+1)*h);

rho = rho+(1/6)*h*(k1+2*k2+2*k3+k4);

end

The slurm output is

#
# SOME PERSONAL INFO HERE...
#
Number of nodes allocated to job:     1
Number of cores allocated to job:     48
Number of requested tasks per node:   48
 IMPORTANT: The MATLAB Academic site license is available to employees and
 enrolled students of the universities of (CENSORED).
 The license is available for teaching or research only.
 Commercial applications are not permitted.

and the Matlab output is

Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 48).
=====
N = 1
=====
Mid duration 3608.9535s
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 12).

#
# REST OF OUTPUT HERE...
#

You can see that when the Matlab script starts, a pool of 48 workers is created. But then, as the parfor loop finally starts, parpool restarts and the number of workers is downgraded to 12.

I noticed that this only happens when the loops are sufficiently large, even the non-parfor ones. For instance, if I change the length of the first for loop to 1, parpool does not restart. So I suspect it may have to do with memory usage somehow...?

Any idea what is happening and how I can get Matlab to use all 48 cores that were allocated?

EDIT: Another thing I've tried is to remove the parpool command and specify the cluster in the parfor loop as parfor (k=1:num_freqs,pc). When I do this, Matlab uses a quarter of the workers no matter the size of my loops. I'll just try to contact the admins directly...


Solution

  • I bet your parallel pool is timing out between your serial code and the parfor loop. Your serial for loop takes roughly 3600 s, which is longer than the default pool IdleTimeout of 30 minutes, so the pool shuts itself down while that loop runs; this also explains why shortening the loop avoids the problem. When execution reaches the parfor loop, a pool then gets auto-created with 12 workers, because that is the default preference for "preferred number of workers in a parallel pool" (doc). (Personally, I don't much care for that preference, and always set the value to 99999, letting other things control the size of the pool; but in your case you might not be able to, if your SLURM workers don't share a MATLAB preferences directory (prefdir) with your client.)

    I suggest you create your pool of size 48 with an IdleTimeout of Inf, like this:

    num_workers = str2double(getenv('SLURM_NPROCS'));
    parpool(pc,num_workers,'IdleTimeout',Inf);
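    If you'd rather not rely on IdleTimeout alone, a defensive alternative (a sketch, untested on your cluster) is to check the current pool just before each parfor loop and re-create it if it has timed out or shrunk:

    % Re-create the pool if it timed out or shrank (hypothetical guard;
    % gcp('nocreate') returns the current pool without starting a new one)
    pool = gcp('nocreate');
    if isempty(pool) || pool.NumWorkers < num_workers
        delete(pool);   % no-op if pool is empty
        pool = parpool(pc, num_workers, 'IdleTimeout', Inf);
    end

    Placed before the parfor in your outer for loop, this guarantees the loop always runs on a full 48-worker pool, even if something else tears the pool down mid-job.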