I have a dataset of approximately 2000 digital images. I am using MATLAB to perform some digital image processing to extract trees from the imagery. The script is currently configured to process the images in a parfor loop on n cores.
The challenge:
I have access to processing time on a university-managed supercomputer with approximately 10,000 compute cores. If I submit the entire job for processing, I get placed so far back in the tasking queue that a desktop computer could finish the job before processing even starts on the supercomputer. The support staff have told me that partitioning the 2000-file dataset into ~100-file jobs will significantly decrease the queue time. What method can I use to perform the tasks in parallel using a parfor loop, while submitting 100 files (of the 2000) at a time?
My script is structured in the following way:
datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
fileIndex = find(~[files.isdir]);
parfor ix = 1:length(fileIndex)
    % Perform the processing on each file
end
Along the lines of my comment, I would suggest something like the following:
datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
files = files(~[files.isdir]);

% Split the file list into chunks of at most jobSize files
N = length(files); % e.g. 2000
jobSize = 100;
partSizes = [jobSize*ones(1, floor(N/jobSize)), mod(N, jobSize)];
partSizes = partSizes(partSizes > 0); % drop the empty chunk when N is an exact multiple of jobSize
jobFiles = mat2cell(files, partSizes);
jobNum = length(jobFiles);
% Provide each job to a worker
parfor jobIdx = 1:jobNum
    thisJob = jobFiles{jobIdx}; % indexing into the cell array lets MATLAB
                                % transfer only the relevant file data to each worker
    for fIdx = 1:length(thisJob)
        thisFile = thisJob(fIdx);
        % Perform the processing on each file
        disp(thisFile.name);
    end
end
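Note that a single parfor still runs as one job from the scheduler's point of view. If the goal is for each ~100-file chunk to sit in the queue independently, one option (assuming the Parallel Computing Toolbox and a cluster profile for your university's system) is to submit each chunk as its own job with batch. This is only a sketch: 'MyClusterProfile' is a placeholder for your actual cluster profile name, and processOneJob is a hypothetical function that wraps the inner for loop above and returns the results for one chunk.

```matlab
% Sketch: submit each chunk as an independent cluster job via batch().
% Assumptions: 'MyClusterProfile' is your cluster's profile name, and
% processOneJob(fileStructs) is a hypothetical function that processes
% one chunk of file structs and returns a single output.
clust = parcluster('MyClusterProfile');
jobs = cell(1, jobNum);
for jobIdx = 1:jobNum
    jobs{jobIdx} = batch(clust, @processOneJob, 1, jobFiles(jobIdx));
end

% Later, collect the results as the jobs finish
results = cell(1, jobNum);
for jobIdx = 1:jobNum
    wait(jobs{jobIdx});
    out = fetchOutputs(jobs{jobIdx});
    results{jobIdx} = out{1};
end
```

Each batch call creates a separate scheduler job, so the chunks queue and run independently rather than as one large allocation.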