I have written a WGSL compute shader that computes a result from a given input.
Now I need to run this shader many times with different inputs; the compute steps themselves are identical every time. Creating a new pipeline for each run does produce the right results, but execution is exceedingly slow, presumably because of the overhead of pipeline creation, buffer initialization, and so on.
How can I use my pre-created WGSL pipeline multiple times (on different inputs) without creating a new pipeline every time?
let adapter = await navigator.gpu.requestAdapter();
let device = await adapter.requestDevice();
let module = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read_write> sample: array<u32, 720>;
    @group(0) @binding(1) var<storage, read_write> table: array<array<u32, 720>>;
    @group(0) @binding(2) var<storage, read_write> result: array<u32>;

    @compute @workgroup_size(1, 1, 1)
    fn computeThis(@builtin(global_invocation_id) id: vec3<u32>) {
      var diff: u32 = 0;
      for (var i: u32 = 0; i < 720; i++) {
        diff += (table[id.x][i] - sample[i]) * (table[id.x][i] - sample[i]);
      }
      result[id.x] = diff;
    }
  `,
});
let pipeline = device.createComputePipeline({layout: 'auto', compute: {module}});
let sampleBuffer = device.createBuffer({size: sample.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST});
let tableBuffer = device.createBuffer({size: table.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST});
let inputBuffer = device.createBuffer({size: input.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST});
let resultBuffer = device.createBuffer({size: input.byteLength, usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST});
let bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    {binding: 0, resource: {buffer: sampleBuffer}},
    {binding: 1, resource: {buffer: tableBuffer}},
    {binding: 2, resource: {buffer: inputBuffer}},
  ],
});
let encoder = device.createCommandEncoder();
let pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(LEN,1,1);
pass.end();
encoder.copyBufferToBuffer(inputBuffer, 0, resultBuffer, 0, resultBuffer.size);
device.queue.writeBuffer(sampleBuffer, 0, sample);
device.queue.writeBuffer(tableBuffer, 0, table);
device.queue.writeBuffer(inputBuffer, 0, input);
device.queue.submit([encoder.finish()]);
await resultBuffer.mapAsync(GPUMapMode.READ);
let result = new Uint32Array(resultBuffer.getMappedRange().slice());
resultBuffer.unmap();
How can I use my pre-created WGSL pipeline multiple times (on different inputs)
You create different buffers and bind groups, then set each bind group and dispatch in turn within the same pass:
let pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(LEN,1,1);
pass.setBindGroup(0, bindGroup2);
pass.dispatchWorkgroups(LEN,1,1);
pass.setBindGroup(0, bindGroup3);
pass.dispatchWorkgroups(LEN,1,1);
pass.end();
And/or upload the new data to the same buffer and then run your process again (though that would be slower).
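A minimal sketch of that second approach, wrapped in a reusable helper. It assumes the device, pipeline, bind group, and buffers created in the question's code; the function name and the separate storage/readback buffer parameters are illustrative, not part of the original:

```javascript
// Sketch: re-use the pre-created pipeline and bind group, uploading fresh
// sample data each call. No pipeline or buffer creation happens per run.
async function runWith(device, pipeline, bindGroup,
                       sampleBuffer, storageResultBuffer, readbackBuffer,
                       sample, LEN) {
  // Overwrite the sample data in place on the existing buffer.
  device.queue.writeBuffer(sampleBuffer, 0, sample);

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);        // same pipeline every time
  pass.setBindGroup(0, bindGroup);   // same bind group every time
  pass.dispatchWorkgroups(LEN, 1, 1);
  pass.end();

  // Copy the storage result into the mappable readback buffer.
  encoder.copyBufferToBuffer(storageResultBuffer, 0,
                             readbackBuffer, 0, readbackBuffer.size);
  device.queue.submit([encoder.finish()]);

  await readbackBuffer.mapAsync(GPUMapMode.READ);
  const out = new Uint32Array(readbackBuffer.getMappedRange().slice());
  readbackBuffer.unmap();
  return out;
}
```

Only the `writeBuffer` upload and the command encoding are repeated per run; everything expensive (shader compilation, pipeline creation, buffer allocation) stays outside the loop.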
Note: individual GPU cores are extremely slow, so `@workgroup_size(1,1,1)` wastes almost all of the hardware. In fact, in this article, a single core on an M1 Mac is measured at 30x slower than JavaScript, and a single core on an NVidia 2070 Super at 19x slower than JavaScript on an AMD Ryzen 9 3900XT.
GPUs get their speed from massive parallelization, and you generally need a workgroup size larger than (1,1,1) to take advantage of it.
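As a sketch, here is the same shader skeleton with a 64-wide workgroup and a bounds check; `WORKGROUP_SIZE` and the row count `LEN` are illustrative values, not from the original code. The dispatch count becomes the row count divided by the workgroup size, rounded up:

```javascript
// Sketch: 64 invocations per workgroup instead of 1 (sizes are illustrative).
const WORKGROUP_SIZE = 64;

// The shader needs a bounds check, because the grid is now rounded up
// to a multiple of WORKGROUP_SIZE and may overshoot the data.
const wgsl = `
@group(0) @binding(2) var<storage, read_write> result: array<u32>;

@compute @workgroup_size(${WORKGROUP_SIZE}, 1, 1)
fn computeThis(@builtin(global_invocation_id) id: vec3<u32>) {
  if (id.x >= arrayLength(&result)) { return; }
  // ... same per-row work as before ...
}`;

// One workgroup now covers 64 rows, so dispatch ceil(LEN / 64) workgroups
// instead of LEN workgroups.
const LEN = 1000; // hypothetical number of table rows
const workgroupCount = Math.ceil(LEN / WORKGROUP_SIZE);
// pass.dispatchWorkgroups(workgroupCount, 1, 1);
```

With the bounds check in place, the rounded-up grid is harmless, and the GPU can schedule the 64 invocations of each workgroup in parallel.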
Required disclosure: I'm a contributor to the article linked