Tags: gpu · shader · compute-shader · webgpu · wgsl

How can I process different inputs using a single premade WGSL pipeline?


I have written a WGSL compute shader that computes a result from a given input.

Now I need to run this shader many times with different inputs. The compute steps are identical every run. I could create a new pipeline each time and get correct results, but that is exceedingly slow, presumably because of the overhead of creating a new pipeline, initializing buffer data, and so on.

How can I use my pre-created WGSL pipeline multiple times (on different inputs) without creating a new pipeline every time?

        let adapter = await navigator.gpu.requestAdapter();
        let device = await adapter.requestDevice();

        let module = device.createShaderModule({code: `
            @group(0) @binding(0) var<storage, read_write> sample: array<u32, 720>;
            @group(0) @binding(1) var<storage, read_write> table: array<array<u32, 720>>;
            @group(0) @binding(2) var<storage, read_write> result: array<u32>;

            @compute @workgroup_size(1,1,1) fn computeThis(@builtin(global_invocation_id) id: vec3<u32>)
            {
                var diff : u32 = 0;
                for (var i : u32 = 0; i < 720; i++)
                {
                    diff += (table[id.x][i] - sample[i]) * (table[id.x][i] - sample[i]);
                }
                result[id.x] = diff;
            }
        `});

        let pipeline = device.createComputePipeline({layout: 'auto', compute: {module}});

        // sample, table, input, and LEN are defined elsewhere
        let sampleBuffer = device.createBuffer({size: sample.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST});
        let tableBuffer  = device.createBuffer({size: table.byteLength,  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST});
        let inputBuffer  = device.createBuffer({size: input.byteLength,  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST}); // bound as `result` in the shader
        let resultBuffer = device.createBuffer({size: input.byteLength,  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST}); // mappable copy for readback

        let bindGroup = device.createBindGroup({
            layout: pipeline.getBindGroupLayout(0),
            entries: [
                {binding: 0, resource: {buffer: sampleBuffer}},
                {binding: 1, resource: {buffer: tableBuffer}},
                {binding: 2, resource: {buffer: inputBuffer}},
            ],
        });

        // Uploads are queued first, so they execute before the submitted commands
        device.queue.writeBuffer(sampleBuffer, 0, sample);
        device.queue.writeBuffer(tableBuffer, 0, table);
        device.queue.writeBuffer(inputBuffer, 0, input);

        let encoder = device.createCommandEncoder();
        let pass = encoder.beginComputePass();
        pass.setPipeline(pipeline);
        pass.setBindGroup(0, bindGroup);
        pass.dispatchWorkgroups(LEN, 1, 1);
        pass.end();
        encoder.copyBufferToBuffer(inputBuffer, 0, resultBuffer, 0, resultBuffer.size);
        device.queue.submit([encoder.finish()]);

        await resultBuffer.mapAsync(GPUMapMode.READ);
        let result = new Uint32Array(resultBuffer.getMappedRange().slice());
        resultBuffer.unmap(); // only the mapped buffer needs unmapping


Solution

  • How can I use my pre-created WGSL pipeline multiple times (on different inputs)?

    Create different buffers and bind groups:

            let pass = encoder.beginComputePass();
            pass.setPipeline(pipeline);
            pass.setBindGroup(0, bindGroup);
            pass.dispatchWorkgroups(LEN,1,1);
            pass.setBindGroup(0, bindGroup2);
            pass.dispatchWorkgroups(LEN,1,1);
            pass.setBindGroup(0, bindGroup3);
            pass.dispatchWorkgroups(LEN,1,1);
            pass.end();
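    Here `bindGroup2` and `bindGroup3` are extra bind groups pointing at different buffers. One way to build them is a small helper like this sketch (`makeBindGroup` is a hypothetical name; `device`, `pipeline`, and the shared `tableBuffer` come from the question's setup):

    ```javascript
    // Hypothetical helper: one extra bind group = one new sample buffer plus
    // one new result buffer, both bound to the SAME pre-created pipeline layout.
    function makeBindGroup(device, pipeline, tableBuffer, sampleData, resultByteLength) {
      const sampleBuf = device.createBuffer({
        size: sampleData.byteLength,
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
      });
      device.queue.writeBuffer(sampleBuf, 0, sampleData);

      const resultBuf = device.createBuffer({
        size: resultByteLength, // one u32 per table row
        usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
      });

      const bindGroup = device.createBindGroup({
        layout: pipeline.getBindGroupLayout(0), // reuse the existing pipeline's layout
        entries: [
          { binding: 0, resource: { buffer: sampleBuf } },
          { binding: 1, resource: { buffer: tableBuffer } }, // table is shared across runs
          { binding: 2, resource: { buffer: resultBuf } },
        ],
      });
      return { bindGroup, resultBuf };
    }
    ```

    Only the buffers whose contents change per run need to be new; anything shared (like the lookup table) can stay in one buffer.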
    

    Or upload the new data to the same buffer and run your process again (though that would be slower).
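    For that same-buffer approach, only a small command buffer needs to be re-recorded per run; it is pipeline creation that is expensive. A minimal sketch (`runWithNewSample` is a hypothetical name; `device`, `pipeline`, `sampleBuffer`, `bindGroup`, and `LEN` are assumed from the question's setup):

    ```javascript
    // Hypothetical helper: reuse the pre-created pipeline and bind group,
    // overwriting one storage buffer with fresh input before each run.
    function runWithNewSample(device, pipeline, sampleBuffer, bindGroup, newSample, LEN) {
      // Queued first, so the upload executes before the command buffer submitted below.
      device.queue.writeBuffer(sampleBuffer, 0, newSample);

      const encoder = device.createCommandEncoder();
      const pass = encoder.beginComputePass();
      pass.setPipeline(pipeline);       // same pipeline every time
      pass.setBindGroup(0, bindGroup);  // same buffers every time
      pass.dispatchWorkgroups(LEN, 1, 1);
      pass.end();
      device.queue.submit([encoder.finish()]);
    }
    ```

    The slowness compared with multiple bind groups comes from serialization: run N+1's upload cannot overlap run N's dispatch on the same buffer.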

    Note: individual GPU cores are extremely slow, so @workgroup_size(1,1,1) wastes most of the hardware. In fact, in this article, a single core on an M1 Mac is 30x slower than JavaScript, and a single core on an NVIDIA 2070 Super is 19x slower than JavaScript on an AMD Ryzen 9 3900XT.

    GPUs get their speed from massive parallelization, and you generally need a workgroup size larger than (1,1,1) to take advantage of it.
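    As a sketch of what that change could look like for the question's shader (the workgroup size of 64 is an arbitrary example choice, and `workgroupCount` is a hypothetical helper):

    ```javascript
    // Reworked shader source: 64 invocations per workgroup instead of 1.
    // The arrayLength guard skips the out-of-range invocations in the last,
    // possibly partial, workgroup.
    const WORKGROUP_SIZE = 64;
    const shaderSrc = `
      @group(0) @binding(0) var<storage, read_write> sample: array<u32, 720>;
      @group(0) @binding(1) var<storage, read_write> table: array<array<u32, 720>>;
      @group(0) @binding(2) var<storage, read_write> result: array<u32>;

      @compute @workgroup_size(${WORKGROUP_SIZE}) fn computeThis(@builtin(global_invocation_id) id: vec3<u32>) {
        if (id.x >= arrayLength(&result)) { return; }
        var diff : u32 = 0;
        for (var i : u32 = 0; i < 720; i++) {
          diff += (table[id.x][i] - sample[i]) * (table[id.x][i] - sample[i]);
        }
        result[id.x] = diff;
      }
    `;

    // Workgroups needed to cover `len` rows at `size` invocations per workgroup.
    function workgroupCount(len, size) {
      return Math.ceil(len / size);
    }
    // Dispatch with: pass.dispatchWorkgroups(workgroupCount(LEN, WORKGROUP_SIZE));
    ```

    The dispatch count shrinks accordingly: instead of LEN workgroups of one invocation each, you dispatch ceil(LEN / 64) workgroups of 64.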

    Required disclosure: I'm a contributor to the article linked