Performance Tuning
Measurements have led us to the WebGPU API as the bottleneck. We'll examine and improve our use of the API while measuring the performance impact of our changes.
The initial version
Experience with performance tuning has taught me to look for code within loops, especially code within loops within loops. We see that nextFrame is invoked repeatedly, and that the step function it calls contains a loop. Clearly this innermost loop in the step function is where we are spending all our time.
The step function accepts a count for the number of iterations to perform. The count defaults to 20 iterations, but we supply a much larger value explicitly.
step(count=20)
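As a rough sketch of the structure described above, the animation loop might look like this. The use of requestAnimationFrame, the stepsPerFrame value, and the render helper are assumptions; only nextFrame and step come from the code we are tuning.

// Sketch of the assumed animation loop: each frame advances the simulation
// by many compute steps, then schedules the next frame.
const stepsPerFrame = 400; // assumed value, much larger than the default of 20

function nextFrame()
{
  if (!running) return;
  step(stepsPerFrame); // the inner loop we examine below
  render();            // assumed helper that draws the current wave function
  requestAnimationFrame(nextFrame);
}

requestAnimationFrame(nextFrame);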
Looking at the body of the method, our attention is immediately drawn to the queue.submit call within the loop. queue.submit interacts with the GPU over the PCIe bus, so it is very costly.
const workgroupCountX = Math.ceil(xResolution / 64);
for (let i=0; i<count && running; i++)
{
  // Created in the loop because it cannot be reused after finish is invoked.
  const commandEncoder = device.createCommandEncoder();
  const passEncoder = commandEncoder.beginComputePass();
  passEncoder.setPipeline(computePipeline);
  passEncoder.setBindGroup(0, parametersBindGroup);
  // Alternate between the two wave function bind groups each iteration.
  passEncoder.setBindGroup(1, waveFunctionBindGroup[i%2]);
  passEncoder.dispatchWorkgroups(workgroupCountX);
  passEncoder.end();
  // Submit GPU commands.
  device.queue.submit([commandEncoder.finish()]);
}
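Note the waveFunctionBindGroup[i%2] indexing: the two bind groups swap the roles of the input and output wave function buffers on alternating iterations, so each step reads the result of the previous one. As a sketch of how such a pair might be created (the buffer array and binding numbers are assumptions):

// Sketch (assumed names and bindings): two bind groups that swap which
// wave function buffer is read and which is written.
const waveFunctionBindGroup = [0, 1].map(i => device.createBindGroup({
  layout: computePipeline.getBindGroupLayout(1),
  entries: [
    { binding: 0, resource: { buffer: waveFunctionBuffers[i] } },      // read
    { binding: 1, resource: { buffer: waveFunctionBuffers[(i+1)%2] } } // write
  ]
}));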
The tuning
Moving the submit call to after the loop opens the way to a number of other optimizations. We no longer need a new compute pass for each iteration, so we can move the compute pass and the command encoder out of the loop as well. This leaves only setting the bind group and dispatching the computation within the loop.
// A command encoder cannot be reused after finish is invoked, so create a fresh one for each call.
const commandEncoder = device.createCommandEncoder();
const workgroupCountX = Math.ceil(xResolution / 64);
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, parametersBindGroup);
for (let i=0; i<count && running; i++)
{
  // Alternate between the two wave function bind groups each iteration.
  passEncoder.setBindGroup(1, waveFunctionBindGroup[i%2]);
  passEncoder.dispatchWorkgroups(workgroupCountX);
}
passEncoder.end();
// Submit all of the queued work to the GPU in a single call.
const gpuCommands = commandEncoder.finish();
device.queue.submit([gpuCommands]);
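Because all of the dispatches now arrive in a single submission, queue.submit returns almost immediately while the GPU works through the batch. If you want to know when that batch has actually finished, for example to time it, queue.onSubmittedWorkDone returns a promise that resolves once the submitted commands complete. A minimal sketch, assuming it runs inside an async function:

// Time the batched compute work (sketch; requires an async context).
const start = performance.now();
device.queue.submit([gpuCommands]);
await device.queue.onSubmittedWorkDone();
console.log(`${count} steps took ${(performance.now() - start).toFixed(1)} ms`);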
Run this tuned version after the original version has finished. You can try running them at the same time, but most browsers keep every animation in step with the slowest one.
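Alongside the Task Manager plots below, a simple frame counter gives a concrete number to compare. A minimal sketch, assuming it is called once per invocation of nextFrame:

// Report frames per second roughly once a second.
let frameCount = 0;
let lastReport = performance.now();

function countFrame()
{
  frameCount++;
  const now = performance.now();
  if (now - lastReport >= 1000)
  {
    console.log(`${(1000 * frameCount / (now - lastReport)).toFixed(1)} fps`);
    frameCount = 0;
    lastReport = now;
  }
}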
Task Manager
Now we have the opportunity to compare the GPU load for our different versions. The first version places a high load on the GPU over a long time; in fact, it extends beyond the length of the plot.


The updated code produces a radically different plot, indicating that our tuning has a significant impact. The load on the GPU is far lower and lasts for a much shorter time. The memory copy load is higher, but this is simply because we are rendering more frames per second. I wager that the area under the two memory copy curves is roughly the same.


