Performance Tuning

Measurements have led us to the WebGPU API as the bottleneck. We'll examine and improve our use of the API while measuring the performance impact of our changes.

The initial version

[Interactive figure: live FPS counter and a plot of |Ψ(x,t)|² vs x]

Experience with performance tuning has taught me to look for code within loops, especially code within loops within loops. We see that nextFrame is invoked repeatedly, and that the step function called from nextFrame contains a loop. This innermost loop in the step function is where we spend all our time.
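For orientation, the surrounding animation loop looks roughly like the sketch below. This is a minimal reconstruction assuming a requestAnimationFrame driven loop; render and stepsPerFrame are illustrative names, not taken from the actual source.

  // Minimal sketch of the animation loop (illustrative names, not the actual source).
  function nextFrame()
  {
    // The innermost loop lives inside step: one compute dispatch per iteration.
    step(stepsPerFrame);
    // Draw the current |Ψ(x,t)|² vs x plot.
    render();
    if (running)
    {
      requestAnimationFrame(nextFrame);
    }
  }
  requestAnimationFrame(nextFrame);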

The step function accepts a count for the number of iterations to perform. It defaults to 20 iterations; however, we supply a much larger value explicitly.


step(count=20)
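
At the call site we pass the count explicitly, along the lines of the sketch below. The value shown is hypothetical; the actual count is not given here.

  step(1000); // Hypothetical count, standing in for the much larger value we actually supply.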
            

Looking at the body of the method, our attention is immediately drawn to the queue.submit method call within the loop. Each queue.submit interacts with the GPU over the PCIe bus, so it is very costly.


  const workgroupCountX = Math.ceil(xResolution / 64);
  for (let i=0; i<count && running; i++)
  {
    // Created in the loop because an encoder cannot be reused after finish is invoked.
    const commandEncoder = device.createCommandEncoder();
    const passEncoder = commandEncoder.beginComputePass();
    passEncoder.setPipeline(computePipeline);
    passEncoder.setBindGroup(0, parametersBindGroup);
    // Ping-pong between the two wave function bind groups on alternate iterations.
    passEncoder.setBindGroup(1, waveFunctionBindGroup[i%2]);
    passEncoder.dispatchWorkgroups(workgroupCountX);

    passEncoder.end();
    // Submit the commands to the GPU, once per iteration.
    device.queue.submit([commandEncoder.finish()]);
  }
            

The tuning

Moving the submit call after the loop opens the way to a number of other optimizations. We no longer need a new compute pass for each iteration of the loop, so we can move the compute pass and command encoder out of the loop as well. This leaves only the bind group and the dispatch of the computations within the loop.


  // Recreate this because an encoder cannot be reused after finish is invoked.
  const commandEncoder = device.createCommandEncoder();
  const workgroupCountX = Math.ceil(xResolution / 64);
  const passEncoder = commandEncoder.beginComputePass();
  passEncoder.setPipeline(computePipeline);
  passEncoder.setBindGroup(0, parametersBindGroup);
  for (let i=0; i<count && running; i++)
  {
    // Only the ping-pong bind group and the dispatch remain in the loop.
    passEncoder.setBindGroup(1, waveFunctionBindGroup[i%2]);
    passEncoder.dispatchWorkgroups(workgroupCountX);
  }
  passEncoder.end();
  // Submit all the GPU commands in a single batch.
  const gpuCommands = commandEncoder.finish();
  device.queue.submit([gpuCommands]);
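
If we want to see how long a batch takes, we can wrap the final submit above with a timer. The sketch below uses the standard device.queue.onSubmittedWorkDone() promise; the timing code is illustrative and not part of the original source.

  const t0 = performance.now();
  device.queue.submit([gpuCommands]);
  // onSubmittedWorkDone resolves once all work submitted to the queue has completed on the GPU.
  device.queue.onSubmittedWorkDone()
        .then(() => console.log(`Batch of ${count} iterations took ${(performance.now() - t0).toFixed(1)} ms`));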
            
[Interactive figure: live FPS counter and |Ψ(x,t)|² vs x plot for the tuned version, with a button to start it]

Run this tuned version after the original version has finished. You can try running them at the same time; however, most browsers keep pace with the slowest animation.
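
The FPS numbers reported on this page can be produced with a simple counter in the animation loop. The sketch below is illustrative instrumentation, not the page's actual code.

  // Count frames over roughly one-second windows.
  let frameCount = 0;
  let windowStart = performance.now();

  function countFrame()
  {
    frameCount++;
    const now = performance.now();
    if (now - windowStart >= 1000)
    {
      console.log(`FPS: ${(1000*frameCount/(now - windowStart)).toFixed(1)}`);
      frameCount = 0;
      windowStart = now;
    }
    requestAnimationFrame(countFrame);
  }
  requestAnimationFrame(countFrame);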

Task Manager

Now we have the opportunity to compare the GPU load for our two versions. The first version places a high load on the GPU over a long time; in fact, it extends beyond the length of the plot.

The graphics engine utilization ranges from 50% to 80%.
The memory copy engine is also worth a look. It is consistently under 10%.

The updated code produces a radically different plot, indicating that our tuning is having a significant impact. The load on the GPU is far lower, and for a much shorter time. The memory copy load is higher, but this is simply because we are rendering more frames per second. I wager that the area under the two memory copy curves is roughly the same.

The graphics engine utilization peaks at just over 25%.
The memory copy engine is more heavily loaded, but for a shorter time. This is likely from the rendering work.
This work is licensed under a Creative Commons Attribution 4.0 International License.