Performance Tuning Gone Wrong

We carry the performance tuning to its logical next step and add a loop to the shader, so that each invocation carries out multiple steps of our simulation. This illustrates the need to test for correctness as well as for performance: a seemingly innocuous change causes problems.

When things go wrong


Clearly something has gone wrong here. See if you can figure it out while we work through the changes.

In the previous section we simplified a loop to reduce the number of WebGPU device queue submissions. We can carry this a bit further by adding a loop within the shader.


  let index = global_id.x;
  // Skip invocations when work groups exceed the actual problem size
  if (index >= parameters.xResolution) {
    return;
  }
  // waveFunction, and updatedWaveFunction have the same size.
  let dx = parameters.length / f32(parameters.xResolution);
  let dx22 = dx*dx*2.0;

  for (var i=0; i<10; i++)
  {
    var waveFunctionAtX = waveFunction[index];
    var waveFunctionAtXPlusDx = waveFunction[min(index+1, parameters.xResolution-1)];
    // index is unsigned, so index-1 would wrap around at index 0; clamp before subtracting.
    var waveFunctionAtXMinusDx = waveFunction[max(index, 1u) - 1u];

    updatedWaveFunction[index].x = waveFunctionAtX.x
                      - ((waveFunctionAtXPlusDx.y - 2.0*waveFunctionAtX.y + waveFunctionAtXMinusDx.y)
                          / dx22) * parameters.dt;

    updatedWaveFunction[index].y = waveFunctionAtX.y
                      + ((waveFunctionAtXPlusDx.x - 2.0*waveFunctionAtX.x + waveFunctionAtXMinusDx.x)
                          / dx22) * parameters.dt;

    waveFunctionAtX = updatedWaveFunction[index];
    waveFunctionAtXPlusDx = updatedWaveFunction[min(index+1, parameters.xResolution-1)];
    // Clamp before subtracting so the unsigned index cannot wrap around at index 0.
    waveFunctionAtXMinusDx = updatedWaveFunction[max(index, 1u) - 1u];

    waveFunction[index].x = waveFunctionAtX.x
                      - ((waveFunctionAtXPlusDx.y - 2.0*waveFunctionAtX.y + waveFunctionAtXMinusDx.y)
                          / dx22) * parameters.dt;

    waveFunction[index].y = waveFunctionAtX.y
                      + ((waveFunctionAtXPlusDx.x - 2.0*waveFunctionAtX.x + waveFunctionAtXMinusDx.x)
                          / dx22) * parameters.dt;
  }

This differs from our original implementation in two ways: we add a for loop, and within the loop we add a second step that computes new values for waveFunction from updatedWaveFunction. Thus, at the top of each loop iteration, the waveFunction array is meant to hold the newest values of the wave function.

We make a corresponding change in the number of compute shader steps per frame to account for the change in the amount of work done with each shader invocation.


  // Number of shader compute steps per frame
  const nsteps = 25;


  for (let i=0; i<count && this.#running; i++)
  {
    passEncoder.dispatchWorkgroups(workgroupCountX);
  }

All we have done is add iterations of the same calculation, so we would be forgiven for expecting the same results. However, this is not the case. What is happening here?

The trick is to realise that when work is submitted through the device queue, for example with dispatchWorkgroups, it is not executed until prior work that can alter the new work's dependencies has finished. When the loop is inside the shader, there is no such synchronization between invocations. An iteration in one invocation reads neighbouring values that other invocations may not have written yet, or may already have overwritten, so iterations that depend on each other run out of sync.
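The effect can be reproduced on the CPU without any WebGPU code. The following is a minimal sketch, not the shader itself: the grid size, time step, initial values, and the helper names updateCell, lockStep, and unsynchronized are all assumptions made for illustration. It applies the same stencil in two orders. In the first, every cell finishes a step before any cell starts the next, which is the ordering separate dispatches give us. In the second, each "invocation" runs its whole loop before the next one starts, which is just one of many orderings a GPU is free to choose.

```javascript
// CPU sketch only: hypothetical sizes and values, chosen for illustration.
const N = 8;        // number of grid points (assumed)
const dt = 0.01;    // time step (assumed)
const dx22 = 2.0;   // stands in for dx*dx*2.0 (assumed)

// One stencil update for cell i: read from src, write to dst.
function updateCell(src, dst, i) {
  const c = src[i];
  const p = src[Math.min(i + 1, N - 1)];
  const m = src[Math.max(i - 1, 0)];
  dst[i] = {
    x: c.x - ((p.y - 2.0 * c.y + m.y) / dx22) * dt,
    y: c.y + ((p.x - 2.0 * c.x + m.x) / dx22) * dt,
  };
}

// Arbitrary, non-constant starting values so that the stencil has work to do.
function initialState() {
  return Array.from({ length: N }, (_, i) => ({ x: Math.sin(i), y: 0.0 }));
}

// Lock-step: all cells complete a step before any cell begins the next one.
function lockStep(iterations) {
  const a = initialState();
  const b = initialState();
  for (let k = 0; k < iterations; k++) {
    for (let i = 0; i < N; i++) updateCell(a, b, i);
    for (let i = 0; i < N; i++) updateCell(b, a, i);
  }
  return a;
}

// Unsynchronized: each cell runs its entire loop before the next cell starts,
// so it reads neighbours that are stale or already several steps ahead.
function unsynchronized(iterations) {
  const a = initialState();
  const b = initialState();
  for (let i = 0; i < N; i++) {
    for (let k = 0; k < iterations; k++) {
      updateCell(a, b, i);
      updateCell(b, a, i);
    }
  }
  return a;
}

const expected = lockStep(10);
const actual = unsynchronized(10);

// Largest per-cell disagreement between the two orderings.
const maxDiff = Math.max(
  ...expected.map((e, i) => Math.abs(e.x - actual[i].x) + Math.abs(e.y - actual[i].y))
);
console.log(maxDiff > 0 ? "results differ" : "results match");
```

On a real GPU the ordering is not fixed like this and can change from run to run, so the error there need not be reproducible.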