Workgroup Size Tuning
We add synchronization to loops within the compute shader and explore override constants. However, this approach depends on workgroup size limits that are not universally available, so it should be used with caution. Feel free to skip this section if you are not interested in this type of optimization.
Storage Barriers
In the previous section we saw that moving the loop into the compute shader cost us the synchronization on resources that the GPU command queue provides. Can we add such synchronization within the shader itself?
Indeed, we can explicitly add synchronization to the compute shader with, for example, a storageBarrier. The storageBarrier guarantees that all threads within the workgroup have reached the barrier, and that their storage writes are complete. Only then are the threads allowed to continue beyond the barrier.
Pay particular attention to that restriction: the storageBarrier only affects threads within the same workgroup.
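To make the pattern concrete, here is a minimal sketch, with purely illustrative bindings and a workgroup size of 64: every invocation writes, the barrier makes those writes visible within the workgroup, and only then do the invocations read each other's results. It is only correct if the entire array is handled by that one workgroup.
@group(0) @binding(0) var<storage, read_write> source : array<f32>;
@group(0) @binding(1) var<storage, read_write> result : array<f32>;
@compute @workgroup_size(64)
fn halfStep(@builtin(global_invocation_id) id : vec3u)
{
  let last = arrayLength(&source) - 1u;
  // Each invocation writes one element of result.
  result[id.x] = 0.5 * (source[min(id.x + 1u, last)] + source[max(id.x, 1u) - 1u]);
  // Every invocation in this workgroup reaches the barrier, and its storage
  // writes are complete, before any invocation continues.
  storageBarrier();
  // Now it is safe to read an element of result written by another
  // invocation, but only within this same workgroup.
  source[id.x] = result[min(id.x + 1u, last)];
}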
For storage barriers to work, we need a larger workgroup, one large enough to encompass the entire problem. That is, we need to expand the workgroup size in the shader.
This immediately raises a few interesting questions. Does your version of WebGPU allow such a large workgroup? And if it supports large workgroups, how do we enable them? And finally, how do such large workgroups execute on the GPU?
Workgroup Size Limits
When we first sketched out the steps for implementing computations on the GPU, we mentioned resource limits. These limits have base values that are available across WebGPU systems, but there may be larger, system specific limits available. To make use of these higher limits, we must explicitly request them when we allocate the WebGPU device.
The two limits that we encounter are maxComputeWorkgroupSizeX and maxComputeInvocationsPerWorkgroup.
The maxComputeWorkgroupSizeX is the maximum size of a compute workgroup in the X direction. We set this with something like:
@compute @workgroup_size(1024)
where the y and z workgroup sizes are implicitly 1.
maxComputeInvocationsPerWorkgroup limits the total size of our workgroups, which is workgroup_sizeX * workgroup_sizeY * workgroup_sizeZ, or 1024*1*1 = 1024 in our case.
First, we check the system specific values for these limits and ensure that they are large enough for our purposes. Let's add a quick method to our WebGPUCompute class to facilitate this. Of course, we can not deploy this large workgroup version on systems that fail this check.
async checkLimit(limitName, targetValue)
{
  // Compare the adapter's system specific limit against the value we need.
  const adapter = await this.getAdapter();
  const limitValue = adapter.limits[limitName];
  return limitValue >= targetValue;
}
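For example, before committing to the large workgroup version we might check both limits against the problem size (the webGPUCompute instance name here is just for illustration):
const largeWorkgroupsAvailable =
      await webGPUCompute.checkLimit("maxComputeWorkgroupSizeX", xResolution)
   && await webGPUCompute.checkLimit("maxComputeInvocationsPerWorkgroup", xResolution);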
| Limit | Available Limit | Default Limit |
|---|---|---|
| maxComputeWorkgroupSizeX | system dependent, often 1024 or more | 256 |
| maxComputeInvocationsPerWorkgroup | system dependent, often 1024 or more | 256 |

As we see, each of these has a base limit of 256; however, on many systems larger limits of 1024 or greater are available.
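The Available Limit column is system specific. You can read the values for your own system directly from the adapter, for example:
const adapter = await navigator.gpu.requestAdapter();
console.log("maxComputeWorkgroupSizeX:          ", adapter.limits.maxComputeWorkgroupSizeX);
console.log("maxComputeInvocationsPerWorkgroup: ", adapter.limits.maxComputeInvocationsPerWorkgroup);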
Activating Limits
On systems with large enough workgroup size limits, we explicitly activate those limits when we request the device. We request limits that exactly match the problem size. We will see that this slightly simplifies some later tasks: we now dispatch exactly one workgroup, and we do not need to check the invocation ID against the problem size.
const deviceDescriptor = {
requiredLimits: {
maxComputeWorkgroupSizeX : xResolution,
maxComputeInvocationsPerWorkgroup: xResolution
}
}
device = await adapter.requestDevice(deviceDescriptor);
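Note that requestDevice rejects the request if we ask for more than the adapter offers, so the check above matters. On systems that fail it, we keep the base limits and the multi-dispatch shader from the previous section. A sketch, using the largeWorkgroupsAvailable flag from earlier:
device = largeWorkgroupsAvailable
       ? await adapter.requestDevice(deviceDescriptor)
       : await adapter.requestDevice(); // base limits only, use the multi-dispatch version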
Override Constants
Adjusting the workgroup size to match our problem size is central to our plan. The trick is that the @workgroup_size must be a constant. Luckily, there is a kind of, sort of, almost, constant that fits our needs. These overridable constants are set during the pipeline creation, and so are constant over the lifetime of the shader. You can even provide a default, which is used if no explicit value is provided.
computePipeline = device.createComputePipeline({
label: "Tuned shader pipeline.",
layout: device.createPipelineLayout({
bindGroupLayouts: [parametersBindGroupLayout, waveFunctionBindGroupLayout]
}),
compute: {
module: timeStepShaderModule,
entryPoint: "timeSteps",
constants: {
xResolution: xResolution,
iterations: iterations
}
}
});
For this compute pipeline, we set the problem size, xResolution, and, for flexibility, we also provide the opportunity to set the iteration count for our loop.
override iterations = 250;
override xResolution : u32 = 1024;
These variable declarations near the top of the shader set up the override constants, along with their defaults.
We then use this xResolution in the @workgroup_size directive. Similarly, we use xResolution everywhere we previously used parameters.xResolution. Indeed, in a real world application we would eliminate parameters.xResolution entirely; here, however, we keep it for backwards compatibility with the other shaders.
@compute @workgroup_size(xResolution)
We also use the iterations constant as the limit of the shader's for loop.
for (var i=0; i<iterations; i++)
{...}
struct Parameters {
dt: f32, // The time step, Δt.
xResolution: u32, // The number of points along the x-axis, the number of elements in the array.
length: f32 // The physical length for our simulation.
}
// group 0, things that never change within a simulation.
// The parameters for the simulation
@group(0) @binding(0) var<storage, read> parameters: Parameters;
//group 1, changes on each iteration
// Initial wave function at t.
@group(1) @binding(0) var<storage, read_write> waveFunction : array<vec2f>;
// The updated wave function at t+Δt.
@group(1) @binding(1) var<storage, read_write> updatedWaveFunction : array<vec2f>;
override iterations = 250;
override xResolution : u32 = 1024;
@compute @workgroup_size(xResolution)
fn timeSteps(@builtin(global_invocation_id) global_id : vec3u)
{
let index = global_id.x;
// waveFunction, and updatedWaveFunction have the same size.
let dx = parameters.length / f32(xResolution);
let dx22 = dx*dx*2.0;
for (var i=0; i<iterations; i++)
{
var waveFunctionAtX = waveFunction[index];
var waveFunctionAtXPlusDx = waveFunction[min(index+1, xResolution-1)];
var waveFunctionAtXMinusDx = waveFunction[max(index, 1u) - 1u]; // max before subtracting avoids u32 wrap-around at index 0
updatedWaveFunction[index].x = waveFunctionAtX.x
- ((waveFunctionAtXPlusDx.y - 2.0*waveFunctionAtX.y + waveFunctionAtXMinusDx.y)
/ dx22) * parameters.dt;
updatedWaveFunction[index].y = waveFunctionAtX.y
+ ((waveFunctionAtXPlusDx.x - 2.0*waveFunctionAtX.x + waveFunctionAtXMinusDx.x)
/ dx22) * parameters.dt;
storageBarrier();
waveFunctionAtX = updatedWaveFunction[index];
waveFunctionAtXPlusDx = updatedWaveFunction[min(index+1, xResolution-1)];
waveFunctionAtXMinusDx = updatedWaveFunction[max(index, 1u) - 1u];
waveFunction[index].x = waveFunctionAtX.x
- ((waveFunctionAtXPlusDx.y - 2.0*waveFunctionAtX.y + waveFunctionAtXMinusDx.y)
/ dx22) * parameters.dt;
waveFunction[index].y = waveFunctionAtX.y
+ ((waveFunctionAtXPlusDx.x - 2.0*waveFunctionAtX.x + waveFunctionAtXMinusDx.x)
/ dx22) * parameters.dt;
storageBarrier();
}
}
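On the JavaScript side the dispatch becomes trivial: the single workgroup covers every grid point, and the loop inside the shader covers the time steps. A sketch of the encoding, with the bind group names assumed from the earlier setup:
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, parametersBindGroup);
passEncoder.setBindGroup(1, waveFunctionBindGroup);
// One workgroup of xResolution invocations covers the whole problem,
// and the loop in the shader covers the iterations.
passEncoder.dispatchWorkgroups(1);
passEncoder.end();
device.queue.submit([commandEncoder.finish()]);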
Finally, we could produce a very similar effect with variable substitution into the shader string. JavaScript template literals provide a convenient way to embed JavaScript variables in strings. In fact, template literals are marked with backticks, `...`, which is what we already use to represent our shaders as a single multiline string.
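A sketch of that alternative, splicing the values into the WGSL source when the module is created (the timeStepShaderSource name is just for illustration, and the shader is abridged):
const timeStepShaderSource = `
  // Values baked into the source by the template literal.
  const xResolution : u32 = ${xResolution};
  const iterations : i32 = ${iterations};

  // ... the Parameters struct and storage bindings as before ...

  @compute @workgroup_size(xResolution)
  fn timeSteps(@builtin(global_invocation_id) global_id : vec3u)
  {
    // ... the same loop body as the override constant version ...
  }
`;
const timeStepShaderModule = device.createShaderModule({ code: timeStepShaderSource });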
Task Manager
Now we have the opportunity to compare the GPU load for our different versions. The first version places a high load on the GPU over a long time; in fact, it extends beyond the length of the plot.


The updated code produces a radically different plot, indicating that our tuning is having a significant impact. The load on the GPU is far lower, and lasts for a much shorter time. The memory copy load is higher, but that is simply because we are rendering more frames per second; I wager that the area under the two memory copy curves is roughly the same.


