Block Size Tuning
This time we fit the problem within one workgroup by increasing the amount of work done by each thread. Each thread computes a block of results, sized so that the block size multiplied by the workgroup size matches the problem size. We measure the performance impact of this optimization, with disappointing results. Feel free to skip this section if you are not interested in this type of optimization.
We have been working on improving the performance of our code. The original version submitted individual requests to the GPU queue for each step of the simulation. Our first revision submitted all the steps together in a single request and produced a noticeable performance gain. Continuing with the theme of simplifying the device queue interactions, we then increased the workgroup size and moved the step count into the compute shader, again producing a strong performance improvement. A natural alternative to increasing the workgroup size is to increase the amount of work done on each thread. This is the technique we examine next.
Block Size
An important aspect of the last effort was increasing the workgroup size to encompass the entire problem. This allowed us to use storageBarriers to coordinate the execution of our loops. Another way to fit the problem within a single workgroup, and thus maintain the effectiveness of the storageBarrier, is to increase the amount of work done on each thread while maintaining the smaller workgroup size.
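Concretely, for our 1024-point problem and a 64-thread workgroup, each thread must cover a block of 16 points so that every point is updated exactly once. Here is a quick sketch of that sizing rule, using illustrative variable names rather than anything from the actual code:
const xResolution = 1024;                       // points along the x-axis
const workgroupSize = 64;                       // threads in the single workgroup
const blockSize = xResolution / workgroupSize;  // 16 wave function values per thread
console.assert(blockSize * workgroupSize === xResolution, "every point is covered exactly once");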
So far we have used the compute shader thread identity directly.
let index = global_id.x;
We then computed the updates to the wave function based on this index.
updatedWaveFunction[index] = ...
This time we introduce the concept of a block, where a single invocation of the compute shader updates multiple wave function values. We start with the block size, the number of wave function values that are updated on each invocation.
override blockSize = 16u;
We then use the thread identity to define the start of the block.
let firstPsiIndex = global_id.x*blockSize;
And loop over the block of wave function updates.
for (var k=0u; k<blockSize; k++) {
  let thisPsiIndex = firstPsiIndex+k;
  updatedWaveFunction[thisPsiIndex] = ...
}
For example, with a block size of 16, the thread with global_id.x equal to 3 has a firstPsiIndex of 48 and updates wave function values 48 through 63.
With all these changes we revert the workgroup size back to 64.
@compute @workgroup_size(64)
Finally, we change the name of the compute shader entry point to timeSteps because it now executes multiple steps of our simulation.
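On the host side, this entry point name appears when the compute pipeline is created, and the override constants can be set there as well, so the block size and iteration count can be tuned without editing the shader source. A minimal sketch, assuming a compiled shader module named shaderModule and a pipeline layout named pipelineLayout from the earlier setup:
const timeStepsPipeline = device.createComputePipeline({
  layout: pipelineLayout,
  compute: {
    module: shaderModule,
    entryPoint: "timeSteps",
    // Optionally override the defaults declared in the shader.
    constants: { blockSize: 16, iterations: 250 }
  }
});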
We pull this all together into a modified shader.
struct Parameters {
  dt: f32,           // The time step, Δt.
  xResolution: u32,  // The number of points along the x-axis, the number of elements in the array.
  length: f32        // The physical length for our simulation.
}

// Group 0, things that never change within a simulation.
// The parameters for the simulation.
@group(0) @binding(0) var<storage, read> parameters: Parameters;

// Group 1, changes on each iteration.
// Initial wave function at t.
@group(1) @binding(0) var<storage, read_write> waveFunction : array<vec2f>;
// The updated wave function at t+Δt.
@group(1) @binding(1) var<storage, read_write> updatedWaveFunction : array<vec2f>;

override iterations = 250;
// Our problem size / workgroup size = 1024 / 64 = 16.
override blockSize = 16u;

@compute @workgroup_size(64)
fn timeSteps(@builtin(global_invocation_id) global_id : vec3u)
{
  // waveFunction and updatedWaveFunction have the same size.
  let dx = parameters.length / f32(parameters.xResolution);
  let dx22 = dx*dx*2.0;
  let firstPsiIndex = global_id.x*blockSize;

  for (var i=0; i<iterations; i++)
  {
    for (var k=0u; k<blockSize; k++)
    {
      let thisPsiIndex = firstPsiIndex+k;
      let waveFunctionAtX = waveFunction[thisPsiIndex];
      let waveFunctionAtXPlusDx = waveFunction[min(thisPsiIndex+1, parameters.xResolution-1)];
      // Clamp at the lower boundary; max(thisPsiIndex, 1u) - 1u avoids unsigned wrap-around at index 0.
      let waveFunctionAtXMinusDx = waveFunction[max(thisPsiIndex, 1u) - 1u];

      updatedWaveFunction[thisPsiIndex].x = waveFunctionAtX.x
          - ((waveFunctionAtXPlusDx.y - 2.0*waveFunctionAtX.y + waveFunctionAtXMinusDx.y)
             / dx22) * parameters.dt;

      updatedWaveFunction[thisPsiIndex].y = waveFunctionAtX.y
          + ((waveFunctionAtXPlusDx.x - 2.0*waveFunctionAtX.x + waveFunctionAtXMinusDx.x)
             / dx22) * parameters.dt;
    }

    storageBarrier();

    for (var k=0u; k<blockSize; k++)
    {
      let thisPsiIndex = firstPsiIndex+k;
      let waveFunctionAtX = updatedWaveFunction[thisPsiIndex];
      let waveFunctionAtXPlusDx = updatedWaveFunction[min(thisPsiIndex+1, parameters.xResolution-1)];
      // Clamp at the lower boundary; max(thisPsiIndex, 1u) - 1u avoids unsigned wrap-around at index 0.
      let waveFunctionAtXMinusDx = updatedWaveFunction[max(thisPsiIndex, 1u) - 1u];

      waveFunction[thisPsiIndex].x = waveFunctionAtX.x
          - ((waveFunctionAtXPlusDx.y - 2.0*waveFunctionAtX.y + waveFunctionAtXMinusDx.y)
             / dx22) * parameters.dt;

      waveFunction[thisPsiIndex].y = waveFunctionAtX.y
          + ((waveFunctionAtXPlusDx.x - 2.0*waveFunctionAtX.x + waveFunctionAtXMinusDx.x)
             / dx22) * parameters.dt;
    }

    storageBarrier();
  }
}
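Driving this shader takes a single dispatch of one 64-thread workgroup; with each thread covering a block of 16 values, that one workgroup spans the entire 1024-point problem, and all 250 iterations run inside it. Here is a minimal sketch of the submission, assuming the device, pipeline, and bind groups (parametersBindGroup and waveFunctionBindGroup) carry over from the earlier chapters:
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(timeStepsPipeline);
pass.setBindGroup(0, parametersBindGroup);   // simulation parameters
pass.setBindGroup(1, waveFunctionBindGroup); // the two wave function buffers
pass.dispatchWorkgroups(1);                  // one workgroup of 64 threads
pass.end();
device.queue.submit([encoder.finish()]);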
Task Manager
We have three interesting alternatives, each trading off performance against code complexity and portability.
Single Queue Request
This is our first optimization, where we simply coalesced many GPU commands into a single request. It produced a good performance benefit with a small effort. This should be your base case, as it represents minimal effort and is portable across WebGPU implementations.
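For reference, the pattern there was to record every per-step dispatch into one command buffer and submit it once. A rough sketch, assuming a per-step pipeline named stepPipeline and two bind groups, bindGroupA and bindGroupB, that swap which buffer is read and which is written on each step:
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(stepPipeline);
pass.setBindGroup(0, parametersBindGroup);
for (let step = 0; step < 250; step++) {       // 250 simulation steps
  // Alternate the read and write buffers on each step.
  pass.setBindGroup(1, step % 2 === 0 ? bindGroupA : bindGroupB);
  pass.dispatchWorkgroups(1024 / 64);          // 16 workgroups of 64 threads
}
pass.end();
device.queue.submit([encoder.finish()]);       // a single request for all the steps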


Shader Loop Performance
We eliminated most of the GPU commands by moving the loop over simulation steps into the compute shader. This also required expanding the workgroup size to fit the entire problem, so that a single dispatch of the larger workgroup executes every step of the simulation. This produced a surprising gain in efficiency. However, this approach has portability issues on systems with smaller workgroup size limits.


Large Block Size Performance
There is no noticeable difference in the frame rate; however, the task manager shows a clear regression in the resources consumed on the GPU. The memory copy engine seems to be doing a bit more work, probably because of the distributed reads and writes of the wave function data.
Still, while the GPU workload is higher than in the large workgroup case, it is a bit lower than in the base case. It is up to you whether the additional work is worth the gain in efficiency.


