Performance Measurements

Looking at the simulation, our intuition is that there is room for improvement in its performance. However, intuition is not enough; we must measure. We introduce two measurements. The first measures frames per second using pure JavaScript, capturing a holistic view: JavaScript, WebGPU API interactions, and shader execution. The second uses WebGPU timestamp queries to track the execution of the shaders. Combining these two methods allows us to identify bottlenecks and measure the impact of our performance tuning.

An FPS Counter

[Interactive plot: |Ψ(x,t)|² vs x, with an FPS counter.]

As noted above, our intuition says there is room for improvement, but intuition is not enough; we must measure. A first pass at measurement is an FPS counter.

Luckily, requestAnimationFrame, the mechanism we use to generate frames for the animation, offers a natural way to track the frames generated per second: it passes a millisecond timestamp to its callback, which we use to compute a running average of the frames per second.

The first thing is to add a placeholder for the indicator to the wave function display.


        FPS: <span id="fps0"></span>
            

On every generation of a frame, we update the FPS counter. We capture the start time on the first iteration and thereafter use it to compute the average FPS.


  const fpsDisplay = document.getElementById(fpsID);
  const SECONDS_PER_MILLISECOND = .001;
  let nframes = 0;
  let firstFrameTime = 0.0;
  ...
  function nextFrame(currentFrameTime)
  {
    ...
    if (nframes == 0) {
      firstFrameTime = currentFrameTime;
    } else {
      // currentFrameTime is in milliseconds, so convert the elapsed
      // time to seconds before dividing.
      const elapsedSeconds = SECONDS_PER_MILLISECOND * (currentFrameTime - firstFrameTime);
      fpsDisplay.innerText = (nframes / elapsedSeconds).toFixed(2);
    }
    nframes++;
  }
            

This yields a subpar 11 to 14 FPS. This FPS count verifies that there is room for improvement, and provides a baseline against which we can measure the impact of our changes. This FPS technique is basic JavaScript and is available across platforms.
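The running-average computation above can be factored into a small, framework-free helper and exercised outside the browser. This is a minimal sketch; the name makeFpsAverager is ours, not part of the simulation code.

```javascript
// A pure helper implementing the same running-average FPS
// computation as the nextFrame code above, testable without a DOM.
function makeFpsAverager() {
  const MILLISECONDS_PER_SECOND = 1000;
  let nframes = 0;
  let firstFrameTime = 0;

  // Call once per frame with the requestAnimationFrame timestamp
  // (milliseconds). Returns the average FPS so far, or null on the
  // first frame, before any interval exists.
  return function update(currentFrameTime) {
    if (nframes === 0) {
      firstFrameTime = currentFrameTime;
      nframes++;
      return null;
    }
    nframes++;
    const elapsedSeconds =
      (currentFrameTime - firstFrameTime) / MILLISECONDS_PER_SECOND;
    return (nframes - 1) / elapsedSeconds;
  };
}
```

A running average over the whole session smooths out jitter, at the cost of responding slowly to changes; a windowed or exponentially weighted average is a common alternative when you want to see transient slowdowns.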

Timestamp Queries

Queries allow us to retrieve data from the WebGPU queue. Timestamp queries, specifically, generate a nanosecond timestamp for the requested point in the command queue. This allows for a much more fine-grained and more accurate timing of WebGPU applications. The downside is that they are an optional feature, and may not be available on a given system.

[Interactive plot: |Ψ(x,t)|² vs x, with an FPS counter and the shader execution time reported via timestamp queries.]

Let's start with a couple of constants that we use to query for and set up the timestamp feature.


  const TIMESTAMP_QUERY_FEATURE_NAME = "timestamp-query";
  const TIMESTAMP_QUERY_TYPE = "timestamp";
            

The first step is to determine if timestamp queries are supported, and capture it into a variable. While we don't show it explicitly here, the timestamp query code is guarded by checks against this variable.


  hasTimestampQuery = adapter.features.has(TIMESTAMP_QUERY_FEATURE_NAME);
            

For systems that support timestamp queries, we list it as a required feature when we get the device.


  device = await adapter.requestDevice({
    requiredFeatures: hasTimestampQuery ? [TIMESTAMP_QUERY_FEATURE_NAME] : []
  });
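
Requesting a feature the adapter does not support causes requestDevice to reject, so the feature list should be built from what the adapter reports. A sketch of that selection as a pure helper; the name selectRequiredFeatures is ours, not a WebGPU API.

```javascript
// Keep only the features the adapter actually reports; passing an
// unsupported name to requestDevice would reject the promise.
function selectRequiredFeatures(adapterFeatures, wantedFeatures) {
  return wantedFeatures.filter((name) => adapterFeatures.has(name));
}

// Assumed usage with a real adapter:
// device = await adapter.requestDevice({
//   requiredFeatures: selectRequiredFeatures(adapter.features,
//                                            ["timestamp-query"]),
// });
```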
            

Queries are submitted and tracked with a query set. We create a set of two timestamp queries, one for the top of the compute pass, and one for the end. The difference between these timestamps indicates the time consumed by our compute shader.


  timestampQueries = device.createQuerySet({
    type: TIMESTAMP_QUERY_TYPE,
    count: 2,
  });
            

We need to set up a couple of buffers. Each timestamp query produces a 64-bit integer, so we allocate enough space in each buffer to hold two 64-bit integers. The first buffer is loaded with the results of the query.


  timestampBuffer = device.createBuffer({
    label: "Time stamp query buffer",
    size: timestampQueries.count * BigInt64Array.BYTES_PER_ELEMENT,
    usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
  });
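
The size arithmetic is worth pinning down: each timestamp occupies one 64-bit (8-byte) integer, so a two-query set needs 16 bytes. A quick sanity check of the size expression used above:

```javascript
// One 64-bit integer (8 bytes) per timestamp query.
const QUERY_COUNT = 2; // matches the count passed to createQuerySet
const timestampBufferSize = QUERY_COUNT * BigInt64Array.BYTES_PER_ELEMENT;
// timestampBufferSize is 16
```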
            

The query results are then copied to a second buffer, which is mapped to the CPU where we can access the data. Some of the APIs that WebGPU is built on allow these buffers to be combined, but some do not, so WebGPU requires them to be separate buffers.


  timestampCopyBuffer = device.createBuffer({
    label: "Time stamp mappable buffer",
    size: timestampBuffer.size,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });
            

The timestampWrites property in the descriptor describes when timestamp data is collected, and which query is used to collect the data. timestampWrites has three properties: querySet, beginningOfPassWriteIndex, and endOfPassWriteIndex, where either of the write index properties can be omitted.

The querySet contains the time stamp queries that will be executed at the indicated points in the command queue.

The beginningOfPassWriteIndex indicates which, if any, of the querySet is to be executed at the beginning of the pass.

The endOfPassWriteIndex indicates which, if any, of the querySet is to be executed at the end of the pass.

This means that we can only collect timing data at the beginning and end of a compute pass. Again, this restriction stems from the requirement to implement WebGPU across multiple underlying graphics APIs.

You may see commandEncoder.writeTimestamp calls in some examples. However, that method has been removed from the WebGPU specification, so such calls need to be replaced by the timestampWrites descriptor shown here.


  const passEncoder = commandEncoder.beginComputePass({
    timestampWrites: {
      querySet: timestampQueries,
      beginningOfPassWriteIndex: 0,
      endOfPassWriteIndex: 1
    }
  });
  ...
  passEncoder.end();
            

Once the compute pass is finished, invoke resolveQuerySet to copy the query results into a buffer.


  commandEncoder.resolveQuerySet(
    timestampQueries,       // The GPUQuerySet
    0,                      // The first query
    timestampQueries.count, // The query count
    timestampBuffer,        // GPUBuffer destination
    0);                     // Destination offset
            

Then we copy the timestamps to a mappable buffer, which allows us to access them from the CPU side.


  commandEncoder.copyBufferToBuffer(
    timestampBuffer,      // GPUBuffer we copy from
    0,                    // Start at the beginning of the source
    timestampCopyBuffer,  // GPUBuffer we copy to
    0,                    // Start at the beginning of the destination
    timestampBuffer.size  // Copy the full contents of the source
  );
            

Now we can map the timestampCopyBuffer to the CPU.


  await timestampCopyBuffer.mapAsync(GPUMapMode.READ);
            

Wrap the buffered data in a typed array to make it available to JavaScript. In this case, the data is 64-bit integers holding nanosecond timestamps.


  const timestampArrayBuffer = timestampCopyBuffer.getMappedRange();
  const timestampNanoseconds = new BigInt64Array(timestampArrayBuffer);
            

The difference between the timestamps is the time consumed by our shader.


  deltaT = timestampNanoseconds[1] - timestampNanoseconds[0];
            

Interestingly, this is on the order of 10⁻⁵ seconds on my middling test system, so it is very small compared with the total time needed for a frame.
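For display, the BigInt difference must be converted to an ordinary number. A sketch of the conversion (the helper name is ours), which is safe because per-frame shader timings sit far below the 2^53 precision limit of a JavaScript double:

```javascript
// Convert a BigInt nanosecond interval to seconds for display.
// Number() is safe here: per-frame shader timings are far below
// the 2^53 precision limit of a JavaScript double.
const NANOSECONDS_PER_SECOND = 1e9;

function nanosecondsToSeconds(deltaT) {
  return Number(deltaT) / NANOSECONDS_PER_SECOND;
}
```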

Of course, don't forget to unmap the buffer when we are done with the data, returning it to the GPU for later use.


  timestampCopyBuffer.unmap();
            

Now that we have made some performance measurements, we see that the compute shader is actually very fast; the frame rate, however, leaves much to be desired. In the next section we look at improving our use of the WebGPU API to raise that frame rate.

We also see that timestamp queries have a significant performance impact. Now that we know the compute shaders are performant, we will remove the code from our simulation. In general, timestamp queries should be used in the development cycle only while you are tuning the shaders, and certainly not carried over to production.
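One way to keep the measurement code development-only (our own convention, not a WebGPU mechanism) is to gate every timestamp-query call on a single predicate that combines a build-time switch with the feature check:

```javascript
// Gate all timestamp-query work on one predicate so it can be
// switched off wholesale for production builds.
function shouldCollectTimestamps(timingEnabled, adapterFeatures) {
  return timingEnabled && adapterFeatures.has("timestamp-query");
}

// Assumed usage:
// const ENABLE_GPU_TIMING = false; // flip to true while tuning shaders
// if (shouldCollectTimestamps(ENABLE_GPU_TIMING, adapter.features)) {
//   ... create query set, buffers, and timestampWrites descriptor ...
// }
```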

Task Manager

The Windows Task Manager also provides some insight into graphics card performance. Open the Task Manager, then select the Performance tab. Along the right side of the Performance tab, select the GPU used by the simulation. Many systems have both an integrated and a discrete GPU and let you choose which one your browser will use. In the worst case, you can simply watch the GPU activity and note which one becomes active when you run the simulation.

Original Performance

[Figure: Task Manager GPU graphs for the original simulation. The graphics engine utilization ranges from 50% to 80%; the memory copy engine, also worth a look, stays consistently under 10%.]

We see the graphics engine is busy for the entire 60-second duration of our plot, and the copy engine is lightly busy as well. It will be interesting to compare these with later results from tuned versions of the code.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.