Performance Measurements
Looking at the simulation, our intuition is that there is room for improvement in its performance. However, intuition is not enough; we must measure. We introduce two measurement techniques. The first measures the frames per second using pure JavaScript. This captures a holistic view: JavaScript, WebGPU API interactions, and shader execution. The second uses WebGPU timestamp queries to track the execution of the shaders. Combining these two methods allows us to identify bottlenecks and measure the impact of our performance tuning.
An FPS Counter
A first pass at measurement is an FPS counter.
Luckily, the mechanism we use to generate frames for the animation, requestAnimationFrame, has a natural way to track the generated frames per second. The requestAnimationFrame callback receives a millisecond timestamp, which we use to track a running average of the frames per second.
The first step is to add a placeholder for the indicator to the wave function display.
FPS: <span id="fps0"></span>
On every generation of a frame, we update the FPS counter. We capture the start time on the first iteration and thereafter use it to compute the average FPS.
const fpsDisplay = document.getElementById(fpsID);
const SECONDS_PER_MILLISECOND = .001;
let nframes = 0;
let firstFrameTime = 0.0;
...
function nextFrame(currentFrameTime)
{
    ...
    if (nframes == 0) {
        firstFrameTime = currentFrameTime;
    } else {
        // Convert the elapsed milliseconds to seconds to get frames per second.
        const elapsedSeconds = (currentFrameTime - firstFrameTime) * SECONDS_PER_MILLISECOND;
        fpsDisplay.innerText = (nframes / elapsedSeconds).toFixed(2);
    }
    nframes++;
}
This yields a subpar 11 to 14 FPS. This FPS count verifies that there is room for improvement, and provides a baseline against which we can measure the impact of our changes. This FPS technique is basic JavaScript and is available across platforms.
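The running average above smooths over the whole session, so it responds slowly to changes. A variant that averages only the most recent frames reacts faster when we start tuning. This is a sketch, not part of the simulation code; the `RollingFPS` class name and default window size are our own.

```javascript
// Sketch of a rolling-window FPS estimator (hypothetical helper). It keeps
// the last `windowSize` frame timestamps and derives the rate from the span
// of time they cover.
class RollingFPS {
  constructor(windowSize = 60) {
    this.windowSize = windowSize;
    this.timestamps = []; // millisecond timestamps of recent frames
  }

  // Record a frame; returns the current FPS estimate, or 0 until at least
  // two frames have been seen.
  update(frameTimeMs) {
    this.timestamps.push(frameTimeMs);
    if (this.timestamps.length > this.windowSize) {
      this.timestamps.shift(); // drop the oldest frame
    }
    const n = this.timestamps.length;
    if (n < 2) return 0;
    const elapsedSeconds = (this.timestamps[n - 1] - this.timestamps[0]) * 0.001;
    return (n - 1) / elapsedSeconds; // frames per second over the window
  }
}
```

In `nextFrame` we would call `update(currentFrameTime)` once per frame and write the returned value to the display.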
Timestamp Queries
Queries allow us to retrieve data from the WebGPU queue. Timestamp queries, specifically, generate a nanosecond timestamp for the requested point in the command queue. This allows for a much more fine-grained and more accurate timing of WebGPU applications. The downside is that they are an optional feature, and may not be available on a given system.
Shader execution time: 0.00 seconds
Let's start with a couple of constants that we use to query for and set up the timestamp feature.
const TIMESTAMP_QUERY_FEATURE_NAME = "timestamp-query";
const TIMESTAMP_QUERY_TYPE = "timestamp";
The first step is to determine if timestamp queries are supported, and capture it into a variable. While we don't show it explicitly here, the timestamp query code is guarded by checks against this variable.
hasTimestampQuery = adapter.features.has(TIMESTAMP_QUERY_FEATURE_NAME);
For systems that support timestamp queries, we list it as a required feature when we get the device.
device = await adapter.requestDevice({
requiredFeatures: [TIMESTAMP_QUERY_FEATURE_NAME]
});
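Combining the feature check with the device request, one way to keep a single code path on systems with and without the feature is a small helper that builds the feature list. This is a sketch; the helper name is our own, while the `"timestamp-query"` feature name comes from the WebGPU specification.

```javascript
// Hypothetical helper: request the timestamp-query feature only when the
// adapter offers it. An empty requiredFeatures list is valid, so the same
// requestDevice call works either way.
function requiredFeaturesFor(adapterFeatures) {
  // adapterFeatures is set-like (GPUSupportedFeatures, or a plain Set here)
  return adapterFeatures.has("timestamp-query") ? ["timestamp-query"] : [];
}

// Usage during setup (requires a WebGPU-capable browser):
// const adapter = await navigator.gpu.requestAdapter();
// const device = await adapter.requestDevice({
//   requiredFeatures: requiredFeaturesFor(adapter.features),
// });
```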
Queries are submitted and tracked with a query set. We create a set of two timestamp queries, one for the top of the compute pass, and one for the end. The difference between these timestamps indicates the time consumed by our compute shader.
timestampQueries = device.createQuerySet({
    type: TIMESTAMP_QUERY_TYPE,
    count: 2,
});
We need to set up a couple of buffers. Each timestamp query produces a 64-bit integer, so we allocate enough space in each buffer to hold two 64-bit integers. The first buffer will be loaded with the results of the query.
timestampBuffer = device.createBuffer({
label: "Time stamp query buffer",
size: timestampQueries.count * BigInt64Array.BYTES_PER_ELEMENT,
usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
});
The query results are then copied to a second buffer, which is mapped to the CPU where we can access the data. Some of the APIs that WebGPU is built on allow these buffers to be combined, but some do not, so WebGPU requires them to be separate buffers.
timestampCopyBuffer = device.createBuffer({
label: "Time stamp mappable buffer",
size: timestampBuffer.size,
usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});
The timestampWrites property in the compute pass descriptor describes when timestamp data is collected, and which query is used to collect the data. timestampWrites has three properties: querySet, beginningOfPassWriteIndex, and endOfPassWriteIndex, where either of the write index properties can be omitted.
The querySet contains the timestamp queries that will be executed at the indicated points in the command queue. The beginningOfPassWriteIndex indicates which entry of the querySet, if any, is to be executed at the beginning of the pass. The endOfPassWriteIndex indicates which entry, if any, is to be executed at the end of the pass.
This means that we can only collect timing data at the beginning and end of a compute pass. Again, this restriction stems from the requirement to implement WebGPU across multiple underlying graphics APIs.
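If we later split the work across several compute passes, a shared query set with one begin/end pair per pass lets us time each stage separately. A hypothetical helper (the function name and index convention are our own) that assigns the index pair for a given pass:

```javascript
// Sketch: build the timestampWrites descriptor for pass number `passIndex`,
// assigning query indices 2*i (begin) and 2*i+1 (end) within a shared
// query set created with count = 2 * numberOfPasses.
function timestampWritesForPass(querySet, passIndex) {
  return {
    querySet,
    beginningOfPassWriteIndex: 2 * passIndex,
    endOfPassWriteIndex: 2 * passIndex + 1,
  };
}

// Usage (requires a WebGPU device and command encoder):
// const queries = device.createQuerySet({ type: "timestamp", count: 4 });
// const pass = commandEncoder.beginComputePass({
//   timestampWrites: timestampWritesForPass(queries, 1),
// });
```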
You may see commandEncoder.writeTimestamp calls in some examples. However, this is an old API that has since been dropped. Such calls will likely need to be replaced by the descriptor attributes shown here.
const passEncoder = commandEncoder.beginComputePass({
timestampWrites: {
querySet: timestampQueries,
beginningOfPassWriteIndex: 0,
endOfPassWriteIndex: 1
}
});
...
passEncoder.end();
Once the compute pass is finished, invoke resolveQuerySet to copy the query results into a buffer.
commandEncoder.resolveQuerySet(
timestampQueries, // The GPUQuerySet
0, // The first query
timestampQueries.count, // The query count
timestampBuffer, // GPUBuffer destination
0); // Destination offset
Then we copy the timestamps to a mappable buffer, which allows us to access them from the CPU side.
commandEncoder.copyBufferToBuffer(
    timestampBuffer,        // GPUBuffer we copy from
    0,                      // Start at the beginning of the source buffer
    timestampCopyBuffer,    // GPUBuffer we copy to
    0,                      // Starting at the beginning of the destination
    timestampBuffer.size);  // Copy the full contents of the source buffer
Now we can map the timestampCopyBuffer to the CPU.
await timestampCopyBuffer.mapAsync(GPUMapMode.READ);
Wrap the buffered data in a typed array to make it available to JavaScript. In this case we have 64-bit integers representing nanosecond timestamps.
const timestampArrayBuffer = timestampCopyBuffer.getMappedRange();
const timestampNanoseconds = new BigInt64Array(timestampArrayBuffer);
The difference between the timestamps is the time consumed by our shader.
deltaT = timestampNanoseconds[1] - timestampNanoseconds[0];
Interestingly, this is on the order of microseconds on my middling test system, so it is very small compared with the total time needed for a frame.
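Because BigInt64Array elements are BigInts, the delta needs converting to a regular Number before formatting it for display. One way to sketch this (the helper name is ours); dividing in BigInt first avoids precision loss for very large deltas:

```javascript
// Sketch: convert a BigInt nanosecond delta to a Number of milliseconds.
// The whole-millisecond part is divided off in BigInt arithmetic; the
// sub-millisecond remainder is small enough for an exact Number cast.
function nanosToMillis(deltaNanos) {
  const NANOS_PER_MILLI = 1000000n;
  const whole = deltaNanos / NANOS_PER_MILLI; // BigInt milliseconds
  const frac = deltaNanos % NANOS_PER_MILLI;  // BigInt remainder in nanoseconds
  return Number(whole) + Number(frac) / 1e6;  // milliseconds as a Number
}
```

For example, a deltaT of 1500000n nanoseconds converts to 1.5 milliseconds, suitable for toFixed formatting in the display.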
Of course, don't forget to return the memory to the CPU for later use.
timestampCopyBuffer.unmap();
Now that we have made some performance measurements, we see that the compute shader is actually very fast; however, the frame rate leaves much to be desired. In the next section we look at improving our use of the WebGPU API to improve performance.
We also see that timestamp queries have a significant performance impact. Now that we know the compute shaders are performant, we will remove the timing code from our simulation. In general, timestamp queries should be used only during development, while you are tuning the shaders, and certainly not carried over to production.
Task Manager
The Windows Task Manager also provides some insight into graphics card performance. Open the Task Manager, then select the Performance tab. Along the right side of the Performance tab, select the GPU used by the simulations. Many systems have both an integrated and a discrete GPU, and allow you to set which of them will be used by your browser. In the worst case you can simply watch the GPU activity and pay attention to the one that is active when you run the simulation.
Original Performance
[Figure: Task Manager GPU utilization graphs for the untuned simulation]
We see the graphics engine is busy for the entire 60-second duration of our plot, and the copy engine is lightly busy as well. It will be interesting to compare these with later results from tuned versions of the code.
