Matrix Multiplication

We just worked through a skeleton for a GPU compute application. Now we fill in some of the gaps and take a step forward in our understanding of GPU computations. We implement matrix multiplication as an illustration of mapping a problem onto a grid and invoking shaders to solve it. Of course, we capture everything we learn in a reusable JavaScript module.

Buffers

The Data

The first thing we need is a representation of a matrix. It is common to package the number of rows, then the number of columns, then the contents of the matrix into a single buffer. The specifics are not that important, as long as we are consistent throughout.


    // First Matrix.
    // The matrix size.
    const matrix1Size = [
        2, // number of rows
        4  // number of columns
    ];
    const matrix1Elements = [
        1, 2, 3, 4,
        5, 6, 7, 8
    ];
            

Since this is a matrix multiplication, we also need a second matrix.


    // Second Matrix size
    const matrix2Size = [
        4, // number of rows
        2  // number of columns
    ];
    // Second Matrix elements
    const matrix2Elements = [
        1, 2,
        3, 4,
        5, 6,
        7, 8
    ];
            

The Buffers

We create a buffer for each of our matrices. We'll make this a little more interesting than the standard fare by packing two data types within the same buffer. The size of the matrices is the number of rows and the number of columns, which are integers. The elements of the matrices are arbitrary floating point numbers. These two data sets will be combined into a single buffer. JavaScript has the ArrayBuffer, which is designed for exactly this case.

We create a buffer large enough to hold all of our data. The buffer is mapped at creation, so we can initialize it with our data. This works for initialization even though the buffer is not flagged as MAP_WRITE. The buffer is flagged as STORAGE, which is suited to mapping to data structures in a shader.


            // Allocate a mapped buffer for our data
            const matrixBuffer = device.createBuffer({
                mappedAtCreation: true,
                size: matrixSize.length*Uint32Array.BYTES_PER_ELEMENT
                        + matrixElements.length*Float32Array.BYTES_PER_ELEMENT,
                usage: GPUBufferUsage.STORAGE,
            });
            

Next we get an ArrayBuffer that represents the GPU buffer.


            // Get the raw array buffer for the mapped gpu buffer.
            const matrixArrayBuffer = matrixBuffer.getMappedRange();
            

We populate the first two words of the buffer with the row and column count of the matrix. This Uint32Array constructor takes an underlying ArrayBuffer, a byte offset for the beginning of the Uint32Array, and the number of entries in the array. The matrix size is at the beginning of the buffer, so the byte offset is 0, and we have two entries in our size array, so the number of entries is two.


            // The first two elements of the buffer are the matrix rows and cols, a Uint32[2]
            new Uint32Array(matrixArrayBuffer, 0, matrixSize.length).set(matrixSize);
            

The remainder of the buffer is filled with the matrix elements. Float32Array has a similar constructor to the one used for Uint32Array. This time, though, the offset is 8 bytes, allowing room for two 32-bit integers. For a count we just use the number of elements in the matrix. Looking back at the original creation of the buffer, we see that this matches exactly the amount of space we allocated.


            // The remainder is the matrix elements
            new Float32Array(matrixArrayBuffer,
                matrixSize.length*Uint32Array.BYTES_PER_ELEMENT,
                matrixElements.length).set(matrixElements);
            

This gives us one ArrayBuffer with all of our data.

Ints and Floats mixed in an ArrayBuffer.
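This packing scheme is easy to experiment with outside of WebGPU. The following sketch runs in plain JavaScript, with no GPU involved, and mirrors the mapped-buffer code above:

```javascript
// Pack a 2x4 matrix: two unsigned ints for the size, followed by the
// float elements, all sharing one ArrayBuffer.
const size = [2, 4];                       // rows, columns
const elements = [1, 2, 3, 4, 5, 6, 7, 8];

const byteLength = size.length * Uint32Array.BYTES_PER_ELEMENT
                 + elements.length * Float32Array.BYTES_PER_ELEMENT;
const packed = new ArrayBuffer(byteLength);

// Layer typed-array views over the shared buffer, exactly as we did
// for the mapped GPU buffer.
new Uint32Array(packed, 0, size.length).set(size);
new Float32Array(packed, size.length * Uint32Array.BYTES_PER_ELEMENT,
                 elements.length).set(elements);

// Fresh views over the same bytes recover both data types intact.
const sizeView = new Uint32Array(packed, 0, 2);
const elementsView = new Float32Array(packed, 8, 8);
```

The same two-view technique works in reverse when we read results back from the GPU later in this article.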

Shortly, we will see how to map these buffers onto data structures within the shader.

We need a third buffer for the results of the matrix multiplication. This buffer is flagged as STORAGE, and also COPY_SRC. We expect, then, that this buffer will be copied to another.


            // Result Matrix
            // Two 32-bit size words plus one 32-bit float per element.
            const resultMatrixBufferSize = Float32Array.BYTES_PER_ELEMENT
                    * (2 + firstMatrixSize[0] * secondMatrixSize[1]);
            const resultMatrixBuffer = device.createBuffer({
                size: resultMatrixBufferSize,
                usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
            });
            

And here we have a fourth buffer, flagged as COPY_DST and MAP_READ. We will copy the results from resultMatrixBuffer into this buffer, then map it for reading. We introduce another buffer for this because the MAP_READ flag can only be combined with COPY_DST. If we could, we would map the resultMatrixBuffer directly, but that is not allowed.


            const gpuReadBuffer = device.createBuffer({
                size: resultMatrixBufferSize,
                usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
            });
            

A Compute Shader

WebGPU introduces a new shader language, WGSL, the WebGPU Shading Language. Luckily, it is not too dissimilar to other shading languages.

Structs

The first thing you see in the WGSL code is the Matrix struct. The struct provides a convenient way to group related fields together. Structs also set up a mapping between WebGPU buffers and shader variables. In this case we have size and elements fields. Looking back at the data for our matrix buffers, we see that the first two numbers are the size of the matrix, and the remainder of the buffer holds the elements of the matrix. The elements field has an unbounded size, which is only allowed for the last field of a struct.


            struct Matrix {
                size : vec2u,        // The first two entries in the data are the matrix dimensions.
                elements: array<f32> // The last element of a struct can have an undefined size.
            }
            
This ArrayBuffer corresponds to the shader's Matrix struct.

Attributes

Marked by an @, a WGSL attribute modifies the object it is attached to. The ones in use here specify interactions between the shader and the larger WebGPU API.

For example, to populate this instance of the Matrix struct, we will set up firstMatrixBuffer as binding 0 in bind group 0.


            @group(0) @binding(0) var<storage, read> firstMatrix : Matrix;
            

Address Space and Access Mode

This variable declaration also contains an address space and an access mode. var&lt;storage, read&gt; shows that firstMatrixBuffer and secondMatrixBuffer are bound as STORAGE buffers, and that we will read but not write the data.

Entry Point

Another interesting attribute is @compute, which marks the entry point for a compute shader. Along with @compute, we have @workgroup_size(8, 8), which specifies how the computations are logically grouped on the GPU.

An 8x8 workgroup.

When we launch this compute shader, we specify the number of workgroups to launch. The 2D block makes sense since our matrices are 2D data structures. If, for example, we had a 1D problem, we might use @workgroup_size(64).

The workgroup size is actually three-dimensional; any unspecified dimensions default to one. So a size of (8, 8) is really (8, 8, 1), and (64) is (64, 1, 1).

Rough sketch of GPU architecture.

Workgroups make more sense when viewed in light of the GPU architecture. GPUs have a large number of processing units. However, these processing units are not independent of one another. Generally they are organized into groups with a common controller and a small local memory. These groups execute the same instructions at the same time, just with different data. Common group sizes are 32 compute units on NVIDIA hardware, where the group is called a warp, and 64 units on AMD hardware, where it is called a wavefront.

The reason this is important to us is that each warp or wavefront runs at most one workgroup, or part of one. This means that if we select a small workgroup size, we leave a large number of compute units idle. For example, with a workgroup size of 1, a 4x4 matrix multiplication would occupy a single compute unit in each of 16 warps, with the remaining compute units left idle. The 64-invocation workgroup size we use is a common and flexible choice.
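To put rough numbers on this, here is some back-of-the-envelope arithmetic. It assumes hypothetical 32-lane warps and that a workgroup never shares a warp with another workgroup; the real mapping varies by vendor and driver.

```javascript
const warpLanes = 32;       // assumed warp width
const resultCells = 4 * 4;  // invocations needed for a 4x4 result

// Workgroup size 1: every cell becomes its own workgroup, and each
// workgroup occupies (at least) one warp by itself.
const warpsAtSize1 = resultCells; // 16 warps
const utilizationAtSize1 = resultCells / (warpsAtSize1 * warpLanes); // 1/32

// Workgroup size 64 (8x8): all 16 cells fit in one workgroup, which
// spans two full warps.
const warpsAtSize64 = Math.ceil(64 / warpLanes); // 2 warps
const utilizationAtSize64 = resultCells / (warpsAtSize64 * warpLanes); // 16/64
```

Under these assumptions, utilization rises from roughly 3% to 25%, and a larger matrix would fill the 8x8 workgroup completely.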

A 2x2 collection of 8x8 workgroups for a 12x14 matrix.

Now we see the full compute shader; the actual matrix multiplication is pretty straightforward.


    struct Matrix {
      size : vec2u,        // The first two entries in the data are the matrix dimensions.
      elements: array<f32> // The last element of a struct can have an undefined size.
    }

    // Input, read only, matrices.
    @group(0) @binding(0) var<storage, read> firstMatrix : Matrix;
    @group(0) @binding(1) var<storage, read> secondMatrix : Matrix;
    // Output, writable, matrix.
    @group(0) @binding(2) var<storage, read_write> resultMatrix : Matrix;

    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) global_id : vec3u) {
      // Skip invocations when work groups exceed the actual problem size
      if (global_id.x >= firstMatrix.size.x || global_id.y >= secondMatrix.size.y) {
        return;
      }

      resultMatrix.size = vec2u(firstMatrix.size.x, secondMatrix.size.y);

      // Some constants in the loop
      let resultCell = vec2(global_id.x, global_id.y);
      let firstMatrixSizeY = firstMatrix.size.y;
      let secondMatrixSizeY = secondMatrix.size.y;
      let resultXOffset = resultCell.x * firstMatrixSizeY;
      var result = 0.0;
      for (var i = 0u; i < firstMatrixSizeY; i = i + 1u) {
        let a = i + resultXOffset;
        let b = resultCell.y + i * secondMatrixSizeY;
        result = result + firstMatrix.elements[a] * secondMatrix.elements[b];
      }

      let index = resultCell.y + resultCell.x * secondMatrixSizeY;
      resultMatrix.elements[index] = result;
    }
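As a sanity check on the shader's index arithmetic, we can reproduce the same row-major computation in plain JavaScript. This is an illustrative CPU reference, not part of the module:

```javascript
// CPU reference for the shader: element (row, col) of a matrix with
// `cols` columns lives at index row * cols + col.
function multiplyCpu(aSize, a, bSize, b) {
    const [aRows, aCols] = aSize;
    const bCols = bSize[1];
    const out = new Float32Array(aRows * bCols);
    for (let row = 0; row < aRows; row++) {
        for (let col = 0; col < bCols; col++) {
            let sum = 0;
            for (let i = 0; i < aCols; i++) {
                // Mirrors the shader: a = i + row * aCols, b = col + i * bCols
                sum += a[i + row * aCols] * b[col + i * bCols];
            }
            out[col + row * bCols] = sum;
        }
    }
    return out;
}

// The example matrices from the start of the article.
const product = multiplyCpu([2, 4], [1, 2, 3, 4, 5, 6, 7, 8],
                            [4, 2], [1, 2, 3, 4, 5, 6, 7, 8]);
// product is Float32Array [50, 60, 114, 140]
```

Running this against the GPU output is an easy way to catch an index mix-up in the shader.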
            

We provide the shader to the GPU through a shader module. The shader module simply takes the shader source and an optional label. We could have assigned the shader source to a variable, or we can include it directly in the shader module. This shader module, containing a compute shader entry point, will later be included in a compute pipeline.


            shaderModule = device.createShaderModule({
                label: 'matrix multiply shader',
                code: shaderSrc
            });
            

Next we need to associate the buffers we have defined with the data structures within the shader. This is done with a bind group layout. All of our input data structures are annotated with @group(0) so we only have one bind group, 0, and we only have one layout. The binding element in each layout entry corresponds to the @binding in the shader. We see all the entries are visible to the compute shader. Entries 0 and 1 are the input matrices and read only. Entry 2 simply states a buffer type of "storage", which is a writable storage type.


            // ID, consumer and type for each buffer.
            bindGroupLayout = device.createBindGroupLayout({
                entries: [
                    {
                        binding: 0,
                        visibility: GPUShaderStage.COMPUTE,
                        buffer: {
                            type: "read-only-storage"
                        }
                    },
                    {
                        binding: 1,
                        visibility: GPUShaderStage.COMPUTE,
                        buffer: {
                            type: "read-only-storage"
                        }
                    },
                    {
                        binding: 2,
                        visibility: GPUShaderStage.COMPUTE,
                        buffer: {
                            type: "storage"
                        }
                 }
                ]
            });
            

The pipeline specifies roughly what we need to run a single instance of our compute shader. We specify the bind group layout, which enumerates the types and availability of our resources. The compute pipeline also specifies the shader module, and the entry point within the shader to use for these computations. Our shader has a single entry point, main; more complex shaders can have multiple entry points.


            computePipeline = device.createComputePipeline({
                layout: device.createPipelineLayout({
                    bindGroupLayouts: [bindGroupLayout]
                }),
                compute: {
                    module: shaderModule,
                    entryPoint: "main"
                }
            });
            

Commands

Now that we have all our ducks, or rather resources, in a row, we can start doing things. We issue commands to the GPU, starting with a command encoder.


            const commandEncoder = device.createCommandEncoder();
            

A compute pass contains the commands needed to run a compute shader. The first thing we do is to identify the resources that we need. The compute pipeline is associated with this compute pass, and we identify our bindGroup as bind group 0.

One of the trickier parts is the dispatchWorkgroups call. The arguments are counts of workgroups in the x, and optionally y and z, directions. We saw earlier an example of 8x8 workgroups laid out for a 12x14 matrix. In this example, our result is a 2x2 matrix, so it fits easily in a single workgroup.


            const passEncoder = commandEncoder.beginComputePass();
            passEncoder.setPipeline(computePipeline);
            passEncoder.setBindGroup(0, bindGroup);
            const workgroupCountX = Math.ceil(firstMatrixSize[0] / 8);
            const workgroupCountY = Math.ceil(secondMatrixSize[1] / 8);
            passEncoder.dispatchWorkgroups(workgroupCountX, workgroupCountY);
            passEncoder.end();
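For the 12x14 matrix from the earlier figure, the same ceiling arithmetic produces the 2x2 grid of workgroups pictured there. A quick sketch of the numbers:

```javascript
const rows = 12, cols = 14;
const workgroupDim = 8; // matches @workgroup_size(8, 8)

const countX = Math.ceil(rows / workgroupDim); // 2
const countY = Math.ceil(cols / workgroupDim); // 2

// 2x2 workgroups of 8x8 invocations each.
const launched = countX * countY * workgroupDim * workgroupDim; // 256
const useful = rows * cols; // 168; the other invocations return early
                            // at the bounds check in the shader
```

This is why the shader begins with the global_invocation_id bounds check: a dispatch almost always launches more invocations than there are result cells.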
            

The last entry in our command stream copies our results into a mappable buffer. This is in keeping with our specific scenario, where we carry out computations and examine the results. Other scenarios may run multiple iterations between readbacks of results, or there may be a ping-pong between computations and visualizations without ever fetching any results.


            // Encode commands for copying buffer to buffer.
            commandEncoder.copyBufferToBuffer(
                resultMatrixBuffer,    // source buffer
                0,                     // source offset
                gpuReadBuffer,         // destination buffer
                0,                     // destination offset
                resultMatrixBufferSize // number of bytes to copy
            );
            

Now that we have all the commands, we finish with the command encoder. This yields a GPUCommandBuffer, which contains the recorded commands in a form to be submitted to the GPU.


            // Submit GPU commands.
            const gpuCommands = commandEncoder.finish();
            device.queue.submit([gpuCommands]);
            

To get the results of our operation, we request that ownership of the gpuReadBuffer be transferred back to the CPU. We have to wait for this operation to complete, and it doesn't complete until all the prior operations on this buffer are finished. For us, this means it waits for the matrix multiplication to complete, and for the finished result to be copied into the mappable buffer.


            // Read buffer.
            await gpuReadBuffer.mapAsync(GPUMapMode.READ);
            const arrayBuffer = gpuReadBuffer.getMappedRange();
            

The first thing we do is make a local copy of the array buffer. This is independent of the GPU mapped buffer, which we can then dispose of.


            const arrayBufferLocal = arrayBuffer.slice();
            

Promptly dispose of any GPU resources we can. In the context of the MatrixMultiplier these resources will be recreated with the next compute invocation.


            gpuReadBuffer.unmap();
            firstMatrixBuffer.destroy();
            secondMatrixBuffer.destroy();
            resultMatrixBuffer.destroy();
            gpuReadBuffer.destroy();
            

This should look familiar from the setup. We layer a 2 element Uint32Array over the beginning of the ArrayBuffer for the matrix size, and allocate the remainder of the ArrayBuffer to a Float32Array containing the matrix elements.


            const resultMatrix = {
                size: new Uint32Array(arrayBufferLocal, 0, 2),
                elements: new Float32Array(arrayBufferLocal, 2*Uint32Array.BYTES_PER_ELEMENT,
                                            firstMatrixSize[0] * secondMatrixSize[1])
            };
            

Now we use a bit of MathML to display the matrices, and everything looks great.

Interesting things happen when we move beyond integers. Any real world scenario requires floating point numbers. Computers are surprisingly limited in their representation of, and precision in dealing with, floating point numbers. Keeping this in mind, we can take the 32-bit GPU compute shader floating point representation as accurate enough for illustrative simulations, but be circumspect about its use in real world applications. Luckily, there are 64-bit and even arbitrary precision options available when they are appropriate.

Just for fun, let's see what happens when we use JavaScript's intrinsic 64-bit arithmetic.

There is noticeable improvement in the accuracy of our results, even for such a simple calculation.
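We can imitate the GPU's 32-bit arithmetic on the CPU with Math.fround, which rounds a value to single precision. The comparison below is an illustrative sketch of why the 64-bit results come out more accurate:

```javascript
// Dot product with every intermediate rounded to 32-bit precision.
function dotF32(a, b) {
    let sum = Math.fround(0);
    for (let i = 0; i < a.length; i++) {
        sum = Math.fround(sum + Math.fround(Math.fround(a[i]) * Math.fround(b[i])));
    }
    return sum;
}

// The same dot product in JavaScript's native 64-bit arithmetic.
function dotF64(a, b) {
    let sum = 0;
    for (let i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

// 0.1 is not exactly representable in binary floating point at either
// width, and the rounding error accumulates faster at 32 bits.
const xs = new Array(1000).fill(0.1);
const f32 = dotF32(xs, xs);
const f64 = dotF64(xs, xs);
// The exact answer is 10; f64 lands much closer to it than f32.
```

A matrix multiplication is just a grid of dot products like this one, so the same accumulation error shows up in each cell of the GPU result.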

References

https://developer.chrome.com/docs/capabilities/web-apis/gpu-compute
https://public.websites.umich.edu/~smeyer/cuda/Preso07-OpenCL.pdf (important concepts for GPU compute)
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.