# Speed Bumps

## Floating Point Textures

The main issues with GPGPU are clustered around floating point textures.

• Creating floating point textures
• Writing floating point textures
• Reading floating point textures back from the GPU

This is as expected because floating point textures, while central to GPGPU processing, are well away from the mainstream in computer graphics. Luckily, because these issues are closely related, their solutions are also closely related. If we can't complete a step with floating point textures, we find a way of representing floating point numbers in textures that we can use.

In many cases we will run the simulation and display the results entirely on the GPU. For these cases, the inability to read floating point textures back from the GPU will not be relevant. Indeed, I expect that some will consider this section to be superfluous. It may be unnecessary if you are only interested in the numerical techniques, or if your target audience consistently uses more capable systems. However, for the instructional designer or developer seeking to reach a broad audience this material is essential.

This diagram covers the main elements of deciding whether you need to use unsigned byte textures, and the process for incorporating them into your project.

Luckily, each decision point in this process corresponds to a test we already have in place. This means that the process can be automated, with different programs loaded depending on the results of our tests.

The case where we always use floating point textures corresponds to the code example that we have already developed. Our next step will address the case where readPixels fails to read floating point pixels back from the GPU even though we are able to read and write floating point values in the fragment shader. This, by the way, corresponds to the case with my cell phone, which has an Adreno 220 GPU. It requires us to pack each floating point number from the output texture into an UNSIGNED_BYTE RGBA texture element (texel). Then we read those texels back onto the CPU and reconstitute the floating point numbers.

The final path covers situations where we can not use floating point textures at all. Either the OES_texture_float extension is not available and we can not create floating point textures, or checkFramebufferStatus reports an incomplete framebuffer and we can not write (render) the results of our calculations to a floating point texture. In either case we store all our intermediate results in UNSIGNED_BYTE RGBA textures. At the end of every step we convert our floating point results into unsigned bytes, then at the beginning of the next step we convert from unsigned bytes back to floating point numbers. These repeated conversions at each step of the computation can have a high cost in terms of performance.

When we use unsigned byte textures, we want to convert data in a floating point format to and from RGBA unsigned bytes without losing information, that is, with little if any loss of precision.

You may notice that the bytes are in ABGR rather than RGBA order. This is because Intel, ARM, and most target platforms are little endian, so the byte order is reversed in memory.
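
As a quick check of this byte order, we can reinterpret four bytes as a 32 bit float in JavaScript. This is a standalone sketch, not part of the project code.

```javascript
// The bytes of 1.0 in IEEE 754 are 0x3F 0x80 0x00 0x00, most significant first.
// On little-endian hardware they sit in memory in reversed order, which is why
// the packed result reads as ABGR.
const bytes = new Uint8Array([0x00, 0x00, 0x80, 0x3F]);
const value = new Float32Array(bytes.buffer)[0];
console.log(value); // 1 on little-endian hardware
```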

Once we understand what the bits in a floating point number really mean, extracting the bits is not too difficult. An IEEE 754 floating point representation maps to a number by

$(-1)^{\text{sign}} \, 2^{(\text{exponent} - 127)} \left(1 + \sum_{i=1}^{23} b_{23-i} \, 2^{-i}\right)$

If WebGL supported bitwise operations, decomposing this might even be easy. Still, we can decompose it using purely mathematical operations.

We can start with a snippet of code to extract the sign from our floating point value.


```glsl
float sgn;
sgn = step(0.0, -value);
value = abs(value);
```


The step function returns 0 if the second argument is less than the first; otherwise it returns 1. But the really interesting question here is, "Why use a function rather than an if or ternary operator?" Generally, GPUs are structured to run the same code in exactly the same way in parallel on multiple shader processors. As a result, ifs and branches are not handled well and should be avoided when reasonable. GLSL has a number of functions, such as step and clamp, that can be used to produce the same effect as conditional execution.
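
To see what this does off the GPU, here is the same branch-free sign extraction in plain JavaScript, with a stand-in for GLSL's step. The helper is mine, not part of the shader.

```javascript
// JavaScript stand-in for GLSL's step(edge, x): 0.0 if x < edge, otherwise 1.0.
const step = (edge, x) => (x < edge ? 0.0 : 1.0);

let value = -3.5;
const sgn = step(0.0, -value); // 1 for negative values, 0 for positive ones
value = Math.abs(value);
console.log(sgn, value); // 1 3.5
```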

The next step is to extract the exponent. The number is generally $2^{(\text{exponent} - 127)} \times$ some factor less than $2$. So, if we take $\lfloor \log_2(\text{value}) \rfloor$ we are left with $\text{exponent} - 127$.


```glsl
float exponent;
exponent = floor(log2(value));
```


And we can immediately get the mantissa bits with $\frac{\mathrm{value}}{{2}^{\mathrm{exponent}}}-1$


```glsl
float mantissa;
mantissa = value*pow(2.0, -exponent) - 1.0;
```
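
Working through a concrete value may help. This is a throwaway check in JavaScript, not shader code. For $6.5 = 2^2 \times 1.625$ we expect an exponent of 2 and a mantissa of 0.625.

```javascript
const value = 6.5;                                       // 6.5 = 2^2 * 1.625
const exponent = Math.floor(Math.log2(value));           // 2
const mantissa = value * Math.pow(2.0, -exponent) - 1.0; // 0.625
console.log(exponent, mantissa); // 2 0.625
```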


Now that we have recovered the mantissa, we can add the offset, or bias, back onto the exponent to recover the exponent bits as they are stored.


```glsl
exponent = exponent + 127.0;
```


We can now set the alpha (a) byte of our result to the sign bit and the first seven bits of the exponent. We multiply by $2^7 = 128$ to shift the sign bit seven bits to the left, and divide by $2$ to shift the exponent one bit to the right.


```glsl
vec4 result = vec4(0,0,0,0);
result.a = 128.0*sgn + floor(exponent/2.0);
```


The second (b) byte is the last bit of the exponent followed by the first seven bits of the mantissa.


```glsl
result.b = (exponent - floor(exponent/2.0) * 2.0) * 128.0 + floor(mantissa * 128.0);
```


The green (g) byte holds the next eight bits of the mantissa.


```glsl
result.g = floor((mantissa - floor(mantissa * 128.0) / 128.0) * 32768.0);
```


At this point we should realise that we are recomputing multiple expressions, which is something that calls out for refactoring.


```glsl
vec4 result = vec4(0,0,0,0);

result.a = floor(exponent/2.0);
exponent = exponent - result.a*2.0;
result.a = result.a + 128.0*sgn;

result.b = floor(mantissa * 128.0);
mantissa = mantissa - result.b / 128.0;
result.b = result.b + exponent*128.0;

result.g = floor(mantissa*32768.0);
mantissa = mantissa - result.g/32768.0;

result.r = floor(mantissa*8388608.0);
```


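To convince ourselves the packing is right, we can transliterate the GLSL above into JavaScript and compare it with the bytes the CPU itself produces for the same float. This is a verification sketch; packFloat is a name invented for the check, and it assumes a normalized, nonzero input.

```javascript
// A transliteration of the GLSL packing so we can check it against the
// platform's own IEEE 754 encoding.
function packFloat(value)
{
  const sgn = value < 0.0 ? 1.0 : 0.0;
  value = Math.abs(value);

  let exponent = Math.floor(Math.log2(value));
  let mantissa = value * Math.pow(2.0, -exponent) - 1.0;
  exponent = exponent + 127.0;

  const result = { r: 0.0, g: 0.0, b: 0.0, a: 0.0 };

  result.a = Math.floor(exponent / 2.0);
  exponent = exponent - result.a * 2.0;
  result.a = result.a + 128.0 * sgn;

  result.b = Math.floor(mantissa * 128.0);
  mantissa = mantissa - result.b / 128.0;
  result.b = result.b + exponent * 128.0;

  result.g = Math.floor(mantissa * 32768.0);
  mantissa = mantissa - result.g / 32768.0;

  result.r = Math.floor(mantissa * 8388608.0);

  return result;
}

// The bytes the CPU stores for the same float, in little-endian order.
const expected = new DataView(new ArrayBuffer(4));
expected.setFloat32(0, 6.5, true);

const packed = packFloat(6.5);
console.log(packed); // { r: 0, g: 0, b: 208, a: 64 }
```
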
Using this, we can build a program, ToUnsignedBytes, that converts floating point textures to unsigned byte textures. Looking at our process diagram, we see that we invoke it when the framebuffer status check passes but we are unable to read the pixels back to the CPU after the computation.


```javascript
// Tests, terminate on first failure.
success = initializer.test(  0,   0)
       && initializer.test( 10,  12)
       && initializer.test(100, 100);

if (!success)
{
  renderToUnsignedBytes(texture);
}
```


renderToUnsignedBytes uses our ToUnsignedBytes class to convert the format of the results. It closely follows the general process we outlined for the computations; the only computation it does is the format conversion.


```javascript
/**
 * Accepts a passed in texture, which is assumed to contain single floating point
 * values, and packs each texture element in the corresponding RGBA element of the
 * newly created texture.
 *
 * @param {WebGLTexture} texture A texture previously populated with floating point values.
 */
function renderToUnsignedBytes(texture)
{
  unsignedByteTexture     = gpgpUtility.makeTexture(WebGLRenderingContext.UNSIGNED_BYTE, null);
  unsignedByteFramebuffer = gpgpUtility.attachFrameBuffer(unsignedByteTexture);

  bufferStatus = gpgpUtility.frameBufferIsComplete();
  if (bufferStatus.isComplete)
  {
    unsignedByteConverter = new ToUnsignedBytes(gpgpUtility);
    unsignedByteConverter.convert(matrixColumns, matrixRows, texture);

    // Delete resources no longer in use.
    unsignedByteConverter.done();

    // Tests, terminate on first failure.
    success = unsignedByteConverter.test(  0,   0)
           && unsignedByteConverter.test( 10,  12)
           && unsignedByteConverter.test(100, 100);
  }
  else
  {
    // We can not render to an unsigned byte texture either.
  }
}
```


The test method reads this data back from the GPU, and much to our relief it is very simple.


```javascript
// One each for the R, G, B, and A components of a pixel
buffer = new Uint8Array(4);
// Read a 1x1 block of pixels, a single pixel
gl.readPixels(i,                // x-coord of lower left corner
              j,                // y-coord of lower left corner
              1,                // width of the block
              1,                // height of the block
              gl.RGBA,          // Format of pixel data.
              gl.UNSIGNED_BYTE, // Data type of the pixel data, must match makeTexture
              buffer);          // Load pixel data into buffer

floatingPoint = new Float32Array(buffer.buffer);
```


The Uint8Array(4) is an array of four unsigned eight-bit integers. This matches exactly the four unsigned bytes that we loaded into the texture. To get the floating point number back, we simply reinterpret those bytes as a floating point number with new Float32Array(buffer.buffer). On a big-endian architecture we would also need to reorder the bytes here.

To use the unsigned byte textures for computations, we will also have to read the data back from within a shader. This means reversing the process we used to pack the number into the RGBA bytes.

This time we unpack individual RGBA bytes from our texel into a floating point number.

Once again we start with the sign, which is the leftmost bit in the alpha byte.


```glsl
float sgn;
// sgn will be 0 or -1
sgn = -step(128.0, texel.a);
texel.a += 128.0*sgn;
```


Next we pull out the exponent, starting with its last bit to make it easy to trim that bit from the blue byte.


```glsl
float exponent;
exponent = step(128.0, texel.b);
texel.b -= exponent*128.0;
// Multiply by 2 => left shift by one bit.
exponent += 2.0*texel.a - 127.0;
```


The remaining unprocessed bits are the mantissa.


```glsl
float mantissa;
mantissa = texel.b*65536.0 + texel.g*256.0 + texel.r;
```


Finally we assemble all the components into the whole result, referencing the same equation we drew upon earlier.

$(-1)^{\text{sign}} \, 2^{(\text{exponent} - 127)} \left(1 + \sum_{i=1}^{23} b_{23-i} \, 2^{-i}\right)$

```glsl
float value;
// sgn is 0 or -1, so 1.0 + 2.0*sgn maps it to +1 or -1.
value = (1.0 + 2.0*sgn) * exp2(exponent) * (1.0 + mantissa*exp2(-23.0));
```
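
The whole unpacking sequence transliterates to JavaScript like this. unpackFloat is a name invented for the check, the sign is folded in at the end, and the texel is given as 0–255 byte values.

```javascript
// Follows the unpacking steps above: sign from the alpha byte, exponent from
// alpha plus one bit of blue, mantissa from the remaining 23 bits.
function unpackFloat(texel)
{
  // Leftmost bit of the alpha byte is the sign.
  const negative = texel.a >= 128.0;
  const a = negative ? texel.a - 128.0 : texel.a;

  // Last exponent bit from the blue byte, then the seven bits from alpha.
  const lowBit = texel.b >= 128.0 ? 1.0 : 0.0;
  const b = texel.b - lowBit * 128.0;
  const exponent = 2.0 * a + lowBit - 127.0;

  // The remaining unprocessed bits are the mantissa.
  const mantissa = b * 65536.0 + texel.g * 256.0 + texel.r;

  const magnitude = Math.pow(2.0, exponent) * (1.0 + mantissa * Math.pow(2.0, -23.0));
  return negative ? -magnitude : magnitude;
}

console.log(unpackFloat({ r: 0, g: 0, b: 208, a: 64 })); // 6.5
```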


The general usage pattern will be to read the texture data, unpack it, perform your calculations, then pack the results into a form that can be stored in the texture.

## Precision

Almost as common an issue is the precision supported by the GPU, that is, how many bits it uses to store floating point numbers. We have implicitly acknowledged this with a block at the beginning of our fragment shaders.


```glsl
#ifdef GL_FRAGMENT_PRECISION_HIGH
precision highp float;
#else
precision mediump float;
#endif
```


Lower precision representations simply use fewer bits to represent a number. This example shows a 16 bit IEEE float. Compare it with the 32 bit float above.

One of the most obvious differences is that the offset for the exponent is much smaller at 15.

$(-1)^{\text{sign}} \, 2^{(\text{exponent} - 15)} \left(1 + \sum_{i=1}^{10} b_{10-i} \, 2^{-i}\right)$
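
To make the smaller bias concrete, here is a decode of one 16 bit pattern in JavaScript. The bit pattern 0x4248 is an invented example, not from the text.

```javascript
// Decoding a 16 bit half float with the bias of 15, mirroring the formula above.
const halfBits = 0x4248;                  // encodes 3.140625
const sign     = (halfBits >> 15) & 0x1;  // 1 sign bit
const exponent = (halfBits >> 10) & 0x1F; // 5 exponent bits, bias 15
const mantissa = halfBits & 0x3FF;        // 10 mantissa bits
const value    = Math.pow(-1, sign) * Math.pow(2, exponent - 15) * (1 + mantissa / 1024);
console.log(value); // 3.140625
```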

It looks like we can work out these differences, in most cases, through the same ifdefs we use to declare the float precision. However, it is important to consider whether the lowered precision provides the needed accuracy for your simulations.