Hey guys, I want to share how you can write your first GPU kernel using Metal. So, without further ado, let's do this.
Our example will be straightforward but will lay the foundation needed. All our code will do is copy data from one buffer to another.
These are the steps we will take:
- Set up an Xcode project for compute kernels
- Initialize our device, library, and command queue
- Initialize and load buffers with data
- Launch the compute kernel
- Write the compute kernel
Set up Xcode for Compute Kernels
Open up Xcode and select File > New > Project. Next, select "Command Line Tool." Give your project a name and choose "Objective-C" as the language. (If you prefer to use Swift, do so. The sample project is simple enough that you should be able to follow along.)
In the main.m file, make sure to import these two headers:
#import <Metal/Metal.h>
#import <simd/simd.h>
Initialize the Device, Library, and Command Queue
The first thing we need to do is to get a handle on our GPU device. In Metal, this is done as follows:
id<MTLDevice> device=MTLCreateSystemDefaultDevice();
Once we have a device, we can create an MTLLibrary object. The default library holds all the Metal functions that Xcode compiled into the app at build time, including our kernel. A library is created as shown below:
id<MTLLibrary> defaultLibrary=[device newDefaultLibrary];
Now that we have a library object, we can create a reference to our kernel. This is done with an MTLFunction object. So, let's get a reference to our kernel:
id<MTLFunction> kernel=[defaultLibrary newFunctionWithName:@"inOutExample"];
The kernel function we will write later is named "inOutExample."
We can now create a compute pipeline state with our kernel reference. This compute pipeline will later be attached to our encoder. Passing an NSError pointer instead of nil lets us inspect why pipeline creation failed, if it does:
NSError *error=nil;
id<MTLComputePipelineState> kernelPSO=[device newComputePipelineStateWithFunction:kernel error:&error];
Ok, so we have a device, a library, and a kernel reference. Next, we need to get a reference to a Command Queue. The Command Queue will be responsible for sending our work to the GPU. A command queue is created using the following line of code:
id<MTLCommandQueue> commandQueue=[device newCommandQueue];
Load Data into Buffers
We are almost done with our initialization routine. We need to create two buffers and load them with data. First, define how many elements each buffer will hold (we use 1024), then create the buffers:
const uint bufferSize=1024;
id<MTLBuffer> inBuf=[device newBufferWithLength:bufferSize*sizeof(uint) options:MTLResourceStorageModeShared];
id<MTLBuffer> outBuf=[device newBufferWithLength:bufferSize*sizeof(uint) options:MTLResourceStorageModeShared];
//load buf with simple data
uint *dataIn=(uint*)inBuf.contents;
for (int i=0; i<bufferSize; i++) {
dataIn[i]=i;
}
Ok, so our initialization is all done.
Launch the Compute Kernel
Next, we are going to launch our compute kernel. To do that, we first need to create a command buffer as shown below:
//create a command buffer
id<MTLCommandBuffer> commandBuffer=[commandQueue commandBuffer];
We need to attach the compute pipeline we created earlier to an encoder. So, let's create an encoder and set the pipeline on it. Note that the compute encoder is created using our newly created command buffer object.
id<MTLComputeCommandEncoder> computeEncoder=[commandBuffer computeCommandEncoder];
//encode the pipeline state object
[computeEncoder setComputePipelineState:kernelPSO];
We must tell our kernel how to find the two buffers we created earlier. We do so by binding each buffer to an index on the command encoder:
[computeEncoder setBuffer:inBuf offset:0 atIndex:0];
[computeEncoder setBuffer:outBuf offset:0 atIndex:1];
So, inBuf will be referenced using index 0, and outBuf using index 1. Once we start writing the kernel, you will see what I mean.
Ok, finally, we need to dispatch our threads. We are going to dispatch one thread per buffer element (1024 in total), grouped into threadgroups whose width matches the pipeline's threadExecutionWidth (32 on many Apple GPUs). I'm not going to go into how to calculate thread group sizes in this post since it can get confusing fast, so copy the lines of code shown below:
NSUInteger width=kernelPSO.threadExecutionWidth;
MTLSize threadsPerThreadgroup=MTLSizeMake(width,1,1); //(32,1,1) on many GPUs
MTLSize threadsPerGrid=MTLSizeMake(bufferSize,1,1); //(1024,1,1)
[computeEncoder dispatchThreads:threadsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
Finally, we have to stop encoding and commit our command buffer. Since we want to print results, we need to wait until the GPU has finished before reading our data back. All of this is implemented as shown below:
//end compute pass
[computeEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
//print results
uint *outBufData=(uint*)outBuf.contents;
for (int i=0; i<bufferSize; i++) {
NSLog(@"%i",outBufData[i]);
}
Ok, so we are done with the initialization and dispatching of our kernel. Now, we need to write the kernel itself.
Write your first Compute Function
First, create a new Metal file via File > New > File and choose the "Metal File" template.
All compute function declarations must begin with the kernel keyword. In our example, the parameters point to the buffer data we bound earlier. Moreover, since the buffer data lives in device memory, we must declare the buffer pointers with the device address space qualifier.
kernel void inOutExample(device const uint *inBuf [[buffer(0)]],
                         device uint *outBuf [[buffer(1)]],
                         uint gid [[thread_position_in_grid]],
                         uint threadid [[thread_position_in_threadgroup]],
                         uint blockDim [[threads_per_threadgroup]],
                         uint blockid [[threadgroup_position_in_grid]]){
    uint i=blockid*blockDim+threadid; //for a 1D grid this equals gid
    outBuf[i]=inBuf[i];
}
inBuf is referenced using index 0, whereas outBuf is referenced using index 1.
Our kernel is simple. It copies data from one buffer to the other using an index that identifies the current thread in the grid. Keep in mind that we dispatched one thread per element, 1024 in total, so each thread copies exactly one entry of the array.
So we are all done. Hit Run and check the output window in Xcode. You should see the numbers 0-1023 printed in the output log, showing that the input buffer's data was copied into the output buffer.
Congrats, you just wrote and set up your first compute kernel function.
You can find the complete code on my Patreon page.