First OpenCL program using C++ and CMake


After completing the installation we have all the libraries and SDKs for the different vendors. To test that everything works, we will write a simple program that adds each pair of floats in a two-dimensional array and stores the sums in another array called result. The first GPU device will be used to perform the computation.

The code is shown in full and contains comments to help understand each section. The guide assumes that the reader understands the core concepts of OpenCL.

Note:

This page does not go into detail about how OpenCL works. If you want to understand in detail what the functions and structures do, I recommend reading OpenCL in Action: How to Accelerate Graphics and Computation by Matthew Scarpino (ISBN 9781617290176). The author does a good job of explaining the core concepts of OpenCL.

Kernel addition function

The kernel source is the function that gets executed on the target device. For this example we are specifically targeting the GPU, as requested by the constant CL_DEVICE_TYPE_GPU when we collect the devices with platform.getDevices.

This kernel reads the two-dimensional array data from global memory as pairs of floats (float2) and adds the two values of each pair. It then stores each sum in the result array in global memory, which the host reads back later.

This file is called simple_add.cl

/**
 * Kernel function to sum two arrays
 **/
__kernel void simple_add(__global float2 *data, __global float *result){
    int index = get_global_id(0);
    result[index] = data[index].s0 + data[index].s1;
}
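
To check the values coming back from the device, it can help to compute the same sums on the host. The following is a minimal sketch of such a reference, not part of the original program: the function name simple_add_reference and the use of std::array for each pair are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the simple_add kernel:
// each output element is the sum of one pair of input floats.
std::vector<float> simple_add_reference(const std::vector<std::array<float, 2>>& data) {
    std::vector<float> result(data.size());
    // The loop index plays the role of get_global_id(0) in the kernel.
    for (std::size_t index = 0; index < data.size(); ++index) {
        result[index] = data[index][0] + data[index][1];
    }
    return result;
}
```

On the device each iteration of this loop would be one work item, so the loop bound corresponds to the global range used when the kernel is enqueued.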

Host program

The host includes the opencl.hpp header to get the OpenCL structures, definitions and methods.
First the platforms are obtained and stored in a vector using the method cl::Platform::get. Only the first platform will be selected.
From this platform we will collect only the GPUs and store them in a vector by using platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
We will only use the first device to perform the addition, and after printing its CL_DEVICE_NAME we will create a context containing only this device.
After that we will use a file stream to read the kernel source into a std::string. This source will be passed as an argument to the cl::Program program{context, src, false, &err}; constructor, and once constructed the program will be built.
If the build fails, the errors are displayed; if it succeeds, a simple two-dimensional array will be created to store the values to add, and a result array to store the results.
Two buffers will be created, one to send the data to the device and another to read the result from the device.
The kernel will be created and the two buffers will be set as its arguments. The buffers live in the device's global memory; data has been flagged previously as CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR and result as CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR.
A command queue for the device will be created and the kernel will be enqueued with enqueueNDRangeKernel. After this operation, enqueueReadBuffer will be performed to read the result array from the device.
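
The step that reads the kernel file can be sketched on its own with only the standard library. This is an illustrative variant rather than the code used in main.cpp: the helper name read_kernel_source is hypothetical, and it assumes C++17 for std::optional. It separates "the file could not be opened" from "the file is empty", which a plain emptiness check cannot tell apart.

```cpp
#include <fstream>
#include <iterator>
#include <optional>
#include <string>

// Hypothetical helper: returns the file contents, or std::nullopt when the
// file cannot be opened, so a missing file is not confused with an empty one.
std::optional<std::string> read_kernel_source(const std::string& path) {
    std::ifstream file{path};
    if (!file.is_open()) {
        return std::nullopt;
    }
    // Copy every character of the stream into the string.
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

A caller could print a message that includes the path when the helper returns std::nullopt, which tends to make a wrong relative path such as ../simple_add.cl easier to spot.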

This file is called main.cpp

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <CL/opencl.hpp>


/**
 * This program adds pairs of floats using OpenCL
 *
 * @author Antoni Mercer<antonimercer@lthjournal.com>
 * @return 0 on success, exits with 1 on failure
 */
int main() {
    // define vector where to store the platforms
    std::vector<cl::Platform> platforms;
    // collect the platforms available
    cl::Platform::get(&platforms);

    // make sure at least we have a platform
    if (platforms.empty()){
        std::cerr << "No platforms found!" << std::endl;
        exit(1);
    }

    // collect the first platform available
    cl::Platform platform{platforms.front()};
    // initialize vector to store the devices from the platform
    std::vector<cl::Device> devices;
    // collect the GPUs devices from the platform
    platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);

    // make sure at least a GPU is available
    if (devices.empty()){
        std::cerr << "No GPU devices found!" << std::endl;
        exit(1);
    }

    // Get the first device
    cl::Device device {devices.front()};
    std::cerr << "Device name: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;

    // create a context with only the first device
    cl_int err;
    cl::Context context{device, nullptr, nullptr, nullptr, &err};
    if (err != 0) {
        std::cerr << "Failed to create context: " << err << std::endl;
        exit(1);
    }

    // load the kernel program into an input file stream
    // TODO replace this line if the binary and the file are in the same folder
    std::ifstream sourceFile{"../simple_add.cl"};
    // dump the kernel program file stream into a string var
    std::string src{std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>())};

    // make sure the source read is the kernel
    if (src.empty()) {
        std::cerr << "Source file is empty." << std::endl;
        exit(1);
    }
    std::cout << "source read: " << std::endl << src << std::endl;

    // create a program with the context and the kernel source; pass false so it is not built during program construction
    cl::Program program{context, src, false, &err};
    if (err != 0) {
        std::cerr << "Program creation failed: " << err << std::endl;
        exit(1);
    }

    // build the program and make sure it succeeds
    err = program.build();
    if (err != CL_SUCCESS) {
        std::cerr << "Build Status: " << program.getBuildInfo<CL_PROGRAM_BUILD_STATUS>(device) << std::endl
                  << "Build Log:\t " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device) << std::endl;
        exit(1);
    }

    // create the array of numbers to add
    float data[4][2] = {
            0.1f, 0.1f,
            0.2f, 0.2f,
            0.3f, 0.3f,
            0.4f, 0.4f,
    };
    // create an empty array
    float result[4] = {};

    // create the two buffers to send the arrays to the device
    // INPUT BUFFER
    cl::Buffer dataBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(data), &data[0], &err);
    if (err != 0) {
        std::cerr << "DATA Buffer creation failed: " << err << std::endl;
        exit(1);
    }
    // create the buffer where the result will be stored.
    // OUTPUT BUFFER
    cl::Buffer resultBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(result), &result[0], &err);
    if (err != 0) {
        std::cerr << "RESULT Buffer creation failed: " << err << std::endl;
        exit(1);
    }

    // create the kernel and pass the name of the function
    cl::Kernel kernel{program, "simple_add", &err};
    if (err != 0) {
        std::cerr << "Kernel creation failed: " << err << std::endl;
        exit(1);
    }

    // add the kernel arguments
    kernel.setArg(0, dataBuffer);
    kernel.setArg(1, resultBuffer);

    // create a queue with the context and the device
    cl::CommandQueue queue(context, device);
    // enqueue the kernel without offset; the global range is one work item per pair of floats
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(sizeof(data) / sizeof(data[0])));
    // enqueue a buffer read to request the device to write the result to the host object
    queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(result), &result[0]);

    // Print out the result
    std::cout << "Vector addition result: " << std::endl;
    for (float n:result)
        std::cout << n << std::endl;

    // the program is done running
    std::cout << "Done." << std::endl;
    return 0;
}
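
Since the kernel handles one float2 per work item, the global range is the number of pairs in data, not its size in bytes. A range larger than the buffers would make work items index past their end. The sketch below shows one way to get that count from the array type; the helper name pair_count is an assumption for illustration and does not appear in the program above.

```cpp
#include <cstddef>

// Hypothetical helper: number of work items needed for an array of N float
// pairs, deduced from the array type rather than from its size in bytes.
template <std::size_t N>
constexpr std::size_t pair_count(const float (&)[N][2]) {
    return N;
}

// For float data[4][2]:
//   sizeof(data)                   == 8 * sizeof(float)  (a byte count)
//   sizeof(data) / sizeof(data[0]) == 4                  (number of pairs)
//   pair_count(data)               == 4
```

The rows of data are contiguous in memory, so the eight floats line up with four float2 values on the device, one per work item.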

CMakeLists.txt code

This file is called CMakeLists.txt

cmake_minimum_required(VERSION 3.28)
project(OpenCL_FirstProgram LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)

add_executable(${PROJECT_NAME} main.cpp)

find_package(OpenCL REQUIRED)
target_link_libraries(${PROJECT_NAME} PRIVATE OpenCL::OpenCL)

target_compile_definitions(${PROJECT_NAME} PRIVATE CL_HPP_TARGET_OPENCL_VERSION=300)

Program output

Device name: AMD Radeon RX 580 Series (polaris10, LLVM 15.0.6, DRM 3.49, 6.1.0-26-amd64)
source read:
/**
 * Kernel function to sum two arrays
 **/
__kernel void simple_add(__global float2 *data, __global float *result){
    int index = get_global_id(0);
    result[index] = data[index].s0 + data[index].s1;
}

Vector addition result:
0.2
0.4
0.6
0.8
Done.

Process finished with exit code 0
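
Instead of checking the printed values by eye, the result array could be compared against the expected sums with a small tolerance, since floating-point results are not always exact. This is a minimal sketch under that assumption: the function name all_close and the default tolerance of 1e-6 are hypothetical choices, not part of the original program.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical check: true when every element of actual is within tolerance
// of the matching element of expected.
template <std::size_t N>
bool all_close(const float (&actual)[N], const float (&expected)[N],
               float tolerance = 1e-6f) {
    for (std::size_t i = 0; i < N; ++i) {
        if (std::fabs(actual[i] - expected[i]) > tolerance) {
            return false;
        }
    }
    return true;
}
```

In main.cpp this could be called with result and an array holding 0.2f, 0.4f, 0.6f and 0.8f after the read from the device has completed.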

If you spot any typos, have questions, or need assistance with the build, feel free to contact me at: antonimercer@lthjournal.com



Technologies used

Debian 12, Linux, OpenCL, C++, AMD, Nvidia, Intel, GPU
