Getting started with JCuda


Overview



Further information:


A detailed article about GPU Computing Using CUDA, Eclipse, and Java with JCuda has been published by Mark Bishop. It is an excellent resource for further information about the setup of CUDA and JCuda on Linux platforms, and the setup of a JCuda project in Eclipse.



Introduction



CUDA provides two different APIs: The Runtime API and the Driver API. Both APIs are very similar concerning basic tasks like memory handling. In fact, starting with CUDA 3.0, both APIs are interoperable and can be mixed to some extent. However, there are some important differences. The most important difference between the Runtime API and the Driver API for JCuda is the way kernels are managed and executed:

In the original CUDA Runtime API, kernels are defined and compiled together with the C source files. The source code is compiled by the NVCC, the NVIDIA CUDA Compiler. This compiler uses another C compiler (for example, GCC or the Visual Studio compiler) to compile the plain C parts of the source code, and takes care of the compilation of the CUDA-specific parts, like the CUDA kernels and the kernel<<<...>>> calls. The result of this compilation is usually a single executable file comprising the whole program.

Of course, the NVCC cannot be used to compile a Java program. The kernel<<<...>>> call syntax cannot be used in Java, and there is no single executable file after the compilation. Thus, it is not possible to call your own CUDA kernels with the JCuda Runtime API. Instead, the JCuda Driver API has to be used, as explained in the section about Creating kernels.

The JCuda Runtime API is mainly intended for the interaction with the Java bindings of the CUDA Runtime libraries, like JCublas and JCufft. A Java programmer who only wants to use these libraries and does not want to create their own CUDA kernels can use the JCuda Runtime API for the interaction with these libraries. The Samples section contains basic example programs for each of the available runtime libraries, which may serve as starting points for your own JCuda Runtime programs.
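As a simple illustration of how closely the JCuda Runtime API resembles its CUDA counterpart, the following sketch allocates device memory, copies a host array to the device, and frees the memory again. It assumes the usual imports from the jcuda and jcuda.runtime packages; the array size of 256 elements is chosen arbitrarily:

// Allocate device memory and copy a host array to the device
float hostData[] = new float[256];
Pointer deviceData = new Pointer();
JCuda.cudaMalloc(deviceData, 256 * Sizeof.FLOAT);
JCuda.cudaMemcpy(deviceData, Pointer.to(hostData),
    256 * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyHostToDevice);

// Free the device memory when it is no longer needed
JCuda.cudaFree(deviceData);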



General setup


In order to use JCuda, you need an installation of the CUDA driver and toolkit, which may be obtained from the NVIDIA CUDA download site. (Note that there may be some delay between the release of a new CUDA version and the release of the matching JCuda version). You should first install the Developer Drivers for your operating system, and then the matching CUDA Toolkit. Please also consult the documentation from the NVIDIA site for the proper setup and installation procedure.

The SDK and code samples are not required to use JCuda, but the code examples may be helpful to get started and to see whether CUDA is working in general.

After CUDA has been properly installed, you may download the JCuda archive for your operating system from the downloads section.

For JCuda 0.8.0RC and newer versions:


The archives contain the main JAR files, as well as JAR files that contain the native libraries (which are .DLL files for Windows, .SO files for Linux and .DYLIB files for MacOS). All required JAR files have to be present in the CLASSPATH.
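For example, a program could be started with a command line like the following, where the JAR file names are only placeholders and have to be adjusted to the actual JCuda version and platform (the example uses the Windows classpath separator ';'; on Linux and MacOS, ':' has to be used instead):

    java -cp ".;jcuda-0.8.0.jar;jcuda-natives-0.8.0-windows-x86_64.jar" YourMainClass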

For earlier JCuda versions (below 0.8.0RC):


The archives contain the JAR files, and the matching native libraries (which are .DLL files for Windows, .SO files for Linux and .DYLIB files for MacOS). The JAR files have to be present in the CLASSPATH, and the native library files must be located in a path that is visible to Java. In most cases, this should either be a path that is given as the java.library.path for the JVM, or the root directory of the project. (Alternatively, they can also be located in a path that is contained in an environment variable, like the PATH environment variable on Windows or the LD_LIBRARY_PATH environment variable on Linux).
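For these older versions, a hypothetical command line could pass the directory that contains the native libraries via java.library.path. All file and path names here are only placeholders, and the example uses the Linux classpath separator ':':

    java -cp ".:jcuda-0.7.5b.jar" -Djava.library.path=/path/to/jcuda/natives YourMainClass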



Basic test


This section describes how to manually set up a minimal JCuda project from the command line, as a basic test. If you are already familiar with using JARs and native libraries from Java, you can probably skip this section and create a new JCuda project directly in your favorite IDE. Otherwise, you can create a first, minimal JCuda program as described below.
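The following is a sketch of such a test program: it only allocates and frees a few bytes of device memory and prints the resulting pointer (the class name is arbitrary). If it runs without an error, the basic JCuda setup is working:

JCudaRuntimeTest.java

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class JCudaRuntimeTest
{
    public static void main(String[] args)
    {
        // Allocate 4 bytes of device memory, print the resulting
        // pointer, and free the memory again
        Pointer pointer = new Pointer();
        JCuda.cudaMalloc(pointer, 4);
        System.out.println("Pointer: " + pointer);
        JCuda.cudaFree(pointer);
    }
}

This class can be compiled and started with the JCuda JAR files on the classpath (and, for older JCuda versions, with the native libraries visible to Java), as described in the General setup section.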
If you encounter any problem during this test, it is most likely an UnsatisfiedLinkError. In that case, consider opening a thread in the Forum, including information about the operating system, CUDA version and JCuda version that you are using. (There is a Forum FAQ entry about the UnsatisfiedLinkError which may help to solve this problem if you are using an older JCuda version).


Creating kernels


As described in the Introduction, your own CUDA kernels can be launched in JCuda using the Driver API. This section describes the basic workflow for creating and compiling a simple kernel, and for loading and executing the kernel with JCuda. Most of the information presented here applies equally to CUDA and JCuda, and more detailed information is available, for example, in the CUDA Programming Guide. This section is mainly intended as a quick start, and to point out potential differences between CUDA and JCuda.


The source code for the example described here is available as the JCudaVectorAdd example from the samples section. The sample tries to compile the kernel at runtime, but the general process of manually compiling a kernel is described here.



Writing the kernel


The kernel code is written in exactly the same way as for CUDA. Usually, the kernel code will be located in a separate file. (In the CUDA Runtime API, the kernel function is often part of a larger C file. While it is still possible to have additional C code in the same file as the kernel, this C code will be ignored and is not relevant for JCuda).

There is only one important aspect to consider: When the kernel is to be executed with the Driver API (regardless of whether it is used from CUDA or JCuda), the kernel function has to be identified and accessed by its name. But when the code is compiled with a C/C++ compiler, the name of the function will be mangled - that is, the function name will internally be modified depending on its signature, and a simple kernel function name, like "add", may be converted to a name like "_Z3addiPfS_S_". While it is still possible to access the function using this name, it is in general much easier and more intuitive to declare the kernel function as an extern "C" function. This way, the original name will be preserved. As an example, here is a kernel which performs a simple vector addition:

JCudaVectorAddKernel.cu

extern "C"
__global__ void add(int n, float *a, float *b, float *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<n)
    {
        sum[i] = a[i] + b[i];
    }
}



Compiling the kernel


The kernel source code has to be compiled with the NVCC compiler. This creates a file that can be loaded and executed using the Driver API. There are basically two options for how the kernel can be compiled: into a CUBIN file or into a PTX file. While earlier examples from the Samples section generally used CUBIN files, these have an important drawback: they are specific to the Compute Capability of the GPU. The Compute Capability is a sort of version number for the hardware, and CUBIN files that have been created for one Compute Capability cannot be loaded on a GPU with a different Compute Capability. Thus, upcoming samples will generally prefer PTX files, since they are compiled at runtime for the GPU of the target machine.

A PTX file can be created from a simple, single CUDA source code file with the following command:
    nvcc -ptx JCudaVectorAddKernel.cu -o JCudaVectorAddKernel.ptx

In order to create a valid CUBIN file, it may be necessary to specify the architecture and Compute Capability of the target machine. The full command line for creating a CUBIN file for a GPU with Compute Capability 2.1 on a 64-bit machine would be
    nvcc -cubin -m64 -arch sm_21 JCudaVectorAddKernel.cu -o JCudaVectorAddKernel.cubin

For more information about the NVCC and its command line parameters, see the documentation of the NVCC in the /doc/ directory of your CUDA Toolkit installation.



Loading and executing the kernel in JCuda


The process of loading and executing a kernel from a PTX or CUBIN file in the JCuda Driver API is the same as in the CUDA Driver API. The simplest case of executing a single kernel will be summarized here.
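Before any module can be loaded, the driver has to be initialized and a context has to be created, exactly as in the CUDA Driver API. The following sketch assumes static imports from the JCudaDriver class (as used in the samples) and simply uses the first available device:

// Initialize the driver and create a context for the first device
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);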

Then, the PTX or CUBIN file has to be loaded, and a pointer to the kernel function has to be obtained:

// Load the ptx file.
CUmodule module = new CUmodule();
cuModuleLoad(module, "JCudaVectorAddKernel.ptx");

// Obtain a function pointer to the kernel function.
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "add");

Note that the cuModuleLoad function will automatically detect the type of the specified file. So to load a CUBIN file, the same function can be used.

For calling the kernel, some of the language-specific limitations of Java become more obvious. The functions for setting up the kernel parameters had been rather difficult to use up to CUDA 3.2, and in CUDA 4.0, they have been replaced by a single function. This function receives all parameters that describe the kernel execution. Additionally, it receives all kernel parameters in a single void** pointer. A void** pointer is emulated using the Pointer class in JCuda. With this class, the setup of the kernel parameters may even be simpler in JCuda than in CUDA:

// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(
    Pointer.to(new int[]{numElements}),
    Pointer.to(deviceInputA),
    Pointer.to(deviceInputB),
    Pointer.to(deviceOutput)
);

// Call the kernel function.
cuLaunchKernel(function, 
    gridSizeX,  1, 1,      // Grid dimension 
    blockSizeX, 1, 1,      // Block dimension
    0, null,               // Shared memory size and stream 
    kernelParameters, null // Kernel- and extra parameters
);

However, one has to take the same care here as in C: The number of pointer indirections has to be verified carefully. Having to create a pointer to a pointer to a pointer in order to pass a pointer as one parameter to a kernel may look confusing at first glance, but the existing sample programs should help to get this right, and afterwards, the same pattern can be applied to nearly all kernel launches.
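For reference, the remaining variables that appear in the snippets above (the device pointers, gridSizeX and blockSizeX) could be set up as sketched below. This roughly follows the JCudaVectorAdd sample; the number of elements, the block size and the host input arrays are only chosen for illustration:

// Allocate device memory and copy the input data to the device
int numElements = 100000;
float hostInputA[] = new float[numElements];
float hostInputB[] = new float[numElements];
CUdeviceptr deviceInputA = new CUdeviceptr();
cuMemAlloc(deviceInputA, numElements * Sizeof.FLOAT);
cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA), numElements * Sizeof.FLOAT);
CUdeviceptr deviceInputB = new CUdeviceptr();
cuMemAlloc(deviceInputB, numElements * Sizeof.FLOAT);
cuMemcpyHtoD(deviceInputB, Pointer.to(hostInputB), numElements * Sizeof.FLOAT);
CUdeviceptr deviceOutput = new CUdeviceptr();
cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT);

// Choose a block size and compute a grid size that covers all elements
int blockSizeX = 256;
int gridSizeX = (numElements + blockSizeX - 1) / blockSizeX;

// After the launch: wait for the kernel to complete, and copy the result back
cuCtxSynchronize();
float hostOutput[] = new float[numElements];
cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput, numElements * Sizeof.FLOAT);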