TLDR: How can I run many small kernels, one at a time, without significant overhead?
I'm working on a project that acts as a virtual green screen. It takes in an image feed, looks for pixels similar to a color key, and replaces those pixels with a replacement color. I plan to output the resulting image feed as a virtual webcam on Windows. The full source code is on GitHub. Currently, I'm using OpenCL bindings for Java (JOCL) to accelerate the process. The main application is written in JavaFX with Kotlin, which I'm comfortable with, but the OpenCL kernels are written in C, which I'm new to.
This is the main "API" I created for the program. I tried to keep the API interface relatively open so I can add direct CUDA support in the future.
class OpenClApi constructor(
    platformIndex: Int = 0,
    deviceIndex: Int = 0,
    val localWorkSize: Long? = null
) : AbstractApi {

    companion object : AbstractApi.AbstractApiConsts {
        override val listName = "OpenCl"

        enum class ClMemOperation(val flags: Long) {
            // CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR speeds up most operations for realtime video
            READ(CL_MEM_READ_ONLY or CL_MEM_USE_HOST_PTR),
            WRITE(CL_MEM_WRITE_ONLY)
        }

        private fun getPlatforms(): Array<cl_platform_id?> {
            val numPlatformsArray = IntArray(1)
            clGetPlatformIDs(0, null, numPlatformsArray)
            val numPlatforms = numPlatformsArray[0]
            val platforms = arrayOfNulls<cl_platform_id>(numPlatforms)
            clGetPlatformIDs(platforms.size, platforms, null)
            return platforms
        }

        private fun getPlatform(platformId: Int) = getPlatforms()[platformId]
            ?: throw ArrayIndexOutOfBoundsException("Couldn't find the specified platform")

        fun getPlatformsMap(): Map<Int, String> {
            val platforms = getPlatforms()
            val result = mutableMapOf<Int, String>()
            for (platformId in platforms.indices) {
                val platformFromList = platforms[platformId]
                val size = LongArray(1)
                clGetPlatformInfo(platformFromList, CL_PLATFORM_NAME, 0, null, size)
                val buffer = ByteArray(size[0].toInt())
                clGetPlatformInfo(platformFromList, CL_PLATFORM_NAME, buffer.size.toLong(), Pointer.to(buffer), null)
                result[platformId] = String(buffer, 0, buffer.size - 1)
            }
            return result
        }

        private fun getDevices(platformId: Int): Array<cl_device_id?> {
            val platform = getPlatform(platformId)
            val numDevicesArray = IntArray(1)
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, null, numDevicesArray)
            val numDevices = numDevicesArray[0]
            val devices = arrayOfNulls<cl_device_id>(numDevices)
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, numDevices, devices, null)
            return devices
        }

        private fun getDevice(platformId: Int, deviceId: Int) = getDevices(platformId)[deviceId]
            ?: throw ArrayIndexOutOfBoundsException("Couldn't find the specified platform or device")

        fun getDevicesMap(platformId: Int): Map<Int, String> {
            val devices = getDevices(platformId)
            val result = mutableMapOf<Int, String>()
            for (deviceId in devices.indices) {
                val deviceFromList = devices[deviceId]
                val size = LongArray(1)
                clGetDeviceInfo(deviceFromList, CL_DEVICE_NAME, 0, null, size)
                val buffer = ByteArray(size[0].toInt())
                clGetDeviceInfo(deviceFromList, CL_DEVICE_NAME, buffer.size.toLong(), Pointer.to(buffer), null)
                result[deviceId] = String(buffer, 0, buffer.size - 1)
            }
            return result
        }
    }

    private val platform: cl_platform_id = getPlatform(platformIndex)

    // The platform property must be set before clCreateContext runs; setting it in the
    // init block (which executes after the property initializers) had no effect.
    private val contextProperties: cl_context_properties = cl_context_properties().apply {
        addProperty(CL_CONTEXT_PLATFORM.toLong(), platform)
    }
    private val device: cl_device_id = getDevice(platformIndex, deviceIndex)
    private val context: cl_context = clCreateContext(contextProperties, 1, arrayOf(device), null, null, null)
    val commandQueue: cl_command_queue
    val program: cl_program

    init {
        setExceptionsEnabled(true)
        val properties = cl_queue_properties()
        commandQueue = clCreateCommandQueueWithProperties(context, device, properties, null)
        val sources = arrayOf(
            "Util",
            "InitialComparison",
            "NoiseReduction",
            "FlowKey",
            "Splash",
            "SplashPrep"
        ).map {
            this::class.java.getResource("$it.cl")!!.readText()
        }.toTypedArray()
        program = clCreateProgramWithSource(context, sources.size, sources, null, null)
        clBuildProgram(program, 0, null, null, null, null)
    }

    override fun getFilters(): Map<String, AbstractFilter> = mapOf(
        OpenClInitialComparisonFilter.listName to OpenClInitialComparisonFilter(api = this),
        OpenClNoiseReductionFilter.listName to OpenClNoiseReductionFilter(api = this),
        OpenClFlowKeyFilter.listName to OpenClFlowKeyFilter(api = this),
        OpenClSplashFilter.listName to OpenClSplashFilter(api = this),
    )

    override fun close() {
        clReleaseProgram(program)
        clReleaseCommandQueue(commandQueue)
        clReleaseContext(context)
    }

    fun allocMem(ptr: Pointer?, op: ClMemOperation, size: Int): cl_mem = clCreateBuffer(
        context,
        op.flags,
        size.toLong(),
        ptr,
        null
    )
}
This is an example "Filter" that processes a frame using an API instance.
class OpenClInitialComparisonFilter @Suppress("LongParameterList") constructor(
    private val api: OpenClApi,
    var colorKey: ByteArray = byteArrayOf(0, 255.toByte(), 0),
    var replacementKey: ByteArray = byteArrayOf(0, 255.toByte(), 0),
    var percentTolerance: Float = 0.025f,
    var colorSpace: ColorSpace = ColorSpace.ALL,
    var width: Int = DEFAULT_WIDTH_PIXELS,
    var height: Int = DEFAULT_HEIGHT_PIXELS
) : AbstractFilter {

    companion object : AbstractFilterConsts {
        override val listName = "Initial Comparison"

        private const val KERNEL_NAME = "initialComparisonKernel"
    }

    override fun getProperties(): Map<AbstractFilterProperty, Any> = mapOf(
        AbstractFilterProperty.TOLERANCE to percentTolerance,
        AbstractFilterProperty.COLOR_KEY to colorKey,
        AbstractFilterProperty.REPLACEMENT_KEY to replacementKey,
        AbstractFilterProperty.COLOR_SPACE to colorSpace
    )

    override fun setProperty(listName: String, newValue: Any) = when (listName) {
        AbstractFilterProperty.TOLERANCE.listName -> percentTolerance = newValue as Float
        AbstractFilterProperty.COLOR_KEY.listName -> colorKey = newValue as ByteArray
        AbstractFilterProperty.REPLACEMENT_KEY.listName -> replacementKey = newValue as ByteArray
        AbstractFilterProperty.COLOR_SPACE.listName -> colorSpace = newValue as ColorSpace
        else -> throw ArrayIndexOutOfBoundsException("Couldn't find property $listName")
    }

    @Suppress("LongMethod")
    override fun apply(inputBuffer: ByteArray): ByteArray {
        val outputBuffer = ByteArray(size = inputBuffer.size)
        val floatOptionsBuffer = floatArrayOf(percentTolerance)
        val intOptionsBuffer = intArrayOf(colorSpace.i, width, height)
        val inputPtr = Pointer.to(inputBuffer)
        val outputPtr = Pointer.to(outputBuffer)
        val colorKeyPtr = Pointer.to(colorKey)
        val replacementKeyPtr = Pointer.to(replacementKey)
        val floatOptionsPtr = Pointer.to(floatOptionsBuffer)
        val intOptionsPtr = Pointer.to(intOptionsBuffer)
        val inputMem = api.allocMem(inputPtr, ClMemOperation.READ, Sizeof.cl_char * inputBuffer.size)
        val outputMem = api.allocMem(null, ClMemOperation.WRITE, Sizeof.cl_char * outputBuffer.size)
        val colorKeyMem = api.allocMem(colorKeyPtr, ClMemOperation.READ, Sizeof.cl_char * colorKey.size)
        val replacementKeyMem = api.allocMem(
            replacementKeyPtr,
            ClMemOperation.READ,
            Sizeof.cl_char * replacementKey.size
        )
        val floatOptionsMem = api.allocMem(
            floatOptionsPtr,
            ClMemOperation.READ,
            Sizeof.cl_float * floatOptionsBuffer.size
        )
        val intOptionsMem = api.allocMem(intOptionsPtr, ClMemOperation.READ, Sizeof.cl_int * intOptionsBuffer.size)
        val kernel = clCreateKernel(api.program, KERNEL_NAME, null)
        var a = 0
        clSetKernelArg(kernel, a++, Sizeof.cl_mem.toLong(), Pointer.to(inputMem))
        clSetKernelArg(kernel, a++, Sizeof.cl_mem.toLong(), Pointer.to(outputMem))
        clSetKernelArg(kernel, a++, Sizeof.cl_mem.toLong(), Pointer.to(colorKeyMem))
        clSetKernelArg(kernel, a++, Sizeof.cl_mem.toLong(), Pointer.to(replacementKeyMem))
        clSetKernelArg(kernel, a++, Sizeof.cl_mem.toLong(), Pointer.to(floatOptionsMem))
        clSetKernelArg(kernel, a, Sizeof.cl_mem.toLong(), Pointer.to(intOptionsMem))
        val globalWorkSizeBuffer = api.localWorkSize?.let {
            longArrayOf(ceil(inputBuffer.size / it.toFloat()).toLong() * it)
        } ?: longArrayOf(inputBuffer.size.toLong())
        val localWorkSizeBuffer = api.localWorkSize?.let { longArrayOf(api.localWorkSize) }
        clEnqueueNDRangeKernel(
            api.commandQueue,
            kernel,
            1,
            null,
            globalWorkSizeBuffer,
            localWorkSizeBuffer,
            0,
            null,
            null
        )
        clEnqueueReadBuffer(
            api.commandQueue,
            outputMem,
            CL_TRUE,
            0,
            (inputBuffer.size * Sizeof.cl_char).toLong(),
            outputPtr,
            0,
            null,
            null
        )
        clReleaseMemObject(inputMem)
        clReleaseMemObject(outputMem)
        clReleaseMemObject(colorKeyMem)
        clReleaseMemObject(replacementKeyMem)
        clReleaseMemObject(floatOptionsMem)
        clReleaseMemObject(intOptionsMem)
        clReleaseKernel(kernel)
        return outputBuffer
    }
}
Here is an example of the InitialComparison kernel, which looks for and replaces similar pixels.
enum ColorSpace {
    BLUE = 0,
    GREEN = 1,
    RED = 2,
    ALL = 3
};

enum FloatOptions {
    PERCENT_TOLERANCE = 0,
    GRADIENT_TOLERANCE = 1
};

enum IntOptions {
    COLOR_SPACE = 0,
    WIDTH = 1,
    HEIGHT = 2,
    BLOCK_SIZE = 3
};

float calcColorDiff(
    const char *a,
    const int i,
    const char *b,
    const int j,
    const int colorSpace
) {
    float colorDiff[3];
    for (int k = 0; k < 3; k++) {
        colorDiff[k] = abs(a[i + k] - b[j + k]);
    }
    if (colorSpace < 3) {
        return colorDiff[colorSpace] / 255.0;
    } else {
        float percentDiff = 0.0;
        for (int k = 0; k < 3; k++) {
            percentDiff += colorDiff[k] / 765.0;
        }
        return percentDiff;
    }
}

void writePixel(
    char *canvas,
    const int i,
    const char *ink,
    const int j
) {
    for (int k = 0; k < 3; k++) {
        canvas[i + k] = ink[j + k];
    }
}

__kernel void initialComparisonKernel(
    __global const char *input,
    __global char *output,
    __global const char *colorKey,
    __global const char *replacementKey,
    __global const float *floatOptions,
    __global const int *intOptions
) {
    float percentTolerance = floatOptions[PERCENT_TOLERANCE];
    int colorSpace = intOptions[COLOR_SPACE];
    int gid = get_global_id(0);
    if (gid % 3 == 0) {
        float percentDiff = calcColorDiff(input, gid, colorKey, 0, colorSpace);
        if (percentDiff < percentTolerance) {
            writePixel(output, gid, replacementKey, 0);
        } else {
            writePixel(output, gid, input, gid);
        }
    }
}
It works pretty well! It's much faster than running on a CPU, even with Java's multithreaded ExecutorService. On top of the comparison, I also run two additional filters: NoiseReduction, which removes green-screen pixels that aren't mostly surrounded by other green-screen pixels, and FlowKey, which fills in the gaps between green-screen pixels.
int checkPixelEquality(
    const char *input,
    const int i,
    const char *colorKey
) {
    int diffSum = 0;
    for (int j = 0; j < 3; j++) {
        diffSum += abs(input[i + j] - colorKey[j]);
    }
    if (diffSum == 0) {
        return 1;
    } else {
        return 0;
    }
}

__kernel void noiseReductionKernel(
    __global const char *input,
    __global char *output,
    __global const char *template,
    __global const char *colorKey,
    __global const int *intOptions
) {
    int width = intOptions[WIDTH];
    int height = intOptions[HEIGHT];
    int gid = get_global_id(0);
    if (gid % 3 == 0) {
        int anchorEquality = checkPixelEquality(input, gid, colorKey);
        if (anchorEquality == 1) {
            int surroundingPixels = 0;
            if ((gid / 3) % width == 0) {
                surroundingPixels += 1;
            } else {
                surroundingPixels += checkPixelEquality(input, gid - 3, colorKey);
            }
            if ((gid / 3) % width == width - 1) {
                surroundingPixels += 1;
            } else {
                surroundingPixels += checkPixelEquality(input, gid + 3, colorKey);
            }
            if ((gid / 3) / width == 0) {
                surroundingPixels += 1;
            } else {
                surroundingPixels += checkPixelEquality(input, gid - (width * 3), colorKey);
            }
            if ((gid / 3) / width == height - 1) {
                surroundingPixels += 1;
            } else {
                surroundingPixels += checkPixelEquality(input, gid + (width * 3), colorKey);
            }
            if (surroundingPixels < 3) {
                writePixel(output, gid, template, gid);
            } else {
                writePixel(output, gid, colorKey, 0);
            }
        } else {
            writePixel(output, gid, template, gid);
        }
    } else {
        writePixel(output, gid, template, gid);
    }
}
__kernel void flowKeyKernel(
    __global const char *input,
    __global char *output,
    __global const char *template,
    __global const char *colorKey,
    __global const float *floatOptions,
    __global const int *intOptions
) {
    float gradientTolerance = floatOptions[GRADIENT_TOLERANCE];
    int colorSpace = intOptions[COLOR_SPACE];
    int width = intOptions[WIDTH];
    int height = intOptions[HEIGHT];
    int gid = get_global_id(0);
    if (gid % 3 == 0) {
        if (checkPixelEquality(input, gid, colorKey) == 0) {
            if (
                (gid / 3) % width != 0 &&
                checkPixelEquality(input, gid - 3, colorKey) == 1 &&
                calcColorDiff(input, gid, template, gid - 3, colorSpace) > gradientTolerance
            ) {
                writePixel(output, gid, colorKey, 0);
                return;
            }
            if (
                (gid / 3) % width != width - 1 &&
                checkPixelEquality(input, gid + 3, colorKey) == 1 &&
                calcColorDiff(input, gid, template, gid + 3, colorSpace) > gradientTolerance
            ) {
                writePixel(output, gid, colorKey, 0);
                return;
            }
            if (
                (gid / 3) / width != 0 &&
                checkPixelEquality(input, gid - (width * 3), colorKey) == 1 &&
                calcColorDiff(input, gid, template, gid - (width * 3), colorSpace) > gradientTolerance
            ) {
                writePixel(output, gid, colorKey, 0);
                return;
            }
            if (
                (gid / 3) / width != height - 1 &&
                checkPixelEquality(input, gid + (width * 3), colorKey) == 1 &&
                calcColorDiff(input, gid, template, gid + (width * 3), colorSpace) > gradientTolerance
            ) {
                writePixel(output, gid, colorKey, 0);
                return;
            }
            writePixel(output, gid, template, gid);
        } else {
            writePixel(output, gid, colorKey, 0);
        }
    } else {
        writePixel(output, gid, colorKey, 0);
    }
}
The issue is that these kernels' runtime is insignificant compared to the cost of queueing them through clEnqueueNDRangeKernel. The launch overhead across all of the kernels is therefore too large, resulting in frame latency. Each filter must be run one at a time until the image has been fully processed.
My current understanding of OpenCL is that within a queued kernel, each workgroup will get queued without any specific order and without any guarantee of total concurrency. Because these filters must be applied one at a time across the entire image, the only option I can think of is to queue many small kernels.
I've tried aggregating all the kernels into one large kernel (code below), but this approach has its own issues:
__kernel void openClKernel(
    __global const char *input,
    __global char *output,
    __global const char *colorKey,
    __global const char *replacementKey,
    __global const float *floatOptions,
    __global const int *intOptions,
    __global char *tmpActive,
    __global char *tmpStale
) {
    float tolerance = floatOptions[TOLERANCE];
    float flowKeyTolerance = floatOptions[FLOW_KEY_TOLERANCE];
    int colorSpace = intOptions[COLOR_SPACE];
    int width = intOptions[WIDTH];
    int height = intOptions[HEIGHT];
    int initialNoiseReductionIterations = intOptions[INITIAL_NOISE_REDUCTION_ITERATIONS];
    int flowKeyIterations = intOptions[FLOW_KEY_ITERATIONS];
    int finalNoiseReductionIterations = intOptions[FINAL_NOISE_REDUCTION_ITERATIONS];
    int gid = get_global_id(0);
    if (gid % 3 == 0) {
        applyInitialComparison(input, tmpActive, colorKey, replacementKey, tolerance, colorSpace, gid);
        writePixel(tmpStale, gid, tmpActive, gid);
        for (int i = 0; i < initialNoiseReductionIterations; i++) {
            applyNoiseReduction(tmpStale, tmpActive, input, replacementKey, width, height, gid);
            writePixel(tmpStale, gid, tmpActive, gid);
        }
        for (int i = 0; i < flowKeyIterations; i++) {
            applyFlowKey(tmpStale, tmpActive, input, replacementKey, flowKeyTolerance, colorSpace, width, height, gid);
            writePixel(tmpStale, gid, tmpActive, gid);
        }
        for (int i = 0; i < finalNoiseReductionIterations; i++) {
            applyNoiseReduction(tmpStale, tmpActive, input, replacementKey, width, height, gid);
            writePixel(tmpStale, gid, tmpActive, gid);
        }
        writePixel(output, gid, tmpStale, gid);
    }
}
That being said, the aggregate kernel ran easily 1,000 times faster than the split kernels (I implemented a little frame-latency counter). This signals to me that the overhead of queueing a kernel is way too large as compared to the task at hand.
What can I do to optimize this program? Is there a way to efficiently queue many small kernels? Is there a way to restructure the kernels to run concurrently? Please also let me know how I can improve the quality of my question if needed.
Thanks!
For every kernel launch there is a fixed amount of overhead, say 1 millisecond. The overhead comes partly from loading the kernel's instructions, but mainly from synchronizing all threads at the end of the kernel. So if you launch lots of small kernels that each take 1 ms of execution time, half of the total time is lost to overhead. If you aggregate many small kernels into one that runs for 9 ms, the overhead drops to 10%.
So aggregate as many small kernels into one as data movement allows without running into race conditions. Also make sure the range of each kernel is as large as possible.
Alternatively, you could use multiple queues in parallel. A kernel with a small range does not saturate the GPU, so part of the GPU sits idle at any given time. With multiple concurrent queues, the kernels in those queues can run concurrently and together saturate the hardware. However, with this approach you still pay the overhead losses.