I'm trying to implement demosaic algorithm for Bayer filter as shown in section 2.8 (page 8) of this pdf: http://www.arl.army.mil/arlreports/2010/ARL-TR-5061.pdf. I'm stuck on trying to realize a function over an RDom. When I use a 16x16 image, the traces actually finish, but when I use a larger image like 768x1280, the trace gets stuck on:
Store green.0(767, 1279);
Following is the simplified version of my code:
#include "Halide.h"
#include<stdio.h>
#include<stdlib.h>
#include "halide_image_io.h"
using namespace Halide;
int main(int argc, char **argv) {
Buffer<uint8_t> input = Tools::load_image(argv[1]);
RDom r(2, input.width() - 4, 2, input.height() - 4);
r.where((r.x % 2 == 0 && r.y % 2 == 0) || (r.x % 2 == 1 && r.y % 2 == 1));
Var x("x"), y("y");
Func g_n, w_n, g_n_est, green("green");
g_n(x, y) = cast<float> (0);
w_n(x, y) = cast<float> (0);
g_n_est(x, y) = cast<float> (0);
green(x, y) = cast<uint8_t> (0);
printf("width: %d\n", input.width());
printf("height: %d\n", input.height());
printf("channels: %d\n", input.channels());
g_n(r.x, r.y) = abs(cast<float>(input(r.x, r.y + 1) - input(r.x, r.y - 1))) + abs(cast<float>(input(r.x, r.y) - input(r.x, r.y - 2)));
w_n(r.x, r.y) = cast<float>(1 / (1 + g_n(r.x, r.y)));
g_n_est(r.x, r.y) = cast<float>(input(r.x, r.y - 1) + (input(r.x, r.y) - input(r.x, r.y - 2))) / 2;
green(r.x, r.y) = cast<uint8_t>(w_n(r.x, r.y) * g_n_est(r.x, r.y));
green.trace_stores();
Buffer<uint8_t> temp = green.realize(input.width(), input.height());
Tools::save_image(temp, "result.png");
}
Is this a bug in Halide? In this instance, the code finishes executing and saves the output image for a 16x16 input but gets stuck in the trace for larger images.
This is just a really, really inefficient schedule. Every stage each compute O(n) pixels in their update definitions any time they are realized (that's now big the RDom r
is), but the stages are also each inlined into the next. As a result, every point in green
is recursively computing a whole image of g_n_est
and w_n
, and then for each of their pixels it's recursively computing a whole image of g_n
.
What you see as a stall at green.0(767, 1023)
is actually right after it has finished computing the pure definition of green(x,y) = 0
for the last pixel, at which point it starts taking forever to actually compute all of the update stages because of the O(n^3) work it's doing.
This is a case where getting aggressive turning on more tracing would make the problem clearer. You can turn tracing of realizations or individual stores on globally when you configure the compile: https://github.com/halide/Halide/wiki/Debugging-Tips#tracing.
For this code, scheduling the earlier stages as compute_root
may be what you want, though you may actually want the g_n_est
and w_n
definitions to be simple pure functions (no RDom
s at all) that can be fused into green
, scheduled in blocks, etc.