Let's say I want to compute a running horizontal average along the x-axis of an image.
Func g;
g(x,y) = (img(x-1,y) + img(x,y) + img(x+1,y))/3.f;
h(x,y) = cast<uint8_t>(g(x,y) + 0.5f);
Using float32 for g(x,y) seems like overkill, but I do care
about precision, so integer division is not preferred.
Can I use float16_t instead of float32 to gain more throughput?
Could it be done this way?
Expr three = cast<float16_t>(3.f);
Expr point5 = cast<float16_t>(0.5f);
g(x,y) = (img(x-1,y) + img(x,y) + img(x+1,y))/three;
h(x,y) = cast<uint8_t>(g(x,y) + point5);
I'm going to use an auto-scheduler to do the job. It seems that AVX2 has the ability to process float16_t in parallel. Will there be a problem if this code is generated for the x86_64-sse4.1 target?
float16 conversions exist on AVX2, but the hardware doesn't actually do float16 math in parallel, so it'll be slow. I recommend using uint16 instead for this sort of thing. It's actually more precise than using floats for the code you've given:
Func in16, g;
in16(x, y) = cast<uint16_t>(img(x, y));
g(x,y) = in16(x-1,y) + in16(x,y) + in16(x+1,y);
h(x,y) = cast<uint8_t>((g(x,y) + 1)/3);
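To see why adding 1 before the integer division loses nothing, here is a minimal plain C++ sketch (no Halide needed) that compares the two roundings on every possible triple of 8-bit pixels. The function names avg3_float, avg3_u16 and count_mismatches are made up for this illustration, and it assumes the input image is uint8.

```cpp
#include <cstdint>

// Reference: float32 average, rounded by adding 0.5 and truncating
// (the approach from the question).
inline uint8_t avg3_float(uint8_t a, uint8_t b, uint8_t c) {
    float sum = float(a) + float(b) + float(c);
    return static_cast<uint8_t>(sum / 3.f + 0.5f);
}

// Integer version: widen to 16 bits, add 1, then divide by 3.
// The sum is at most 765, so it fits easily in a uint16_t.
inline uint8_t avg3_u16(uint8_t a, uint8_t b, uint8_t c) {
    uint16_t sum = static_cast<uint16_t>(a + b + c);
    return static_cast<uint8_t>((sum + 1) / 3);
}

// Counts the input triples on which the two versions disagree.
inline long count_mismatches() {
    long mismatches = 0;
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            for (int c = 0; c < 256; ++c)
                if (avg3_float(uint8_t(a), uint8_t(b), uint8_t(c)) !=
                    avg3_u16(uint8_t(a), uint8_t(b), uint8_t(c)))
                    ++mismatches;
    return mismatches;
}
```

The reasoning: a sum of the form 3q or 3q+1 should round down to q, and 3q+2 should round up to q+1, which is exactly what (sum + 1)/3 gives with truncating division.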
The division operation will use the x86 vector instruction pmulhuw, so it'll be fast.
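For anyone curious how a division by 3 turns into a multiply, here is a rough scalar sketch of the multiply-high trick. The constant 43691 (ceil(2^17 / 3)) and the extra shift by 1 are the textbook choice for 16-bit operands; whether Halide emits exactly this constant and shift is an assumption on my part. The names mulhi_u16, div3_u16 and count_div3_mismatches are made up for the illustration.

```cpp
#include <cstdint>

// High 16 bits of a 16x16-bit unsigned product. This is what one
// lane of pmulhuw computes.
inline uint16_t mulhi_u16(uint16_t a, uint16_t b) {
    return static_cast<uint16_t>((uint32_t(a) * uint32_t(b)) >> 16);
}

// x / 3 without a divide instruction: multiply by ceil(2^17 / 3),
// keep the high half (a shift by 16), then shift right once more.
inline uint16_t div3_u16(uint16_t x) {
    return static_cast<uint16_t>(mulhi_u16(x, 43691) >> 1);
}

// Counts the 16-bit inputs for which the trick differs from x / 3.
inline long count_div3_mismatches() {
    long mismatches = 0;
    for (uint32_t x = 0; x <= 65535; ++x)
        if (div3_u16(static_cast<uint16_t>(x)) != x / 3)
            ++mismatches;
    return mismatches;
}
```

Since the multiply and the shift are both cheap vector operations, the whole division stays in 16-bit lanes with no conversion to float.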