I have written a simple video conferencing app that uses multiple threads for video and audio mixing. I use libavcodec (ffmpeg) codecs for mixing video. As far as I know, libavcodec uses SSE instructions to achieve high performance. For audio mixing, I use a simple algorithm that just adds the samples. I wrote the addition as a simple for loop in C++, but now I want to optimize it using SSE instructions like this:
    __m128i* d = (__m128i*) pOutBuffer;
    __m128i* s = (__m128i*) pInBuffer;
    for (DWORD n = (DWORD)(nSizeToMix + 7) >> 3; n != 0; --n, ++d, ++s)
    {
        // Load data into SSE registers
        __m128i xmm1 = _mm_load_si128(d);
        __m128i xmm2 = _mm_load_si128(s);
        // SSE2 sum
        _mm_store_si128(d, _mm_add_epi16(xmm1, xmm2));
    }
Audio mixing is done in a separate thread, simultaneously with video mixing. When I use SSE instructions, the app suddenly crashes at a position unrelated to audio mixing, in the encoding/decoding of video.
It seems that because libavcodec uses SSE registers and instructions, my code conflicts with it. Is there any way to use SSE instructions without conflicting with libavcodec (ffmpeg)? Any suggestions appreciated.
Context switches should be OK as long as you're using a modern compiler (anything from the last 10 years or so) and you aren't coding in assembly. The OS saves and restores the SSE registers for each thread, and compilers know the ABIs for their target platforms so you don't have to.
If you've included the exact code that crashed your app, the most likely reason is an alignment issue: _mm_load_si128 and _mm_store_si128 require 16-byte-aligned addresses. Replace _mm_load_si128 with _mm_loadu_si128 and _mm_store_si128 with _mm_storeu_si128, and see if it helps.
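To make the suggested replacement concrete, here is a minimal sketch of the same loop with unaligned loads and stores. The function name mixUnaligned and its signature are hypothetical (not from your code), and it assumes 16-bit samples and a sample count that is a multiple of 8:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds `count` 16-bit samples from `in` into `out`.
// Assumption: `count` is a multiple of 8. The buffers may have any alignment.
void mixUnaligned(int16_t* out, const int16_t* in, std::size_t count)
{
    for (std::size_t i = 0; i < count; i += 8)
    {
        // The "u" variants do not fault on addresses that are not 16-byte aligned.
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(out + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
        // Same wrapping 16-bit addition as in the question.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi16(a, b));
    }
}
```

If this version stops the crash, alignment was the problem. You can then either keep the unaligned intrinsics or allocate the buffers with 16-byte alignment and go back to the aligned ones.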
Update 1: another possible reason is that the SSE version completes too fast, and this exposes a concurrency bug. Try adding e.g. a Sleep( 2 ) call after the loop; if video then works OK, it means you need to fix the code that pushes or pulls the data across threads.
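Update 2: As Alan pointed out, the size of the arrays (buffers) may not be a multiple of 16 bytes, while the loop always touches 16 * ((nSizeToMix + 7) / 8) bytes. In that case it reads and writes past the end of the buffers, which can crash your app or corrupt memory, possibly somewhere unrelated such as the video code.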
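One way to avoid the overrun is to process only the full groups of 8 samples with SSE and finish the remainder with a scalar loop. This is a sketch under the same assumptions as before (16-bit samples, nSizeToMix counted in samples); the name mixWithTail is hypothetical:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds `count` 16-bit samples from `in` into `out`
// without touching anything beyond `count` samples in either buffer.
void mixWithTail(int16_t* out, const int16_t* in, std::size_t count)
{
    std::size_t i = 0;

    // Vector part: only complete groups of 8 samples (16 bytes).
    for (; i + 8 <= count; i += 8)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(out + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_add_epi16(a, b));
    }

    // Scalar tail: the remaining 0..7 samples, with the same wrapping
    // behaviour as _mm_add_epi16.
    for (; i < count; ++i)
    {
        out[i] = static_cast<int16_t>(
            static_cast<uint16_t>(out[i]) + static_cast<uint16_t>(in[i]));
    }
}
```

The alternative is to keep the rounded-up loop from the question and allocate both buffers with their size rounded up to a multiple of 16 bytes, so that the extra samples land in padding rather than in someone else's memory.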