multithreading ubuntu audio virtualbox ogg-theora

Why are 2 threads slower than 1 when encoding with Theora in virtualized Ubuntu?


I am trying to use multithreaded Theora encoding, but in an Ubuntu guest running under VirtualBox the encode is actually slower with 2 threads than with 1. I took the svn theora-multithread version of the Theora encoder, as described in this question. My hardware is an Intel i7 Haswell with 2 cores, and I've configured VirtualBox for 2 CPUs. Why are the results not as expected? I expected the multithreaded encode to be faster, but it is much slower.

developer@developer-VirtualBox:~/theora-multithread/examples$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Stepping:              1
CPU MHz:               0.000
BogoMIPS:              1687.55
L1d cache:             32K
L1d cache:             32K
L2d cache:             6144K
NUMA node0 CPU(s):     0,1
developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 1 wavesound.wav tmp.yuv -o TEST-1-thread.ogg
File wavesound.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 1
Compressing....
      0:46:32.08 audio: 66kbps video: 3kbps                 
done.


real    0m23.907s
user    0m12.319s
sys 0m1.623s
developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 2 wavesound.wav tmp.yuv -o TEST-2-thread.ogg
File wavesound.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 2
Compressing....
      0:46:32.08 audio: 66kbps video: 3kbps                 
done.


real    1m7.882s
user    0m22.370s
sys 0m33.304s
developer@developer-VirtualBox:~/theora-multithread/examples$ 

CPU-Z in the host OS (Win 8.1) reports the following about the hardware.

Processor 1         ID = 0
    Number of cores     2 (max 8)
    Number of threads   4 (max 16)
    Name            Intel Core i3/i5/i7 4xxx
    Codename        Haswell ULT
    Specification       Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    Package (platform ID)   Socket 1168 BGA (0x6)
    CPUID           6.5.1
    Extended CPUID      6.45
    Core Stepping       C0
    Technology      22 nm
    TDP Limit       28 Watts
    Tjmax           100.0 °C
    Core Speed      798.4 MHz
    Multiplier x Bus Speed  8.0 x 99.8 MHz
    Stock frequency     2800 MHz
    Instructions sets   MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3
    L1 Data cache       2 x 32 KBytes, 8-way set associative, 64-byte line size
    L1 Instruction cache    2 x 32 KBytes, 8-way set associative, 64-byte line size
    L2 cache        2 x 256 KBytes, 8-way set associative, 64-byte line size
    L3 cache        4 MBytes, 16-way set associative, 64-byte line size
    FID/VID Control     yes

Test 2

With a larger audio file (the video is just a dummy video created from a static PNG), the difference appears to be smaller.

developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 1 140715MIX.wav tmp.yuv -o OGGTEST-1-threadv1.ogg
File 140715MIX.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 1
Compressing....
      0:53:30.76 audio: 73kbps video: 3kbps                 
done.


real    5m20.807s
user    3m21.943s
sys 0m6.727s
developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 2 140715MIX.wav tmp.yuv -o OGGTEST-2-threadv2.ogg
File 140715MIX.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 2
Compressing....
      0:53:30.76 audio: 73kbps video: 3kbps                 
done.


real    6m8.159s
user    3m45.750s
sys 0m27.579s

Test 3 (video only)

When testing video only, I could reproduce a small speedup with 2 threads:

developer@developer-VirtualBox:~/theora-multithread$ time ./examples/encoder_example -v 1 -a 1 --number-of-threads 1 stream.yuv > theora_testfile_1.ogg
File stream.yuv is 320x240 15.00 fps YUV12 video.
Number of Threads: 1
Compressing....
      0:00:07.60 audio: 0kbps video: 138kbps                 
done.


real    0m2.136s
user    0m1.920s
sys 0m0.083s
developer@developer-VirtualBox:~/theora-multithread$ time ./examples/encoder_example -v 1 -a 1 --number-of-threads 2 stream.yuv > theora_testfile_2.ogg
File stream.yuv is 320x240 15.00 fps YUV12 video.
Number of Threads: 2
Compressing....
      0:00:07.60 audio: 0kbps video: 139kbps                 
done.


real    0m2.043s
user    0m1.994s
sys 0m0.175s
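
To put the three tests side by side, here is a minimal Python sketch that computes the 2-thread speedup from the `real` times and shows how the `sys` time changes. The timings are copied from the runs above; the dictionary layout and the `speedup` helper are only illustrative names, not part of the encoder.

```python
# (real, sys) times in seconds, copied from the `time` output of each run.
# Keys 1 and 2 are the values passed to --number-of-threads.
runs = {
    "test 1 (48x48 video + short audio)": {1: (23.907, 1.623), 2: (67.882, 33.304)},
    "test 2 (48x48 video + long audio)": {1: (320.807, 6.727), 2: (368.159, 27.579)},
    "test 3 (320x240 video only)": {1: (2.136, 0.083), 2: (2.043, 0.175)},
}


def speedup(real_1, real_2):
    """Speedup of the 2-thread run over the 1-thread run (>1 means faster)."""
    return real_1 / real_2


for name, timings in runs.items():
    (real_1, sys_1), (real_2, sys_2) = timings[1], timings[2]
    print(f"{name}: speedup {speedup(real_1, real_2):.2f}x, "
          f"sys time {sys_1:.3f}s -> {sys_2:.3f}s")
```

With these numbers the speedups come out at roughly 0.35x, 0.87x and 1.05x, so only the video-only run benefits from the second thread, and only slightly.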

Solution

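  • If I had to guess, the overhead of creating the threads and of context switching costs more than the work itself. The jump in sys time in your first test (from about 1.6 s to about 33 s) is consistent with that: the extra time is being spent in the kernel rather than in the encoder. With 48x48 frames there may simply be too little work per frame to be worth splitting across threads.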

    Keep in mind that kernel threads are more expensive to create and to switch between than user-level threads. If you can, avoid kernel-level threading.

    For better performance, run larger tasks concurrently and avoid operations that trigger context switches (such as waiting on resources or blocking).

    Also, reuse threads. Creating a new thread for each task can hurt the performance of your application; a thread pool avoids paying the creation cost over and over.
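
As a rough illustration of the last point, here is a minimal Python sketch contrasting one thread per task with a small reusable pool. It is not how the theora-multithread encoder is implemented (I have not checked its threading code); `encode_chunk` is a hypothetical stand-in for one small unit of work, and the function names are made up for the example. Note also that CPython's GIL prevents CPU-bound threads from running in parallel, so this only demonstrates thread reuse, not a speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def encode_chunk(chunk_id):
    """Hypothetical stand-in for one small unit of encoding work."""
    checksum = sum(i * i for i in range(1000)) + chunk_id
    # Also report which thread did the work, so reuse can be observed.
    return chunk_id, checksum, threading.get_ident()


def run_with_new_threads(n_chunks):
    """One thread per task: pays thread creation and teardown every time."""
    results = []
    lock = threading.Lock()

    def worker(chunk_id):
        result = encode_chunk(chunk_id)
        with lock:  # protect the shared list
            results.append(result)

    threads = [threading.Thread(target=worker, args=(cid,)) for cid in range(n_chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)  # sort by chunk_id, since completion order varies


def run_with_pool(n_chunks, n_workers=2):
    """Fixed pool: at most n_workers threads are created and then reused."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(encode_chunk, range(n_chunks)))


if __name__ == "__main__":
    pooled = run_with_pool(50, n_workers=2)
    distinct = {thread_id for _, _, thread_id in pooled}
    print(f"50 chunks handled by {len(distinct)} pooled thread(s)")
```

The pooled version handles all 50 chunks with at most two threads, whereas the first version creates and joins 50 of them. When each task is tiny, that per-thread setup and the synchronization around it can easily dominate the useful work.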