I was trying to use multithreaded Vorbis encoding, but in an Ubuntu guest running under VirtualBox the multithreaded test is actually slower with 2 threads than with 1. I took the svn theora-multithread version of the Theora encoder, as described in this question. My hardware is an Intel i7 Haswell with 2 cores, and I've configured VirtualBox for 2 CPUs. Why are the results not as expected? I expected multithreaded encoding to be faster, but it's much slower.
developer@developer-VirtualBox:~/theora-multithread/examples$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Stepping: 1
CPU MHz: 0.000
BogoMIPS: 1687.55
L1d cache: 32K
L1d cache: 32K
L2d cache: 6144K
NUMA node0 CPU(s): 0,1
developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 1 wavesound.wav tmp.yuv -o TEST-1-thread.ogg
File wavesound.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 1
Compressing....
0:46:32.08 audio: 66kbps video: 3kbps
done.
real 0m23.907s
user 0m12.319s
sys 0m1.623s
developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 2 wavesound.wav tmp.yuv -o TEST-2-thread.ogg
File wavesound.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 2
Compressing....
0:46:32.08 audio: 66kbps video: 3kbps
done.
real 1m7.882s
user 0m22.370s
sys 0m33.304s
developer@developer-VirtualBox:~/theora-multithread/examples$
CPU-Z in the host OS (Win 8.1) reports the following about the hardware.
Processor 1 ID = 0
Number of cores 2 (max 8)
Number of threads 4 (max 16)
Name Intel Core i3/i5/i7 4xxx
Codename Haswell ULT
Specification Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Package (platform ID) Socket 1168 BGA (0x6)
CPUID 6.5.1
Extended CPUID 6.45
Core Stepping C0
Technology 22 nm
TDP Limit 28 Watts
Tjmax 100.0 °C
Core Speed 798.4 MHz
Multiplier x Bus Speed 8.0 x 99.8 MHz
Stock frequency 2800 MHz
Instructions sets MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3
L1 Data cache 2 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache 2 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache 2 x 256 KBytes, 8-way set associative, 64-byte line size
L3 cache 4 MBytes, 16-way set associative, 64-byte line size
FID/VID Control yes
When testing with a larger audio file (the video is just a dummy video created from a static PNG), the difference is smaller, though the 2-thread run is still slower.
developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 1 140715MIX.wav tmp.yuv -o OGGTEST-1-threadv1.ogg
File 140715MIX.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 1
Compressing....
0:53:30.76 audio: 73kbps video: 3kbps
done.
real 5m20.807s
user 3m21.943s
sys 0m6.727s
developer@developer-VirtualBox:~/theora-multithread/examples$ time ./encoder_example --number-of-threads 2 140715MIX.wav tmp.yuv -o OGGTEST-2-threadv2.ogg
File 140715MIX.wav is 16 bit 2 channel 44100 Hz RIFF WAV audio.
File tmp.yuv is 48x48 25.00 fps YUV12 video.
Number of Threads: 2
Compressing....
0:53:30.76 audio: 73kbps video: 3kbps
done.
real 6m8.159s
user 3m45.750s
sys 0m27.579s
When testing video only, I could reproduce a speedup with more threads, although only a small one:
developer@developer-VirtualBox:~/theora-multithread$ time ./examples/encoder_example -v 1 -a 1 --number-of-threads 1 stream.yuv > theora_testfile_1.ogg
File stream.yuv is 320x240 15.00 fps YUV12 video.
Number of Threads: 1
Compressing....
0:00:07.60 audio: 0kbps video: 138kbps
done.
real 0m2.136s
user 0m1.920s
sys 0m0.083s
developer@developer-VirtualBox:~/theora-multithread$ time ./examples/encoder_example -v 1 -a 1 --number-of-threads 2 stream.yuv > theora_testfile_2.ogg
File stream.yuv is 320x240 15.00 fps YUV12 video.
Number of Threads: 2
Compressing....
0:00:07.60 audio: 0kbps video: 139kbps
done.
real 0m2.043s
user 0m1.994s
sys 0m0.175s
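If I had to guess, it's because the overhead of creating the threads and context switching between them costs more than the work being parallelized. Your timings point the same way: in the first test, sys time goes from 1.6 s with one thread to 33.3 s with two, so most of the extra CPU time is spent in the kernel rather than in the encoder.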
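To make the overhead visible, here is a small sketch that computes what share of the CPU time each run spent in the kernel. The numbers are copied from the `time` output in the question; the dictionary layout and the `sys_share` helper are only illustrative names, not part of any tool.

```python
# Timings copied from the `time` output in the question, converted to seconds.
runs = {
    "short audio, 1 thread": {"real": 23.907, "user": 12.319, "sys": 1.623},
    "short audio, 2 threads": {"real": 67.882, "user": 22.370, "sys": 33.304},
    "long audio, 1 thread": {"real": 320.807, "user": 201.943, "sys": 6.727},
    "long audio, 2 threads": {"real": 368.159, "user": 225.750, "sys": 27.579},
}


def sys_share(t):
    """Fraction of total CPU time (user + sys) that was spent in the kernel."""
    return t["sys"] / (t["user"] + t["sys"])


for name, t in runs.items():
    cpu = t["user"] + t["sys"]
    print(f"{name}: cpu={cpu:.1f}s, sys share={sys_share(t):.0%}")
```

For the short audio file the kernel share rises from roughly 12% with one thread to roughly 60% with two. That pattern is what you would expect if the threads spend their time synchronizing and being rescheduled rather than encoding, although the numbers alone do not prove which of those causes it.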
Keep in mind that kernel threads are much more expensive than user threads. If you can, avoid kernel-level threading.
For better performance, try to execute larger tasks concurrently and avoid operations that trigger context switching (like waiting on resources or blocking).
Also, reuse thread resources. Creating a new thread for each task can hurt the performance of your application. Pooling threads helps avoid the overhead of creating them.
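The difference between a thread per task and a pool can be sketched as follows. This is a minimal illustration in Python, not how the theora-multithread encoder is implemented: `encode_chunk` is a hypothetical stand-in for real work, and the chunk sizes are arbitrary. In CPython the global interpreter lock also prevents CPU-bound threads from running in parallel, so this shows only the reuse pattern, not a speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def encode_chunk(chunk):
    """Hypothetical stand-in for the real work done on one chunk of samples."""
    return sum(x * x for x in chunk)


def thread_per_task(chunks):
    """Anti-pattern: create and start a new thread for every small task."""
    results = [None] * len(chunks)

    def worker(i):
        results[i] = encode_chunk(chunks[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(chunks))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def pooled(chunks, workers=2):
    """Reuse a fixed set of worker threads for all tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with chunks.
        return list(pool.map(encode_chunk, chunks))


# 20 small chunks of 100 integers each (arbitrary demo data).
chunks = [list(range(i, i + 100)) for i in range(0, 2000, 100)]
print(thread_per_task(chunks) == pooled(chunks))
```

Both functions return the same results. The pooled version creates only `workers` threads no matter how many chunks there are, which is the point of the advice above.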