Could anyone help me with this problem?
I'm trying to achieve more efficiency that's why I tried to parallel my calculations. After a few tests the results showed me that nothing is faster than calculation on 1 thread. its only 25% of processor load in both cases (1 thread and 4 threads). Could any one know why it's happening? Is there anything I can do to achieve 100% of efficiency (even 90% be better than 25%)?
Below you can see an example code:
ToolsThread = class(TThread)
public
procedure Execute(); override;
procedure QuickSortT(var dict: TArray<AnsiString>; iLo, iHi: Integer);
Procedure QSortT(var dict: TArray<AnsiString>);
constructor Create();
var
tab : TArray<AnsiString>;
tmp1: Longint;
end;
procedure ToolsThread.QuickSortT(var dict: TArray<AnsiString>; iLo, iHi: Integer);
var
Lo, Hi: Longint;
Pivot: Pointer;
T: Pointer;
begin
Lo := iLo;
Hi := iHi;
Pivot := pointer(dict[(Lo + Hi) shr 1]); // shr 1 is slightly faster than div 2;
repeat
while dict[Lo] < AnsiString(Pivot) do Inc(Lo);
while dict[Hi] > AnsiString(Pivot) do Dec(Hi);
if Lo <= Hi then
begin
T := pointer(dict[Lo]);
pointer(dict[Lo]) := pointer(dict[Hi]);
pointer(dict[Hi]) := T;
Inc(Lo) ;
Dec(Hi) ;
end;
until Lo > Hi;
if Hi > iLo then QuickSort(dict, iLo, Hi) ;
if Lo < iHi then QuickSort(dict, Lo, iHi) ;
end;
Procedure ToolsThread.QSortT(var dict: TArray<AnsiString>);
begin
QuickSort(dict, 0, Length(dict)-1);
end;
procedure ToolsThread.Execute();
var
tmp1, tmp2 : Longint;
dict: TArray<AnsiString>;
begin
SetLength(dict, 10000000);
for tmp1:= 0 to 10000000-1 do
dict[tmp1] := IntToStr(Random(high(integer)));
QSortT(dict);
end;
Procedure Main;
var
Th1, Th2, Th3, Th4: ToolsThread;
begin
Th1 := ToolsThread.Create();
Th2 := ToolsThread.Create();
Th3 := ToolsThread.Create();
Th4 := ToolsThread.Create();
debug('Start THR');
Th1.Start;
Th2.Start;
Th3.Start;
Th4.Start;
th1.WaitFor;
th2.WaitFor;
th3.WaitFor;
th4.WaitFor;
debug('THR Done');
end;
Corrected according suggestions. Still 25% CPU load (5-8% per thread)
SOLVED! There is a general problem with some Delphi memory management in multiprocessing. It's not fastMM4 problem and it is resolvable only as workaround for now.
First of all, use string
everywhere. Don't mess with AnsiString
and string
types - unless you are with a pre-Delphi 2009... And is you use AnsiString
, then IntToStr()
will return a plain string
which will then be reallocated and converted into an AnsiString
when put in the array...
From what I can see in your code, my guess is that most of the time is spent into calling IntToStr()
, then converting the returned string
into AnsiString
, which will involve the heap manager, which will be mostly serialized in the process. Of course, it is a quick guess, with my "brain profiler" (tm) which uses to be wrong.
A QuickSort algorithm is O(n*log(n)) which is faster than O(n) calls of IntToStr()
from multiple threads.
Try putting only QSortT(dict)
in the Execute
method. It should be almost linear.
Another guess is that the following lines are not optimal. They will involve reference counting when swapping strings (assigning strings is in fact a call to an hidden _UStrAsg()
low function), which takes a lot of CPU power due to a "lock" on the reference counter. If multiple threads share some CPU cache during the "lock" (which may occur in your case, since IntToStr
allocations are done in parallel), then this "lock" would create some cache contention, which could take some noticeable time, even on newer CPUs.
The following may be faster (to be verified):
var T: pointer;
...
if Lo <= Hi then
begin
T := pointer(dict[Lo]);
pointer(dict[Lo]) := pointer(dict[Hi]);
pointer(dict[Hi]) := T;
Inc(Lo) ;
Dec(Hi) ;
end;
You may also replace pivot
with a pointer
but it is a bit more tricky:
var pivot: pointer;
...
repeat
Pivot := pointer(dict[(Lo + Hi) shr 1]); // within the 2nd repeat..until
while dict[Lo] < string(Pivot) do Inc(Lo) ;
while dict[Hi] > string(Pivot) do Dec(Hi) ;
Idea is always to use real profiling in such cases. At least, use precision timer/watch in the code you want to measure, and see where the CPU goes into.
Edit: I have written some code using my advices - see https://gist.github.com/synopse/02eb142d35cb44126aed9fd3200a76d1
output is on a 2 core CPU:
Fill done in 3.25s
Sort done in 24.11s