I have a function that computes a Lorentzian profile given freq, fwhm, and amp. I want to vectorize it so that it does the computation for a list of freqs, fwhms, and amps:
import numpy as np

def lorz1(freq_series, freq, fwhm, amp):
    # broadcasts to shape (len(freq_series), len(freq)); peaks along axis 1
    numerator = fwhm
    denominator = (2*np.pi) * ((freq_series[:,None] - freq)**2 + fwhm**2/4)
    lor = numerator / denominator
    main_peak = amp*(lor/np.linalg.norm(lor, axis=0))
    return np.sum(main_peak, axis=1)

def lorz2(freq_series, freq, fwhm, amp):
    # broadcasts to shape (len(freq), len(freq_series)); peaks along axis 0
    numerator = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor = numerator / denominator
    main_peak = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)

def lorz3(freq_series, freq, fwhm, amp):
    # scalar freq, fwhm, amp: computes a single peak
    numerator = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor = numerator / denominator
    main_peak = amp*(lor/np.linalg.norm(lor))
    return main_peak
series = np.linspace(0,100,50000)
freq = np.random.uniform(5,50,50)
fwhm = np.random.uniform(0.01,0.05,50)
amps = np.random.uniform(5,500,50)
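A quick sanity check that all three implementations agree, up to floating-point tolerance:

ref = np.sum(np.array([lorz3(series, f, w, a)
                       for f, w, a in zip(freq, fwhm, amps)]), axis=0)
assert np.allclose(lorz1(series, freq, fwhm, amps), ref)
assert np.allclose(lorz2(series, freq, fwhm, amps), ref)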
Timing:
%timeit lorz1(series, freq, fwhm, amps)
38.4 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit lorz2(series, freq, fwhm, amps)
29.8 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(np.array([lorz3(series, item1, item2, item3)
                         for (item1,item2,item3) in zip(freq, fwhm, amps)]), axis=0)
24.1 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Where am I going wrong with the vectorization in lorz1 and lorz2? Aren't they supposed to be faster than lorz3?
I did some further profiling using two versions:
Version 1:
#!/usr/bin/env python3
import numpy as np
def lorz2(freq_series, freq, fwhm, amp):
    numerator = fwhm[:,None]
    denominator = (2*np.pi) * ((freq_series - freq[:,None])**2 + fwhm[:,None]**2/4)
    lor = numerator / denominator
    main_peak = amp[:,None]*(lor/np.linalg.norm(lor, axis=1)[:,None])
    return np.sum(main_peak, axis=0)
series = np.linspace(0,100,50000)
freq = np.random.uniform(5,50,50)
fwhm = np.random.uniform(0.01,0.05,50)
amps = np.random.uniform(5,500,50)
for _ in range(100):
    lorz2(series, freq, fwhm, amps)
and version 2:
#!/usr/bin/env python3
import numpy as np
def lorz3(freq_series, freq, fwhm, amp):
    numerator = fwhm
    denominator = (2*np.pi) * ((freq_series - freq)**2 + fwhm**2/4)
    lor = numerator / denominator
    main_peak = amp*(lor/np.linalg.norm(lor))
    return main_peak
series = np.linspace(0,100,50000)
freq = np.random.uniform(5,50,50)
fwhm = np.random.uniform(0.01,0.05,50)
amps = np.random.uniform(5,500,50)
for _ in range(100):
    sum(lorz3(series, item1, item2, item3)
        for (item1,item2,item3) in zip(freq, fwhm, amps))
Notice how I tweaked the summation for lorz3 into a plain old Python sum. This is faster in my tests since it avoids the temporary array construction.
Here are the results of some profiling I did:
perf stat -ddd ./lorz2.py
Performance counter stats for './lorz2.py':
2729.16 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
217114 page-faults:u # 79.554 K/sec
8192141440 cycles:u # 3.002 GHz (38.43%)
3178961202 instructions:u # 0.39 insn per cycle (46.12%)
426575242 branches:u # 156.303 M/sec (53.81%)
2177628 branch-misses:u # 0.51% of all branches (61.51%)
42020185035 slots:u # 15.397 G/sec (69.20%)
323473974 topdown-retiring:u # 0.6% Retiring (69.20%)
33616148028 topdown-bad-spec:u # 67.1% Bad Speculation (69.20%)
371211166 topdown-fe-bound:u # 0.7% Frontend Bound (69.20%)
15767347418 topdown-be-bound:u # 31.5% Backend Bound (69.20%)
813550722 L1-dcache-loads:u # 298.096 M/sec (69.19%)
546814255 L1-dcache-load-misses:u # 67.21% of all L1-dcache accesses (69.21%)
82889242 LLC-loads:u # 30.372 M/sec (69.22%)
67633317 LLC-load-misses:u # 81.59% of all LL-cache accesses (69.24%)
<not supported> L1-icache-loads:u
9705348 L1-icache-load-misses:u # 0.00% of all L1-icache accesses (30.81%)
864895659 dTLB-loads:u # 316.909 M/sec (30.79%)
117310 dTLB-load-misses:u # 0.01% of all dTLB cache accesses (30.78%)
<not supported> iTLB-loads:u
85530 iTLB-load-misses:u # 0.00% of all iTLB cache accesses (30.76%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
2.729696014 seconds time elapsed
1.932708000 seconds user
0.796504000 seconds sys
And here is the faster version:
Performance counter stats for './lorz3.py':
878.49 msec task-clock:u # 0.999 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
52869 page-faults:u # 60.182 K/sec
3704170220 cycles:u # 4.217 GHz (38.22%)
3735225800 instructions:u # 1.01 insn per cycle (45.96%)
568575253 branches:u # 647.221 M/sec (53.70%)
2580294 branch-misses:u # 0.45% of all branches (61.43%)
18355798588 slots:u # 20.895 G/sec (69.17%)
3328525030 topdown-retiring:u # 17.5% Retiring (69.17%)
6982401815 topdown-bad-spec:u # 36.6% Bad Speculation (69.17%)
1297505291 topdown-fe-bound:u # 6.8% Frontend Bound (69.17%)
7459691283 topdown-be-bound:u # 39.1% Backend Bound (69.17%)
858082535 L1-dcache-loads:u # 976.773 M/sec (69.28%)
430569310 L1-dcache-load-misses:u # 50.18% of all L1-dcache accesses (69.40%)
15723297 LLC-loads:u # 17.898 M/sec (69.49%)
73709 LLC-load-misses:u # 0.47% of all LL-cache accesses (69.50%)
<not supported> L1-icache-loads:u
38705486 L1-icache-load-misses:u # 0.00% of all L1-icache accesses (30.72%)
860276161 dTLB-loads:u # 979.270 M/sec (30.60%)
86213 dTLB-load-misses:u # 0.01% of all dTLB cache accesses (30.51%)
<not supported> iTLB-loads:u
91069 iTLB-load-misses:u # 0.00% of all iTLB cache accesses (30.50%)
<not supported> L1-dcache-prefetches:u
<not supported> L1-dcache-prefetch-misses:u
0.878946776 seconds time elapsed
0.852205000 seconds user
0.026744000 seconds sys
Notice how the number of instructions is actually slightly higher in the faster code, which makes sense since it is less vectorized, but the much higher instructions-per-cycle rate makes it faster overall. The slower version also does about five times as many LLC loads, most of which miss, while in the faster one almost all of them hit. I'm not sure how to interpret the topdown-bad-spec counter; maybe someone else can comment on that.
The CPU even clocks down (this is reproducible), which supports the idea that it is simply waiting on memory.
Further, notice the sys time in the last line. lorz2 spends 28% of its runtime in kernel space. Since it doesn't do anything IO-related, that is all memory allocation and deallocation overhead.
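One way to test (and reduce) that allocation cost is to preallocate a single scratch array and route the element-wise steps through the ufuncs' out= parameter, so repeated calls reuse the same memory. A sketch that computes the same result as lorz2; lorz2_inplace and buf are my names:

def lorz2_inplace(freq_series, freq, fwhm, amp, buf=None):
    # one (len(freq), len(freq_series)) scratch array, reused across calls
    if buf is None:
        buf = np.empty((freq.size, freq_series.size))
    np.subtract(freq_series, freq[:,None], out=buf)   # freq_series - freq
    np.square(buf, out=buf)
    buf += 0.25 * fwhm[:,None]**2                     # only small temporaries
    buf *= 2*np.pi
    np.divide(fwhm[:,None], buf, out=buf)             # buf now holds lor
    return (amp / np.linalg.norm(buf, axis=1)) @ buf

Passing the same buf into each of the 100 benchmark calls should cut most of the page faults and the sys-time share.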
We can look a bit further at the stall reasons:
perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz2.py
Performance counter stats for './lorz2.py':
8446540078 cycles:u
1953955881 l1d_pend_miss.l2_stall:u
3050292324 cycle_activity.stalls_l3_miss:u
2.748141433 seconds time elapsed
1.959570000 seconds user
0.788443000 seconds sys
perf stat -e cycles,l1d_pend_miss.l2_stall,cycle_activity.stalls_l3_miss ./lorz3.py
Performance counter stats for './lorz3.py':
3674547216 cycles:u
303870088 l1d_pend_miss.l2_stall:u
16939496 cycle_activity.stalls_l3_miss:u
0.869909182 seconds time elapsed
0.848122000 seconds user
0.021752000 seconds sys
So, the lorz2 version just stalls constantly on level 2 or 3 cache misses. That matches the array sizes: each (50, 50000) float64 intermediate in lorz2 occupies 20 MB, far beyond a typical L3 cache, while each 50000-element array in lorz3 is only 400 kB.
We can further look at a simple perf report:
perf record ./lorz2.py
perf report
# Overhead Command Shared Object Symbol
# ........ ....... ................................................. ...............................................
#
32.44% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_multiply
27.03% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_divide
17.34% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_add
6.93% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_subtract
6.12% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_pairwise_sum
5.32% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_square
0.51% python3 [unknown] [k] 0xffffffffb2800fe7
0.46% python3 libpython3.11.so.1.0 [.] _PyEval_EvalFrameDefault
0.19% python3 libpython3.11.so.1.0 [.] unicodekeys_lookup_unicode
0.18% python3 libpython3.11.so.1.0 [.] gc_collect_main
...
perf record ./lorz3.py
perf report
# Overhead Command Shared Object Symbol
# ........ ....... ................................................. .............................................
#
27.56% python3 libblas.so.3.11.0 [.] ddot_
27.47% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_divide
8.64% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_subtract
8.34% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_add
5.84% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_multiply
3.70% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] DOUBLE_square
1.47% python3 libpython3.11.so.1.0 [.] _PyEval_EvalFrameDefault
1.38% python3 libgcc_s.so.1 [.] execute_cfa_program
0.89% python3 libgcc_s.so.1 [.] uw_update_context_1
0.88% python3 libgcc_s.so.1 [.] _Unwind_Find_FDE
0.64% python3 libgcc_s.so.1 [.] uw_frame_state_for
0.61% python3 _multiarray_umath.cpython-311-x86_64-linux-gnu.so [.] ufunc_generic_fastcall
...
Huh, that's interesting. Where does the dot product come from? I assume this is how linalg.norm is implemented for simple vectors.
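That guess is easy to check: for a 1-D array, the default 2-norm is the square root of the vector's dot product with itself, which NumPy can hand off to BLAS. A minimal check (my example):

v = np.random.rand(1000)
assert np.allclose(np.linalg.norm(v), np.sqrt(v @ v))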
Incidentally, we can speed up the lorz2 and lorz3 versions slightly via three measures: multiplying by the constant 0.5 / np.pi instead of dividing by 2*np.pi, multiplying by 0.25 instead of dividing by 4, and folding the amplitude scaling, normalization, and summation into fewer array passes (a single matrix-vector product in lorz2a):
def lorz2a(freq_series, freq, fwhm, amp):
    numerator = fwhm[:,None] * (0.5 / np.pi)
    denominator = (freq_series - freq[:,None])**2 + 0.25 * fwhm[:,None]**2
    lor = numerator / denominator
    # (amp / norm) @ lor scales and sums the rows in one BLAS call
    return (amp / np.linalg.norm(lor, axis=1)) @ lor

def lorz3a(freq_series, freq, fwhm, amp):
    numerator = fwhm * (0.5 / np.pi)
    denominator = (freq_series - freq)**2 + 0.25 * fwhm**2
    lor = numerator / denominator
    # amp / norm is a scalar, so only one full-size multiply remains
    main_peak = amp / np.linalg.norm(lor) * lor
    return main_peak
This does not change the overall trends, however.
NumPy vectorization primarily helps reduce per-call overhead. Once the arrays are large enough, we don't gain much from it, since the remaining interpreter overhead is small compared to the computation itself. At the same time, larger arrays reduce memory efficiency. Typically there is a sweet spot somewhere around the L2 or L3 cache size. The lorz3 implementation hits this spot better than the others.
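If we want the batched interface of lorz2 but the cache behavior of lorz3, a compromise is to process the peaks in blocks small enough that each intermediate stays cache-sized. A sketch (lorz2_chunked and the block size are my choices, untuned); it is exact because the norm in lorz2a is taken per row:

def lorz2_chunked(freq_series, freq, fwhm, amp, chunk=8):
    # each block builds a (chunk, len(freq_series)) intermediate,
    # e.g. 8 x 50000 doubles = 3.2 MB, which fits comfortably in L3
    out = np.zeros_like(freq_series)
    for i in range(0, freq.size, chunk):
        s = slice(i, i + chunk)
        out += lorz2a(freq_series, freq[s], fwhm[s], amp[s])
    return out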
For a smaller series size and a larger size of the other arrays, we can expect lorz2 to perform better. For example, this data set makes my lorz2a faster than my lorz3a:
series = np.linspace(0,100,1000)
freq = np.random.uniform(5,50,2000)
fwhm = np.random.uniform(0.01,0.05,2000)
amps = np.random.uniform(5,500,2000)
NumPy's simple, eager evaluation scheme puts the onus of tuning for this on the user. Other libraries, like NumExpr, try to avoid this by compiling whole expressions into fused kernels that are evaluated in cache-sized blocks.
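For illustration, a sketch of what that could look like here (assuming the numexpr package; lorz2_ne is my name, and I leave the norm and the final matrix product in plain NumPy):

import numexpr as ne
import numpy as np

def lorz2_ne(freq_series, freq, fwhm, amp):
    f = freq[:,None]
    w = fwhm[:,None]
    pi = np.pi
    # one fused, blocked pass over the (len(freq), len(freq_series)) grid,
    # instead of one full-size temporary per numpy operator
    lor = ne.evaluate("w / (2*pi * ((freq_series - f)**2 + w**2/4))")
    return (amp / np.linalg.norm(lor, axis=1)) @ lor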