I get the same results in both Julia and Python: Singular Value Decomposition (SVD) is slower on the GPU than on the CPU for Float64 arrays. (Float32 arrays behave as one would expect, with the GPU being faster.)
Python benchmark code using PyTorch:
import time
import torch
from torch.linalg import svd
f64 = torch.double
f32 = torch.float
cpu = torch.device("cpu")
gpu = torch.device("cuda")
# X = torch.rand(5_000, 10_000, dtype=f32)
X = torch.rand(5_000, 10_000, dtype=f64)
X_cpu = X.to(cpu)
X_gpu = X.to(gpu)
print(X_cpu.type())
# Warmup
U, Sig, Vt = svd(X_cpu, full_matrices = True)
# Timed run (CPU)
t1 = time.perf_counter()
U, Sig, Vt = svd(X_cpu, full_matrices = True)
t2 = time.perf_counter()
print(U.type())
print(X_cpu.type())
# Warmup
U, Sig, Vt = svd(X_gpu, full_matrices = True)
# Timed run (GPU)
t3 = time.perf_counter()
U, Sig, Vt = svd(X_gpu, full_matrices = True)
t4 = time.perf_counter()
print(U.type())
print(f"Time CPU (s): {t2-t1}")
print(f"Time GPU (s): {t4-t3}")
For the above Float64 array I get:
Time CPU (s): 14.52491476599971
Time GPU (s): 56.79755901500175
If I use a Float32 array instead, I get much more reasonable-looking numbers:
Time CPU (s): 9.301500292000128
Time GPU (s): 6.969021153003268
It is still a little surprising, though, that using the GPU only speeds things up by a couple of seconds.
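A more careful GPU timing would probably synchronize the device before reading the clock, so the timer only stops once any queued kernels have actually finished. A minimal variation on the timed block above:
# Timed run (GPU), synchronizing so queued work is included in the measurement
torch.cuda.synchronize()
t3 = time.perf_counter()
U, Sig, Vt = svd(X_gpu, full_matrices = True)
torch.cuda.synchronize()
t4 = time.perf_counter()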
The Julia code gives similar results:
using LinearAlgebra
using Flux
using CUDA
using cuDNN
X = rand(5_000, 10_000)
println("typeof(X): $(typeof(X))")
# Warmup
U, Sig, V = LinearAlgebra.svd(X)
# Timed run
t1 = time_ns()
U, Sig, V = LinearAlgebra.svd(X)
t2 = time_ns()
println("typeof(U): $(typeof(U))")
X_gpu = X |> gpu |> f64
println("typeof(X_gpu): $(typeof(X_gpu))")
# Warmup (svd! overwrites its input, so warm up on a copy)
U, Sig, V = CUDA.svd!(copy(X_gpu))
# Timed run
t3 = time_ns()
U, Sig, V = CUDA.svd!(X_gpu)
t4 = time_ns()
println("typeof(U): $(typeof(U))")
println("Time CPU (s): $((t2-t1)/1e9)")
println("Time GPU (s): $((t4-t3)/1e9)")
For this Float64 array the GPU again takes much longer than the CPU:
Time CPU (s): 28.641290506
Time GPU (s): 57.069009417
However, switching to Float32 again yields a reasonable result:
Time CPU (s): 15.096364932
Time GPU (s): 7.283513658
Two questions:
1. Why is the Float64 SVD so much slower on the GPU than on the CPU?
2. Is there anything I can do to get better GPU performance for Float64 arrays?
I also tried passing full_matrices=False to PyTorch's svd function, but I got the same results.

All consumer-level NVidia GPUs have limited support for 64-bit floating-point operations.
While every warp has 32 float32 cores at its disposal, it has only a single float64 core.
In my experiments I typically find that float64 code runs about 8x slower (more if the floating-point operations are heavily optimized). The slowdown is usually not the full 32x you might expect, because latency, memory-fetch times, and so on also limit throughput.
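If you want to see the ratio on your own card, one rough way is to time the same large matrix multiplication in both precisions. A minimal PyTorch sketch, where the matrix size and repetition count are arbitrary (and on Ampere or newer, the float32 path may additionally use TF32 tensor cores depending on torch.backends.cuda.matmul.allow_tf32):
import torch

def matmul_ms(dtype, n=4096, reps=10):
    # Average milliseconds for one n x n GPU matmul in the given dtype
    a = torch.rand(n, n, dtype=dtype, device="cuda")
    b = torch.rand(n, n, dtype=dtype, device="cuda")
    a @ b  # warmup (cuBLAS initialization, kernel selection)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(reps):
        a @ b
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / reps

t32 = matmul_ms(torch.float)
t64 = matmul_ms(torch.double)
print(f"float64 is roughly {t64 / t32:.1f}x slower than float32 here")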
On consumer GPUs you really want to avoid double. If you genuinely need the extra precision (which is rare), look at Kahan summation.
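For reference, here is a minimal sketch of Kahan (compensated) summation. It uses NumPy float32 scalars on the CPU purely to show the algorithm; on a GPU you would apply the same idea inside a kernel or use a library that does it for you:
import numpy as np

def kahan_sum(values):
    # Compensated summation: carry the low-order bits lost by each addition
    total = np.float32(0.0)
    comp = np.float32(0.0)
    for x in values:
        y = np.float32(x) - comp
        t = total + y
        comp = (t - total) - y  # recovers what was just rounded away from y
        total = t
    return total

xs = np.random.rand(100_000).astype(np.float32)
naive = np.float32(0.0)
for x in xs:
    naive = naive + x
print("naive float32:", naive)
print("kahan float32:", kahan_sum(xs))
print("float64 reference:", xs.astype(np.float64).sum())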
NVidia documents this in the CUDA programming guide. The table below gives the number of results per clock cycle per SM for each compute capability:
Throughput/cycle/SM | 5.0, 5.2 | 5.3 | 6.0 | 6.1 | 6.2 | 7.x | 8.0 | 8.6 | 8.9 | 9.0 |
---|---|---|---|---|---|---|---|---|---|---|
16-bit floating-point | N/A | 256 | 128 | 2 | 256 | 128 | 256 | 256 | 128 | 256 |
32-bit floating-point | 128 | 128 | 64 | 128 | 128 | 64 | 64 | 128 | 128 | 128 |
64-bit floating-point | 4 | 4 | 32 | 4 | 4 | 32 | 32 | 2 | 2 | 64 |
Note how on all non-server hardware the double throughput is severely limited: on compute capability 8.6 and 8.9, for example, an SM produces 128 float32 results per cycle but only 2 float64 results, a 64x gap. Restricting the chip real-estate devoted to rarely used 64-bit operations lets NVidia build consumer GPUs on smaller, cheaper dies (or, alternatively, spend that silicon on 32-bit performance).
The compute capability numbers refer to the microarchitecture: 8.9 is Ada Lovelace (the RTX 4000 series), 8.6 is Ampere (the RTX 3000 series), and so on; see https://en.wikipedia.org/wiki/CUDA
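You can check what your own card reports directly from PyTorch:
import torch
print(torch.cuda.get_device_name(0))        # e.g. "NVIDIA GeForce RTX 3080"
print(torch.cuda.get_device_capability(0))  # e.g. (8, 6) for an RTX 3000-series card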
As for better performance, have a look at different libraries, especially Python wrappers around native CUDA libraries (such as cuSOLVER) that are optimized for the GPU.
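For example, CuPy wraps cuSOLVER's SVD fairly directly. A minimal sketch, assuming CuPy is installed for your CUDA version (it is still subject to the same float64 throughput limits, so float32 is used here):
import cupy as cp

X_gpu = cp.random.rand(5_000, 10_000, dtype=cp.float32)
U, Sig, Vt = cp.linalg.svd(X_gpu, full_matrices=False)
cp.cuda.Device().synchronize()  # wait for the decomposition to finish
print(U.shape, Sig.shape, Vt.shape)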