
No difference in search speed between CPU and GPU indexes on a dataset of one million vectors


I have one million vectors with dimension 1536, and I would like to use the GPU to speed up vector search.

Resource Information

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          104
On-line CPU(s) list:             0-103
Thread(s) per core:              2
Core(s) per socket:              26
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
Stepping:                        6
CPU MHz:                         2800.167
BogoMIPS:                        4400.00
Virtualization:                  VT-x
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:17:00.0 Off |                    0 |
| N/A   40C    P0             64W /  300W |       1MiB /  81920MiB |      1%      Default |
|                                         |                        |             Disabled |
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   42C    P0             69W /  300W |       1MiB /  81920MiB |      3%      Default |
|                                         |                        |             Disabled |

My steps

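from pymilvus import MilvusClient, DataType
import time
import numpy as np
import string
import random

milvus_uri = ""
collection_name = ""
client = MilvusClient(uri=milvus_uri)

search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 32},
}
filter_expr = ""  # the actual filter expression was not included in the post
vectors_to_search = [np.random.rand(1536).tolist()]
start_time = time.time()
# Reconstructed call: only the `filter` argument survived in the post.
result = client.search(
    collection_name=collection_name,
    data=vectors_to_search,
    filter=filter_expr,
    search_params=search_params,
)
end_time = time.time()
print(f"time cost {end_time-start_time}")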

GPU IVF FLAT, nprobe = 32

Concurrency    QPS
1              681
5              594
10             546

CPU IVF FLAT, nprobe = 32

Concurrency    QPS
1              680
5              609
10             580

My question:

Why does the GPU provide no speedup? Please help me check whether anything is wrong with the steps above.


  • For CPU index, the time cost of a search request includes the following:

    For GPU index, the time cost of a search request includes the following:

    As you can see, a GPU index incurs extra time for copying data between CPU memory and GPU memory.

    The advantage of GPU search shows up with large NQ (many query vectors per request), because the GPU has strong parallel computing capability.

    For small datasets and small-NQ searches, there is not much difference between a CPU index and a GPU index.
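
To make the trade-off concrete, here is a toy cost model, not a measurement. It assumes CPU time grows linearly with NQ and that a GPU request pays a fixed overhead (for example the host/device copies mentioned above) plus a smaller per-query cost. The function names and every number passed to them are hypothetical, the units are arbitrary, and real GPU behavior is not this linear.

```python
import math


def cpu_time(nq, per_query_cpu):
    """Toy model: CPU search time grows linearly with the number of queries."""
    return nq * per_query_cpu


def gpu_time(nq, per_query_gpu, fixed_overhead):
    """Toy model: fixed per-request overhead plus per-query work on the GPU."""
    return fixed_overhead + nq * per_query_gpu


def break_even_nq(per_query_cpu, per_query_gpu, fixed_overhead):
    """Smallest NQ at which the modelled GPU request is no slower than the CPU one."""
    if per_query_gpu >= per_query_cpu:
        return None  # in this model the GPU never catches up
    return math.ceil(fixed_overhead / (per_query_cpu - per_query_gpu))


if __name__ == "__main__":
    # Hypothetical costs in arbitrary units, chosen only for illustration.
    nq = break_even_nq(per_query_cpu=1.0, per_query_gpu=0.25, fixed_overhead=3.0)
    print(f"GPU matches CPU from NQ = {nq} in this toy model")
```

In this model a request with NQ = 1 is dominated by the fixed overhead, which is consistent with the near-identical QPS reported in the question. The overhead only becomes negligible once NQ is well above the break-even point.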

    A large NQ search is like this:

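    NQ = 10000
    target_vectors = []
    for i in range(NQ):
        # Loop body was lost in the post; a random 1536-dim vector is assumed here.
        target_vectors.append(np.random.rand(1536).tolist())
    # Call reconstructed; the original line was truncated after "results =".
    results = client.search(collection_name=collection_name, data=target_vectors, search_params=search_params)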

    You can try increasing the concurrency, or using a higher NQ value for each request.

    NQ = 100
    vectors_to_search = [np.random.rand(1536).tolist() for _ in range (NQ)]
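
One way to see where a higher NQ starts to pay off is to time the same search call at several batch sizes. The sketch below is an assumption-laden helper, not part of Milvus: `sweep_nq` and `search_fn` are hypothetical names, and `search_fn` stands in for whatever wraps the real `client.search(...)` call. It uses the standard-library `random` module instead of NumPy so it has no dependencies.

```python
import random
import time

DIM = 1536  # vector dimension from the question


def sweep_nq(search_fn, nq_values, dim=DIM, repeats=5, seed=0):
    """Time `search_fn` for each batch size and return {nq: vectors_per_second}."""
    rng = random.Random(seed)
    results = {}
    for nq in nq_values:
        # Build the batch before starting the clock so that vector
        # generation is not counted as search time.
        batch = [[rng.random() for _ in range(dim)] for _ in range(nq)]
        start = time.perf_counter()
        for _ in range(repeats):
            search_fn(batch)
        elapsed = time.perf_counter() - start
        # Guard against a zero reading on coarse timers.
        results[nq] = nq * repeats / max(elapsed, 1e-9)
    return results
```

With a real client this might be called as `sweep_nq(lambda batch: client.search(collection_name=collection_name, data=batch, search_params=search_params), [1, 10, 100, 1000])`. Note that the result counts query vectors per second, not requests per second, so it is not directly comparable to the QPS tables in the question.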

    In my opinion, to get higher QPS, you should generate the random vectors outside the loop. Pre-create a list of random vectors before the threads start, then pick random vectors from that list inside the loop.
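
The suggestion above could look roughly like this. It is a minimal sketch, not the benchmark script from the question (which was not posted in full): `make_pool`, `worker`, `run_benchmark`, and `search_fn` are hypothetical names, and `search_fn` is assumed to wrap the real `client.search(...)` call. The standard-library `random` module is used instead of NumPy to keep the example dependency-free.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

DIM = 1536  # vector dimension from the question


def make_pool(size, dim=DIM, seed=0):
    """Pre-create a pool of random query vectors before any thread starts."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(size)]


def worker(search_fn, pool, nq, requests, seed):
    """Send `requests` searches, each with `nq` vectors picked from the pool."""
    rng = random.Random(seed)  # per-thread generator, no shared state
    for _ in range(requests):
        batch = rng.choices(pool, k=nq)  # cheap: only picks references
        search_fn(batch)
    return requests * nq


def run_benchmark(search_fn, concurrency, nq, requests_per_thread, pool):
    """Return (total_query_vectors, query_vectors_per_second) across all threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [
            executor.submit(worker, search_fn, pool, nq, requests_per_thread, i)
            for i in range(concurrency)
        ]
        total = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return total, total / elapsed


if __name__ == "__main__":
    pool = make_pool(size=1000)

    def fake_search(batch):
        # Stand-in for the real call, e.g. client.search(..., data=batch, ...).
        time.sleep(0.001)

    total, rate = run_benchmark(fake_search, concurrency=5, nq=100,
                                requests_per_thread=20, pool=pool)
    print(f"{total} query vectors, {rate:.0f} vectors/s")
```

Because the pool is built once, the timed loop only picks references to existing vectors, so the measurement reflects the search call rather than random-number generation.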