Background
I have one million vectors with dimension 1536, and I hope to use the GPU to speed up vector query and search.
Resource Information
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz
Stepping: 6
CPU MHz: 2800.167
BogoMIPS: 4400.00
Virtualization: VT-x
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02 Driver Version: 555.42.02 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:17:00.0 Off | 0 |
| N/A 40C P0 64W / 300W | 1MiB / 81920MiB | 1% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:CA:00.0 Off | 0 |
| N/A 42C P0 69W / 300W | 1MiB / 81920MiB | 3% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
My steps
from pymilvus import MilvusClient, DataType
import time
import numpy as np
import string
import random

milvus_uri = ""
collection_name = ""
filter_expr = ""  # no filter is applied in this test

client = MilvusClient(uri=milvus_uri)
client.load_collection(collection_name)

search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 32},
}

# a single query vector (NQ = 1)
vectors_to_search = [np.random.rand(1536).tolist()]

start_time = time.time()
result = client.search(
    collection_name=collection_name,
    data=vectors_to_search,
    filter=filter_expr,
    anns_field="embeddings",
    search_params=search_params,
    limit=10,
    output_fields=["random"],
    consistency_level="Eventually",
)
end_time = time.time()
print(f"time cost {end_time - start_time}")
GPU_IVF_FLAT, nprobe = 32:

| Concurrency | QPS |
|---|---|
| 1 | 681 |
| 5 | 594 |
| 10 | 546 |

IVF_FLAT (CPU), nprobe = 32:

| Concurrency | QPS |
|---|---|
| 1 | 680 |
| 5 | 609 |
| 10 | 580 |
My question:
Why does the GPU not provide any acceleration? Please help me check whether anything is wrong with the above operations.
For a CPU index, the time cost of a search request is essentially the search computation itself.
For a GPU index, the request additionally has to copy the query data from CPU memory to GPU memory, and copy the results back after the GPU computation finishes.
So there are extra time costs for copying data between CPU memory and GPU memory.
The advantage of GPU search shows up with large-NQ searches, because the GPU's strength is parallel computing.
For small datasets and small-NQ searches, there is not much difference between a CPU index and a GPU index.
A large-NQ search looks like this:
NQ = 10000
topk = 10
# build a batch of NQ query vectors to send in a single request
target_vectors = []
for i in range(NQ):
    target_vectors.append(np.random.rand(1536).tolist())
results = collection.search(
    data=target_vectors,
    anns_field="xxx",
    param=search_params,
    limit=topk,
    consistency_level="Eventually",
)
You can try increasing the concurrency, or using a higher NQ value for each request, for example:
NQ = 100
vectors_to_search = [np.random.rand(1536).tolist() for _ in range(NQ)]
In my opinion, to get a higher QPS you'd better generate the random vectors outside the loop: pre-create a list of random vectors before the threads start, then pick vectors from that list inside the loop, so that the measured time is spent on search rather than on vector generation.
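A minimal sketch of that idea, assuming the same client, collection_name, and search_params as above (the pool size, thread count, and request count are arbitrary example values):

import threading

POOL_SIZE = 1000
# pre-create the random query vectors once, before any thread starts
vector_pool = [np.random.rand(1536).tolist() for _ in range(POOL_SIZE)]

NQ = 100                   # query vectors per request
THREADS = 10               # concurrency
REQUESTS_PER_THREAD = 100  # requests issued by each thread
counts = [0] * THREADS

def worker(tid):
    for _ in range(REQUESTS_PER_THREAD):
        # pick NQ vectors from the pre-created pool instead of generating them here
        data = random.sample(vector_pool, NQ)
        client.search(
            collection_name=collection_name,
            data=data,
            anns_field="embeddings",
            search_params=search_params,
            limit=10,
            consistency_level="Eventually",
        )
        counts[tid] += 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(THREADS)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"QPS: {sum(counts) / elapsed:.1f}")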