I am trying to compute Hamming distance on a DataType.BINARY_VECTOR field
in Milvus. Everything runs until the last step, client.search(), which
raises an error when I search with binary vectors.
My code is attached below.
from pymilvus import MilvusClient, DataType
from pathlib import Path
import numpy as np
DB_FILE = 'demo.db'
DIM = 4096
COLLECTION_NAME = 'dim_reduction'
METRIC_TYPE = 'HAMMING'
INDEX_TYPE = 'BIN_FLAT'
DATATYPE = DataType.BINARY_VECTOR
DTYPE = np.bool_
# Remove DB_FILE if exists
db_path = Path(DB_FILE)
if db_path.exists():
    db_path.unlink()
# Build client
client = MilvusClient(DB_FILE)
# Create schema
schema = MilvusClient.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name='text', datatype=DataType.VARCHAR, max_length=1024)
schema.add_field(field_name='vector', datatype=DATATYPE, dim=DIM)
# Create collection
client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
)
# Insert data
data = [
    {'id': i, 'vector': np.array([1] * DIM, dtype=DTYPE), 'text': f'doc {i}'}
    for i in range(100)
]
client.insert(collection_name=COLLECTION_NAME, data=data)
# Create index
index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    field_name='vector',
    metric_type=METRIC_TYPE,
    index_type=INDEX_TYPE,
)
client.create_index(
    collection_name=COLLECTION_NAME,
    index_params=index_params,
)
# Search
search_params = {
    'metric_type': METRIC_TYPE,
    'params': {},
}
result = client.search(
    collection_name=COLLECTION_NAME,
    data=[np.array([1] * DIM, dtype=DTYPE)],
    limit=2,
    search_params=search_params,
)
print(result)
Can someone please take a look at this for me? Thanks a lot!
I believe the error occurs because Milvus expects binary vectors as byte arrays rather than NumPy boolean arrays. Each byte packs 8 dimensions, so a 4096-dimensional vector becomes 512 bytes. I found this helper in one of the pymilvus examples that might be helpful to you (link: https://github.com/milvus-io/pymilvus/blob/f7a4839a8a6b05620985d25cde47b63247a561e7/examples/binary_example.py#L23):
import random
import numpy as np

def gen_binary_vectors(num, dim):
    raw_vectors = []
    binary_vectors = []
    for _ in range(num):
        raw_vector = [random.randint(0, 1) for _ in range(dim)]
        raw_vectors.append(raw_vector)
        # Pack the 0/1 values into a uint8 array (8 bits per byte), then convert to bytes
        binary_vectors.append(bytes(np.packbits(raw_vector, axis=-1).tolist()))
    return raw_vectors, binary_vectors
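Applied to the code in the question, the same packing could look roughly like the sketch below. The helper names to_binary_bytes and hamming_distance are my own and not part of pymilvus. The second function only illustrates what the HAMMING metric counts, and I have not run this against Milvus itself.

```python
import numpy as np

DIM = 4096  # same dimension as in the question


def to_binary_bytes(bits):
    """Pack a 0/1 (or boolean) vector into bytes, 8 dimensions per byte.

    Hypothetical helper, not a pymilvus function.
    """
    arr = np.asarray(bits, dtype=np.bool_)
    if arr.ndim != 1 or arr.size % 8 != 0:
        raise ValueError("expected a 1-D vector whose length is a multiple of 8")
    return np.packbits(arr).tobytes()


def hamming_distance(a, b):
    """Count the differing bits between two packed vectors of equal length."""
    xa = np.frombuffer(a, dtype=np.uint8)
    xb = np.frombuffer(b, dtype=np.uint8)
    return int(np.unpackbits(np.bitwise_xor(xa, xb)).sum())


# All-ones vector, as in the question, and a second vector whose first 8 bits are 0
query = to_binary_bytes(np.ones(DIM, dtype=np.bool_))
other = to_binary_bytes([0] * 8 + [1] * (DIM - 8))

print(len(query))                      # number of bytes, DIM // 8
print(hamming_distance(query, other))  # number of differing bits
```

In the question's script, the 'vector' values passed to client.insert() and the entries of data=[...] passed to client.search() would then be to_binary_bytes(...) results instead of np.bool_ arrays.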