I initialize a Qdrant collection in the following way:
from qdrant_client import QdrantClient, models

client = QdrantClient(location=":memory:")
my_collection = "my_collection"
# Start from a clean state, then create the collection if it is missing
client.delete_collection(my_collection)
if not client.collection_exists(my_collection):
    client.create_collection(
        collection_name=my_collection,
        vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE)
    )
I insert a Hugging Face dataset in the following way:
def insert_dataset_to_qdrant(dataset_to_process, client):
    np.save("vectors", np.array(dataset_to_process['embeddings']), allow_pickle=False)
    ids = list(range(dataset_to_process.num_rows))
    embeddings = np.load("vectors.npy").tolist()
    payload = dataset_to_process.select_columns([
        'text', 'postcard_id'
    ]).to_pandas().to_dict(orient="records")
    batch_size = 1000
    for i in range(0, dataset_to_process.num_rows, batch_size):
        low_idx = min(i+batch_size, dataset_to_process.num_rows)
        batch_of_ids = ids[i: low_idx]
        batch_of_embs = embeddings[i: low_idx]
        batch_of_payloads = payload[i: low_idx]
        client.upsert(
            collection_name=my_collection,
            points=models.Batch(
                ids=batch_of_ids,
                vectors=batch_of_embs,
                payloads=batch_of_payloads
            )
        )
I then insert several datasets:
dataset1.shape (9778, 5)
dataset2.shape (9678, 4)
dataset3.shape (6118, 4)
dataset4.shape (14314, 4)
dataset5.shape (12084, 4)
dataset6.shape (6202, 4)
dataset7.shape (18994, 4)
dataset8.shape (10760, 4)
but the following code
client.count(
    collection_name=my_collection,
    exact=True
)
gives the following figure: CountResult(count=18994)
Why is that? I expected the count to be the sum of the dataset sizes.
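The problem was that the ids repeated across datasets. Every call to insert_dataset_to_qdrant numbers its points from 0, and upsert replaces any existing point that has the same id, so each dataset overwrote part of the data that had been loaded previously. Only the ids 0 to 18993 ever exist, which is why the count equals the size of the largest dataset (dataset7, 18994 rows) rather than the sum.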
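To illustrate the overwrite described above, here is a minimal sketch that does not use Qdrant at all. It models the collection as a plain dict keyed by point id, which is an assumption made only for illustration, and the helper name simulate_upserts is hypothetical. The sizes are the row counts listed in the question. With ids restarting from 0 the number of stored points is the size of the largest dataset; with a running offset it is the sum.

```python
# Row counts of dataset1..dataset8, taken from the question.
sizes = [9778, 9678, 6118, 14314, 12084, 6202, 18994, 10760]


def simulate_upserts(sizes, use_offset):
    """Hypothetical stand-in for a collection: a dict keyed by point id.

    Assigning to an existing key replaces the value, which mirrors how an
    upsert replaces a point that already has the same id.
    """
    store = {}
    offset = 0
    for n in sizes:
        # Without an offset every dataset reuses ids 0..n-1.
        start = offset if use_offset else 0
        for point_id in range(start, start + n):
            store[point_id] = n
        offset += n
    return len(store)


count_repeating = simulate_upserts(sizes, use_offset=False)
count_offset = simulate_upserts(sizes, use_offset=True)

print(count_repeating)  # 18994, the size of the largest dataset
print(count_offset)     # 87928, the sum of all dataset sizes
```

In the insert function from the question, one way to apply the same idea would be to pass in a starting id and build ids as list(range(start, start + dataset_to_process.num_rows)), advancing start by num_rows after each dataset. Qdrant also accepts UUIDs as point ids, so generating a fresh UUID per row is another option if sequential integers are not needed.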