qdrant, qdrantclient

How does count correspond with the number of rows upserted?


I initialize a Qdrant collection in the following way:

import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(location=":memory:")
my_collection = "my_collection"
client.delete_collection(my_collection)
if not client.collection_exists(my_collection):
  client.create_collection(
      collection_name=my_collection,
      vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE)
  )

I insert a Hugging Face dataset in the following way:

def insert_dataset_to_qdrant(dataset_to_process, client):
  np.save("vectors", np.array(dataset_to_process['embeddings']), allow_pickle=False)
  ids = list(range(dataset_to_process.num_rows))
  embeddings = np.load("vectors.npy").tolist()
  payload = dataset_to_process.select_columns([
      'text', 'postcard_id'
  ]).to_pandas().to_dict(orient="records")

  batch_size = 1000

  for i in range(0, dataset_to_process.num_rows, batch_size):
      # low_idx is the exclusive end index of the current batch
      low_idx = min(i+batch_size, dataset_to_process.num_rows)

      batch_of_ids = ids[i: low_idx]
      batch_of_embs = embeddings[i: low_idx]
      batch_of_payloads = payload[i: low_idx]

      client.upsert(
          collection_name=my_collection,
          points=models.Batch(
              ids=batch_of_ids,
              vectors=batch_of_embs,
              payloads=batch_of_payloads
          )
      )

I then insert several datasets:

dataset1.shape (9778, 5)
dataset2.shape (9678, 4)
dataset3.shape (6118, 4)
dataset4.shape (14314, 4)
dataset5.shape (12084, 4)
dataset6.shape (6202, 4)
dataset7.shape (18994, 4)
dataset8.shape (10760, 4)

but the following code

client.count(
    collection_name=my_collection,
    exact=True
)

gives the following figure: CountResult(count=18994)

Why is that? I expected the count to be the sum of the row counts above (87928).


Solution

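  • The problem was that the ids repeated across datasets: every call to insert_dataset_to_qdrant numbers its points from 0, and upsert overwrites any existing point that has the same id. Each new dataset therefore replaced part of the previously loaded data, and the final count equals the number of distinct ids, 18994, which is the size of the largest dataset (dataset7).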
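
The effect can be reproduced without Qdrant. The sketch below is only an illustration: it uses a plain dict as a stand-in for the collection (an assumption, not the Qdrant API), because writing to an existing key overwrites the stored value, which matches the overwrite behaviour described above. The helper name `upsert` and the running `next_id` offset are hypothetical names introduced for this example.

```python
# Stand-in for a collection: a dict keyed by point id, so writing an
# existing id overwrites the stored point, as an upsert does.
def upsert(collection, ids, payloads):
    for point_id, payload in zip(ids, payloads):
        collection[point_id] = payload

# Row counts of the eight datasets from the question.
dataset_sizes = [9778, 9678, 6118, 14314, 12084, 6202, 18994, 10760]

# Original scheme: ids restart at 0 for every dataset.
colliding = {}
for size in dataset_sizes:
    upsert(colliding, range(size), [None] * size)

# One possible fix: carry a running offset so ids never repeat.
unique = {}
next_id = 0
for size in dataset_sizes:
    upsert(unique, range(next_id, next_id + size), [None] * size)
    next_id += size

print(len(colliding))  # 18994, the size of the largest dataset
print(len(unique))     # 87928, the sum of all row counts
```

Applied to the function in the question, one option would be to add a start id parameter and build `ids = list(range(start_id, start_id + dataset_to_process.num_rows))`, passing the running total of rows inserted so far. Using UUIDs as point ids is another way to avoid collisions.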