python database vector-database milvus

Multithreaded Rendezvous Error - How to fix?


I'm currently running the script below to add a dataset of about 10M vectors into a Milvus collection:

import os
import time

import numpy as np
import pandas as pd

from dtx_data_tools.iterate import batched, map_threaded

# `client` is a Milvus client created elsewhere (connection setup not shown here)

times = []
files = [f'agg_dataset/{fp}' for fp in os.listdir('agg_dataset') if 'parquet' in fp]

db_ids_set = set()
counter = 0

for batch_file in files:
    prep_start = time.time()
    # Deduplicate within the file, then drop any UUIDs already inserted in earlier batches
    df = pd.read_parquet(batch_file).drop_duplicates(subset='scrape_uuid', keep="last")

    insert_ids = set(df['scrape_uuid'].tolist())

    new_uuids = insert_ids - db_ids_set
    df = df[df['scrape_uuid'].isin(new_uuids)]

    db_ids_set.update(new_uuids)

    prep_end = time.time() - prep_start
    print(f"prep time took {prep_end} seconds")

    start = time.time()
    transform_time = time.time()
    # Decode the raw bytes column into float32 vectors
    df['content_vector'] = df['encoding'].apply(lambda x: np.frombuffer(x, dtype=np.float32))
    df = df[['scrape_uuid', 'content_vector']]
    print(f"Transform time: {time.time() - transform_time} seconds")

    try:
        client.insert(
            collection_name="milvus_orb_benchmark",
            data=df.to_dict('records')
        )
    except Exception as e:
        # Log which batch failed, then re-raise so the error is visible
        print(f"Insert failed for batch {batch_file}: {e}")
        raise

    counter += 10000
    print(f"Imported {counter} articles..., batch {batch_file} uploaded")
    end = time.time() - start
    print(f"batch insert_time took {end} seconds")
    times.append(end)

The insertion performs well for about the first 40 batches of 10,000, and then all of a sudden hits the error shown below.

[error screenshot: "connection attempt timed out before receiving SETTINGS frame"]

From everything I read online this looks like an issue with connecting to the gRPC server, but I'm not sure how exactly to fix it. I've attached our Milvus Operator manifest too, which is pretty barebones.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: milvus
  annotations:
    eks.amazonaws.com/role-arn: xxxxxx
---
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: milvus
  labels:
    app: milvus
spec:
  components:
    serviceAccountName: milvus
  config:
    minio:
      bucketName: xxxxxx
      # enable AssumeRole
      useIAM: true
      useSSL: true
  dependencies:
    storage:
      external: true
      type: S3
      endpoint: xxxxxxxx
      secretRef: ""

Solution

  • “connection attempt timed out before receiving SETTINGS frame”

    It might be a known issue in the grpcio library: grpc/grpc#36256

    Read the comments in that issue. Check the version of your grpcio; if it is between 1.58 and 1.62, try the workaround described there (a quick way to check which version you have installed is sketched at the end of this answer).

    I just tried grpcio 1.60 and inserted vectors in batches of 10,000, but was not able to reproduce the error.

    I didn't see multithreading used in your script; given the "MultiThreadedRendezvous" error, do you mean you insert the batches with multiple threads?
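
    In case it helps, here is a minimal sketch of how you could check the installed grpcio version against the 1.58-1.62 range mentioned above. The upgrade/downgrade suggestion in the comments is my assumption of the workaround; follow whatever the linked issue actually recommends.

    # Minimal sketch: print the installed grpcio version and flag whether it
    # falls in the 1.58-1.62 range discussed in grpc/grpc#36256.
    import grpc

    major, minor = (int(part) for part in grpc.__version__.split(".")[:2])
    print(f"grpcio version: {grpc.__version__}")

    if (1, 58) <= (major, minor) <= (1, 62):
        # Moving grpcio to a version outside this range (e.g. via pip install
        # with a version pin) is my assumption of the workaround; confirm the
        # exact recommendation in the linked issue before changing anything.
        print("grpcio is in the affected 1.58-1.62 range; consider the workaround from the issue.")
    else:
        print("grpcio is outside the affected range; the grpcio issue above may not apply.")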