I'm currently running the script below to insert a dataset of about 10M vectors into a Milvus collection:
import os
import time

import numpy as np
import pandas as pd

from dtx_data_tools.iterate import batched, map_threaded

# `client` is a pymilvus MilvusClient created elsewhere

times = []
files = [f'agg_dataset/{fp}' for fp in os.listdir('agg_dataset') if 'parquet' in fp]
db_ids_set = set()
counter = 0

for batch_file in files:
    # Deduplicate within the batch and against previously inserted ids
    prep_start = time.time()
    df = pd.read_parquet(batch_file).drop_duplicates(subset='scrape_uuid', keep='last')
    insert_ids = set(df['scrape_uuid'].tolist())
    new_uuids = insert_ids - db_ids_set
    df = df[df['scrape_uuid'].isin(new_uuids)]
    db_ids_set.update(new_uuids)
    print(f"prep time took {time.time() - prep_start} seconds")

    start = time.time()
    transform_time = time.time()
    # Decode the raw bytes back into float32 vectors
    df['content_vector'] = df['encoding'].apply(lambda x: np.frombuffer(x, dtype=np.float32))
    df = df[['scrape_uuid', 'content_vector']]
    print(f"Transform time: {time.time() - transform_time} seconds")

    try:
        client.insert(
            collection_name="milvus_orb_benchmark",
            data=df.to_dict('records'),
        )
    except Exception as exc:
        print(f"insert failed for batch {batch_file}: {exc}")
        raise

    counter += 10000
    print(f"Imported {counter} articles..., batch {batch_file} uploaded")
    end = time.time() - start
    print(f"batch insert_time took {end} seconds")
    times.append(end)
The insertion performs well for roughly the first 40 batches of 10,000, and then suddenly hits the error attached below. From everything I've read online this looks like an issue connecting to the gRPC server, but I'm not sure how to fix it. I've also attached our Milvus Operator manifest, which is pretty barebones.
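In case it helps others hitting the same thing: one mitigation for a transiently dropped gRPC channel is to retry the insert with backoff instead of failing the whole run. A minimal sketch (the wrapper name and parameters are mine; it assumes `client` is a pymilvus `MilvusClient` and that the failure is transient):

```python
import time

def insert_with_retry(client, collection_name, records, retries=3, backoff=5.0):
    """Retry a Milvus insert on transient failures (e.g. a dropped gRPC channel).

    Sketch only: `client` is assumed to be a pymilvus MilvusClient; a
    persistent error is re-raised after the last attempt.
    """
    for attempt in range(1, retries + 1):
        try:
            return client.insert(collection_name=collection_name, data=records)
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"insert attempt {attempt} failed ({exc}); retrying in {backoff}s")
            time.sleep(backoff)
            backoff *= 2  # exponential backoff between attempts
```

In the loop above you would call `insert_with_retry(client, "milvus_orb_benchmark", df.to_dict('records'))` instead of `client.insert(...)`. This only papers over the symptom, though; if the channel drops consistently after ~40 batches it is still worth finding the underlying cause.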
apiVersion: v1
kind: ServiceAccount
metadata:
  name: milvus
  annotations:
    eks.amazonaws.com/role-arn: xxxxxx
---
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: milvus
  labels:
    app: milvus
spec:
  components:
    serviceAccountName: milvus
  config:
    minio:
      bucketName: xxxxxx
      # enable AssumeRole
      useIAM: true
      useSSL: true
  dependencies:
    storage:
      external: true
      type: S3
      endpoint: xxxxxxxx
      secretRef: ""
“connection attempt timed out before receiving SETTINGS frame”
It might be a known issue in the grpcio lib: grpc/grpc#36256
Read the comments in that issue. Check the version of your grpcio; if the version is between 1.58 and 1.62, try the workaround described there.
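A quick way to check whether the installed grpcio falls in that range (the helper name is mine; the 1.58–1.62 range is from the issue above):

```python
def grpcio_in_affected_range(v: str) -> bool:
    """True if a grpcio version string falls in the 1.58-1.62 range
    flagged in grpc/grpc#36256."""
    major, minor = (int(part) for part in v.split(".")[:2])
    return major == 1 and 58 <= minor <= 62
```

For the installed package, something like `grpcio_in_affected_range(grpc.__version__)` after `import grpc` should work.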
I just tried grpcio 1.60 and inserted vectors in batches of 10,000, but was not able to reproduce the error.
I didn't see multithreading used in your script; regarding the "Multithreaded Rendezvous Error", do you mean inserting batches from multiple threads?