vector-database, milvus

Slow inserts, and bulk_insert seems to fail, on Milvus standalone


We have a Milvus collection of about 351M rows built from data on Databricks. We are hosting the data in a standalone Milvus container on a single EC2 instance (r5.8xlarge). The collection has 380 partitions and an index only on our sparse float vector column.

When writing the data into Milvus, we broke the table down into 2000 files on Databricks and are doing stream inserts sequentially:

client.insert(collection_name, data)

Is there a faster way to do this without leveraging Kubernetes? We know there is a bulk insert, but it does not seem to be much faster; in fact, it fails for us in the middle of bulk_insert, even after preparing the data in S3 or MinIO. Is bulk_insert only optimized for Kubernetes? Is bulk_insert equivalent to regular inserts if our database is hosted on a single instance?

Currently, stream-inserting the 2000 files takes about a minute per file, so 2000 minutes comes to around 33 hours... Trying to insert too large a file causes failures; I believe inserts are limited to 64 MB per request, yet I have been inserting larger files, ranging from 70-90 MB, so presumably we would have to chunk each file into smaller requests, something like the sketch below.
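
For concreteness, this is roughly what chunking a file's rows into smaller insert calls could look like (the MilvusClient endpoint and the 2048-row batch size are illustrative placeholders, not values we have tuned):

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # placeholder endpoint
BATCH_ROWS = 2048  # illustrative; pick a value so each request stays under the ~64 MB limit

def insert_in_batches(collection_name, rows):
    # rows: a list of dicts matching the collection schema
    for start in range(0, len(rows), BATCH_ROWS):
        client.insert(collection_name, rows[start:start + BATCH_ROWS])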

P.S. We have also explored the spark-milvus connector, but it is limited in capabilities, and I believe it does not yet support certain data types that we are using in our schema: https://github.com/zilliztech/spark-milvus/issues/18


Solution

  • From my experience, Milvus offers two main methods for data ingestion: insert() and bulk_insert(). The insert() method works via an RPC channel from the SDK client to the Milvus server, passing through Pulsar/Kafka before the data is persisted in S3; each insert request is limited to 64 MB. bulk_insert(), on the other hand, accepts paths of files in S3 (relative to the bucket Milvus uses). The server instructs data nodes to read those files from S3 asynchronously, constructing segments and building indexes. This method can handle files up to 16 GB and supports formats like JSON, NumPy, and Parquet, with Parquet recommended.

    However, I think the bulk_insert() interface is not very user-friendly. I would recommend using the Milvus-provided BulkWriter tooling to convert data into the correct format, upload it to S3, and call bulk_insert(). It manages data in a memory buffer and flushes it to a data file once the buffer exceeds a threshold, typically 128 MB. There are LocalBulkWriter and RemoteBulkWriter classes to handle local file generation and S3 uploads, respectively. A rough end-to-end sketch follows below.
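
    As a minimal sketch of that flow with pymilvus: the field names, bucket, credentials, and endpoints below are placeholders, sparse vector support assumes a recent pymilvus release, and the exact import paths and state constants may differ between versions.

    import time

    from pymilvus import (
        BulkInsertState, CollectionSchema, DataType, FieldSchema,
        connections, utility,
    )
    from pymilvus.bulk_writer import BulkFileType, RemoteBulkWriter

    # Placeholder schema: an INT64 primary key plus the sparse float vector column.
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("sparse_vector", DataType.SPARSE_FLOAT_VECTOR),
    ])

    # RemoteBulkWriter buffers rows in memory and writes Parquet files directly
    # into the bucket that Milvus is configured to use.
    writer = RemoteBulkWriter(
        schema=schema,
        remote_path="bulk_data",
        connect_param=RemoteBulkWriter.S3ConnectParam(
            endpoint="s3.amazonaws.com",     # placeholder
            access_key="<access-key>",
            secret_key="<secret-key>",
            bucket_name="<milvus-bucket>",
            secure=True,
        ),
        file_type=BulkFileType.PARQUET,
    )

    for row in rows:          # rows: dicts matching the schema, produced from your 2000 files
        writer.append_row(row)
    writer.commit()           # flush any remaining buffered rows to S3

    # Hand the generated file batches to bulk insert and poll each task until it finishes.
    connections.connect(host="localhost", port="19530")   # placeholder endpoint
    for files in writer.batch_files:
        task_id = utility.do_bulk_insert(collection_name="my_collection", files=files)
        while True:
            state = utility.get_bulk_insert_state(task_id)
            if state.state in (BulkInsertState.ImportCompleted, BulkInsertState.ImportFailed):
                break
            time.sleep(10)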