python, apache-spark, pyspark, bigdata, milvus

How to properly optimize Spark and Milvus to handle big data?


I have a Spark dataframe with 2 columns: id and vector.

The vector column is a list of floats 20,000 elements long.

The dataframe itself is 2,500,000 rows long.

I use the Spark-Milvus connector to insert the data, since I tried a variety of ways of formatting small pieces of data and inserting them into the Milvus collection, to no avail.
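
For reference, the connector write I am attempting looks roughly like this, assuming the dataframe is named df (the option keys follow the Spark-Milvus connector user guide and may differ across connector versions):

# Rename the dataframe column to match the collection field name, then bulk-write.
df.withColumnRenamed("vector", "vec") \
    .write \
    .mode("append") \
    .option("milvus.host", "localhost") \
    .option("milvus.port", "19530") \
    .option("milvus.collection.name", "data") \
    .option("milvus.collection.vectorField", "vec") \
    .option("milvus.collection.primaryKeyField", "id") \
    .option("milvus.collection.vectorDim", "20000") \
    .format("milvus") \
    .save()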

When I create a collection in Milvus and try to insert a batch of 2,000,000 rows from the Spark dataframe, it takes more than 10 minutes and sometimes crashes.

Loading a Milvus collection of 200,000 records takes more than an hour and never finishes.

Once the batch is inserted, it takes nearly 10 minutes to build the index on the vector column.

I wonder if there is a general practice for dealing with large batches, and how to optimize the insert and indexing time.

What Spark and Milvus setup should I use to achieve the best possible performance?

And how should the data be transformed before inserting it into the Milvus collection? Should it be presented as a NumPy array or in some other format?

A random row, when collected from my Spark dataframe, looks like this: [1005, [0.01, ..., 0.78]], where 1005 is the id and the list of floats is the 20,000-element vector.
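
To make the format question concrete, here is a rough sketch of the manual route as I understand pymilvus expects it: column-oriented data (plain Python lists or NumPy float32 arrays), inserted in modest chunks into the collection object (data) created in the Milvus setup below. The chunk size and the use of toLocalIterator() are assumptions for illustration, not tested settings:

# Sketch: chunked, column-oriented insert with pymilvus.
# With 20,000-dim float32 vectors, 500 rows is roughly 40 MB per request,
# which should stay under typical gRPC message size limits.
# Pulling 2,500,000 rows through the driver this way is slow; this only
# illustrates the data layout pymilvus expects.
batch_size = 500
ids, vecs = [], []
for row in df.toLocalIterator():
    ids.append(row["id"])
    vecs.append(row["vector"])      # plain list of floats; NumPy float32 arrays also work
    if len(ids) == batch_size:
        data.insert([ids, vecs])    # one list per field, in schema order
        ids, vecs = [], []
if ids:
    data.insert([ids, vecs])
data.flush()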

Here's my Spark setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("collab_filter_test_on_local") \
    .config("spark.driver.extraClassPath", '/data/notebook_files/clickhouse-native-jdbc-shaded-2.6.5.jar') \
    .config("spark.jars", "/data/notebook_files/spark-milvus-1.0.0-SNAPSHOT.jar") \
    .config("spark.driver.memory", "16g") \
    .getOrCreate()

Here's my Milvus setup:

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

dim_size = 20000  # the vectors are 20,000 elements long

connections.connect(alias="default", host="localhost", port=19530)

fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name='vec', dtype=DataType.FLOAT_VECTOR, dim=dim_size)
]

schema = CollectionSchema(fields, 'data')

data = Collection('data', schema)
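
For completeness, the index build and load steps I referred to look roughly like this; the index type (IVF_FLAT), metric (L2), and nlist value are placeholder choices I have not validated:

# Placeholder index parameters; the right choice depends on recall/latency needs.
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 1024},
}
data.create_index(field_name="vec", index_params=index_params)
data.load()  # blocks until the collection is loaded into memory for search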

Solution

  • I'm the developer of the Spark-Milvus connector. Glad to hear you are using it. Please create an issue on GitHub so that we can reply to you in a more timely manner.

    For your question:

    1. A 20,000-dimensional vector is quite large; it will obviously take more resources at every step. Please consider whether it is really necessary and whether you can reduce the dimensionality (see the sketch after this list for one option).

    2. Inserting 2,000,000 rows in 10 minutes is roughly 3k rows/s. Actually, that is not too bad. Please upload the error message to GitHub when you get a crash.
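
    As an illustration of point 1, one option is to reduce the dimensionality on the Spark side, for example with PCA from Spark MLlib. This is only a sketch: the target dimension of 1024 is arbitrary, fitting PCA on 20,000-dimensional data is itself expensive, and you should check that the reduced vectors still give acceptable recall.

    # Sketch only: PCA as one way to shrink the vectors before inserting them.
    # k=1024 is an arbitrary example; the Milvus collection dim must match it.
    from pyspark.ml.feature import PCA
    from pyspark.ml.functions import array_to_vector, vector_to_array
    from pyspark.sql.functions import col

    vec_df = df.withColumn("features", array_to_vector(col("vector")))
    pca_model = PCA(k=1024, inputCol="features", outputCol="pca_vec").fit(vec_df)
    reduced_df = (pca_model.transform(vec_df)
                  .select("id", vector_to_array(col("pca_vec"), "float32").alias("vec")))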