weaviate

Combine Weaviate databases with different schemas


I currently have a Weaviate database that contains 12 schemas and another which contains 10 schemas. Is it possible to just take the data for the 10 schemas and put them into the other database? If so, what is the best way to do so?


Solution

  • Yes, you can export all collections from one Weaviate database and import them into another. (A bit of terminology: the Weaviate schema contains class definitions, and a collection consists of all objects of the same class.) Since you haven't mentioned a specific programming language, here's a solution in pseudocode / Python.

    First you need to initialize the clients for the source and target databases, say source_db_client and target_db_client.

    Then you need to get the schema of the source database (the one with 10 collections in your case).

    schema = source_db_client.schema.get()
    

    Then, for each class in the schema (for c in schema['classes']):

    1. Fetch the class definition from the schema
    class_def = source_db_client.schema.get(c['class'])  # the class name
    
    2. Recreate the class in the target database
    target_db_client.schema.create_class(class_def)
    
    3. Copy the objects to the target instance using the Cursor API and batching, skipping any cross-references (they require separate handling).

    Python code could look like this:

    import weaviate
    
    source_db_client = weaviate.Client('http://localhost:8080')
    
    target_db_client = weaviate.Client('http://localhost:8081')
    
    batch_size = 100
    
    schema = source_db_client.schema.get()
    
    for c in schema['classes']:
        class_name = c['class']
        class_def = source_db_client.schema.get(class_name)
        target_db_client.schema.create_class(class_def)
    
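        # Skip copying cross-reference properties. The class names in the list below are
        # placeholders: replace them with the classes your own schema references.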
        class_properties = [prop['name'] for prop in class_def['properties'] if prop['dataType'] not in [['crossReferencedClass1'], ['crossReferencedClass2'], ...]]
        cursor = None
    
        with target_db_client.batch(batch_size=batch_size) as batch:
            # Batch import all objects to the target instance
            while True:
                # From the SOURCE instance, get the next group of objects
                query = (
                    source_db_client.query.get(class_name, class_properties)
                    .with_additional(['id vector'])
                    .with_limit(batch_size)
                )
                if cursor is not None:
                    query = query.with_after(cursor)
                results = query.do()
    
                if 'errors' in results:
                    raise Exception(results['errors'])
    
                # If empty, we're finished
                if len(results['data']['Get'][class_name]) == 0:
                    break
    
                # Otherwise, add the objects to the batch to be added to the target instance
                for retrieved_object in results['data']['Get'][class_name]:
                    new_object = dict()
                    for prop in class_properties:
                        new_object[prop] = retrieved_object[prop]
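                    # Use the batch object opened by the `with` statement above
                    batch.add_data_object(
                        new_object,
                        class_name=class_name,
                        # Keep the source ID so the object has the same identity in the target
                        uuid=retrieved_object['_additional']['id'],
                        vector=retrieved_object['_additional']['vector']
                    )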
    
                # Update the cursor to the id of the last retrieved object
                cursor = results['data']['Get'][class_name][-1]['_additional']['id']
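
The cross-reference filter in the code above needs the referenced class names filled in by hand. As a rough alternative, the split can be derived from the class definition itself. The sketch below assumes that a dataType entry naming a class starts with an uppercase letter, while primitive types (text, int, text[], and so on) start with a lowercase letter. The helper name split_properties and the example class definition are hypothetical and only serve as illustration.

```python
def split_properties(class_def):
    """Return (primitive_names, reference_names) for one class definition."""
    primitive, references = [], []
    for prop in class_def.get("properties", []):
        # Assumption: an entry that starts with an uppercase letter names a class,
        # which makes the property a cross-reference.
        if any(data_type[:1].isupper() for data_type in prop["dataType"]):
            references.append(prop["name"])
        else:
            primitive.append(prop["name"])
    return primitive, references


# Hypothetical class definition, shaped like an entry of schema['classes']
example_class_def = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "wordCount", "dataType": ["int"]},
        {"name": "tags", "dataType": ["text[]"]},
        {"name": "hasAuthor", "dataType": ["Author"]},
    ],
}

primitive_props, reference_props = split_properties(example_class_def)
print(primitive_props)
print(reference_props)
```

The first list could stand in for class_properties in the copy loop. The second list names the properties that would need a separate pass once all objects exist in the target, which only works if the objects have the same IDs in both instances.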
    

    You can find equivalent TypeScript code in the Weaviate documentation under How-to: Manage data -> Read all objects -> Restore to a target instance.