mongodb azure-data-factory uuid azure-cosmosdb-mongoapi azure-data-studio

UUID format is not the same when copying from one Azure Cosmos DB for Mongo DB to another

We are using Azure Data Factory to move a non-prod collection from one Azure Cosmos DB for Mongo DB to another in the same resource group. They are both RU-based resources. I was able to use the "Ingest" template to create a simple pipeline with a single Copy step that successfully copied documents from the source collection to the desired destination.

Unfortunately, one of the pieces of data is a binary UUID, and the values look different when I view an equivalent document with MongoDB Compass. In the source, I can see the data represented correctly as:

UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1')

but in the migrated collection it's represented as

Binary.createFromBase64('oGI4z7TPyEWmvAAAG2C9oQ==', 3)

How can I migrate the data and preserve the representation of the binary UUID data?

Viewing the field in question for documents in Mongo DB Compass for the source is like this: MyField : UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1')

Looking at the preview within Azure Data Factory, the same field is represented in JSON as: "MyField": { "$binary": "oGI4z7TPyEWmvAAAG2C9oQ==", "$type": "03" },

Yet in the transformed document I see the same field in Compass as: MyField : Binary.createFromBase64('oGI4z7TPyEWmvAAAG2C9oQ==', 3)

Background:

When initially moving data into the source, we had C# code using the MongoDB C# driver (https://www.mongodb.com/docs/drivers/csharp/v2.19/#mongodb-c--driver) that configured the BSON serializer to achieve the result we want (data in the UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1') format):

BsonSerializer.RegisterSerializer(new GuidSerializer(GuidRepresentation.Standard));
BsonSerializer.RegisterSerializer(new ObjectSerializer(BsonSerializer.LookupDiscriminatorConvention(typeof(object)),GuidRepresentation.Standard));
BsonDefaults.GuidRepresentation = GuidRepresentation.Standard;
BsonDefaults.GuidRepresentationMode = GuidRepresentationMode.V3;

I've explored a number of different options to do this, including Azure Data Studio (a no-go because it doesn't support RU-based databases), mongodump/mongorestore, and Azure Data Factory. Azure Data Factory looked simple and promising.

Solution

The legacy UUID format (BSON binary subtype 3) does not define the order of the bytes in the stored representation, so different drivers and tools can encode and display the same UUID differently.

The base64-coded form 'oGI4z7TPyEWmvAAAG2C9oQ==' does not have that ambiguity because it shows the stored bytes exactly as they are, so most modern MongoDB drivers will use that form when displaying legacy UUIDs.

To demonstrate why this is a problem using the mongosh shell:

Representing 'cf3862a0-cfb4-45c8-a6bc-00001b60bda1' using legacy UUID, assuming the C# legacy encoding (Note that the byte order is reversed in the first 3 groups but not in the last 2):

primary> Buffer.from("oGI4z7TPyEWmvAAAG2C9oQ==", "base64")
<Buffer a0 62 38 cf b4 cf c8 45 a6 bc 00 00 1b 60 bd a1>

Using the modern UUID encoding for the same value (note that the bytes are in the same order as the string value):

primary> UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1').value()
<Buffer cf 38 62 a0 cf b4 45 c8 a6 bc 00 00 1b 60 bd a1>

So the reason it doesn't give you the UUID() form for the value is that it doesn't know whether the bytes were written with the C#, Java, Python, or standard encoding, and without knowing that, it can't reconstruct the original UUID.
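To make the ambiguity concrete, here is a minimal Node.js sketch that reads the same 16 stored bytes under the three legacy conventions. The function names are illustrative only and are not part of any MongoDB driver; the byte layouts are the commonly documented ones (C# reverses the first three groups, Java reverses each 8-byte half, Python stores the bytes in string order).

```javascript
// Illustrative helpers (not a driver API): decode one subtype 3 payload
// under each legacy convention and print the resulting UUID string.

function formatUuid(bytes) {
  const hex = Buffer.from(bytes).toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Copy a slice and reverse the copy, leaving the input untouched.
function reversed(buf, start, end) {
  return Buffer.from(buf.subarray(start, end)).reverse();
}

// C# legacy: groups of 4, 2 and 2 bytes are byte-swapped; the last 8 are not.
function fromCSharpLegacy(bytes) {
  const b = Buffer.from(bytes);
  return formatUuid(Buffer.concat([
    reversed(b, 0, 4),
    reversed(b, 4, 6),
    reversed(b, 6, 8),
    b.subarray(8, 16),
  ]));
}

// Java legacy: each 8-byte half is byte-swapped.
function fromJavaLegacy(bytes) {
  const b = Buffer.from(bytes);
  return formatUuid(Buffer.concat([reversed(b, 0, 8), reversed(b, 8, 16)]));
}

// Python legacy: bytes are already in string order.
function fromPythonLegacy(bytes) {
  return formatUuid(bytes);
}

const stored = Buffer.from("oGI4z7TPyEWmvAAAG2C9oQ==", "base64");
console.log(fromCSharpLegacy(stored)); // cf3862a0-cfb4-45c8-a6bc-00001b60bda1
console.log(fromJavaLegacy(stored));   // 45c8cfb4-cf38-62a0-a1bd-601b0000bca6
console.log(fromPythonLegacy(stored)); // a06238cf-b4cf-c845-a6bc-00001b60bda1
```

All three results are well-formed UUID strings, which is why a tool that only sees the subtype 3 bytes has no safe way to pick one.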

To avoid this problem, use the modern UUID representation (BSON binary subtype 4), which always uses the standard encoding.
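If the data has already been written as legacy binary and you know it was produced with the C# legacy encoding, the bytes can be reordered and retagged. The following Node.js sketch does this for a value in the `$binary`/`$type` JSON shape shown in the Azure Data Factory preview. The helper name `csharpLegacyToStandard` is hypothetical, and the C# byte order is an assumption that must hold for your data; how to apply such a transformation inside a particular pipeline is outside the scope of this sketch.

```javascript
// Hypothetical helper: convert { "$binary": ..., "$type": "03" } to the
// standard layout tagged "04", ASSUMING the bytes use C# legacy ordering.
function csharpLegacyToStandard(value) {
  if (value.$type !== "03") {
    return value; // not a legacy UUID, leave it unchanged
  }
  const b = Buffer.from(value.$binary, "base64");
  if (b.length !== 16) {
    throw new Error("expected exactly 16 bytes for a UUID");
  }
  // Undo the C# byte swaps on the first three groups (4, 2 and 2 bytes).
  const standard = Buffer.concat([
    Buffer.from(b.subarray(0, 4)).reverse(),
    Buffer.from(b.subarray(4, 6)).reverse(),
    Buffer.from(b.subarray(6, 8)).reverse(),
    b.subarray(8, 16),
  ]);
  return { $binary: standard.toString("base64"), $type: "04" };
}

const migrated = csharpLegacyToStandard({
  $binary: "oGI4z7TPyEWmvAAAG2C9oQ==",
  $type: "03",
});
console.log(migrated.$type); // 04
console.log(Buffer.from(migrated.$binary, "base64").toString("hex"));
// cf3862a0cfb445c8a6bc00001b60bda1
```

The resulting bytes match the output of UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1').value() shown earlier, so a subtype 4 value built this way carries the same UUID in the unambiguous layout.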

In C#, register the serializer to use the standard representation before generating UUIDs:

BsonSerializer.RegisterSerializer(new GuidSerializer(GuidRepresentation.Standard));