We are using Azure Data Factory to move a non-prod collection from one Azure Cosmos DB for MongoDB account to another in the same resource group. They are both RU-based resources. I was able to use the "Ingest" template to create a simple pipeline with a single Copy step that successfully copies documents from the source collection to the desired destination.
Unfortunately, one of the pieces of data is a binary UUID, and the values look different when I view an equivalent document with MongoDB Compass. In the source, I can see the data represented correctly as:
UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1')
but in the migrated collection it's represented as
Binary.createFromBase64('oGI4z7TPyEWmvAAAG2C9oQ==', 3)
How can I migrate the data and preserve the representation of the binary UUID data?
Viewing the field in question in MongoDB Compass, documents in the source look like this: MyField : UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1')
In the preview within Azure Data Factory, the same field is represented in JSON as: "MyField": { "$binary": "oGI4z7TPyEWmvAAAG2C9oQ==", "$type": "03" },
Yet in the migrated document, Compass shows the same field as: MyField : Binary.createFromBase64('oGI4z7TPyEWmvAAAG2C9oQ==', 3)
Background:
BsonSerializer.RegisterSerializer(new GuidSerializer(GuidRepresentation.Standard));
BsonSerializer.RegisterSerializer(new ObjectSerializer(BsonSerializer.LookupDiscriminatorConvention(typeof(object)),GuidRepresentation.Standard));
BsonDefaults.GuidRepresentation = GuidRepresentation.Standard;
BsonDefaults.GuidRepresentationMode = GuidRepresentationMode.V3;
The legacy UUID format (BSON binary type 3) does not define the order of the bytes in the stored representation, so different software is known to display the same stored value as different UUIDs.
The base64-encoded form 'oGI4z7TPyEWmvAAAG2C9oQ==' does not have that ambiguity, so most modern MongoDB drivers will use that form when displaying legacy UUIDs.
To demonstrate why this is a problem using the mongosh shell:
Representing 'cf3862a0-cfb4-45c8-a6bc-00001b60bda1' as a legacy UUID, assuming the C# legacy encoding (note that the byte order is reversed within each of the first 3 groups but not in the last 2):
primary> Buffer.from("oGI4z7TPyEWmvAAAG2C9oQ==", "base64")
<Buffer a0 62 38 cf b4 cf c8 45 a6 bc 00 00 1b 60 bd a1>
Using the modern UUID encoding for the same value (note that the bytes are in the same order as the string value):
primary> UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1').value()
<Buffer cf 38 62 a0 cf b4 45 c8 a6 bc 00 00 1b 60 bd a1>
So the reason it doesn't show you the UUID() form for the value is that it doesn't know whether you used the C#, Java, Python, or standard encoding, and without knowing that, it can't reconstruct the original UUID.
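To make the ambiguity concrete, here is a minimal Node.js sketch that reads the same 16 stored bytes under the three legacy conventions. The helper names `formatUuid` and `reverseRanges` are made up for this illustration, and the byte layouts are the commonly documented ones for the C#, Java, and Python legacy encodings; treat it as an illustration rather than what any particular tool does internally.

```javascript
// The 16 bytes stored in the migrated document (BSON binary type 3).
const stored = Buffer.from("oGI4z7TPyEWmvAAAG2C9oQ==", "base64");

// Format 16 bytes, taken in standard order, as a UUID string.
function formatUuid(buf) {
  const hex = Buffer.from(buf).toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Return a copy of buf with the bytes in each [start, end) range reversed.
function reverseRanges(buf, ranges) {
  const out = Buffer.from(buf); // copy, so the input is left untouched
  for (const [start, end] of ranges) {
    out.subarray(start, end).reverse(); // subarray is a view; reverse is in place
  }
  return out;
}

// C# legacy: the first three groups (4, 2, and 2 bytes) are little-endian.
const asCSharp = formatUuid(reverseRanges(stored, [[0, 4], [4, 6], [6, 8]]));
// Java legacy: each 8-byte half is little-endian.
const asJava = formatUuid(reverseRanges(stored, [[0, 8], [8, 16]]));
// Python legacy: bytes are already in standard order.
const asPython = formatUuid(stored);

console.log({ asCSharp, asJava, asPython });
```

Only the C# interpretation gives back 'cf3862a0-cfb4-45c8-a6bc-00001b60bda1'; the other two produce different, equally well-formed UUIDs, which is why a viewer cannot safely pick one.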
To avoid this problem, use the modern UUID representation (BSON binary type 4), which always uses the standard encoding.
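If the existing values are known to have been written with the C# legacy encoding (an assumption you would need to confirm for your own data), the bytes that a type 4 value would hold can be derived by undoing the C# reordering. The sketch below is plain Node.js with a hypothetical helper name, `csharpLegacyToStandard`; it only computes the reordered bytes and does not write anything back to a collection.

```javascript
// Convert a C# legacy (type 3) UUID payload to standard byte order,
// i.e. the bytes a type 4 value would hold for the same UUID.
// Assumes the input really was written with the C# legacy encoding.
function csharpLegacyToStandard(base64Legacy) {
  const bytes = Buffer.from(base64Legacy, "base64");
  if (bytes.length !== 16) {
    throw new Error("expected 16 bytes, got " + bytes.length);
  }
  // Reverse the three little-endian groups; the last 8 bytes stay as they are.
  bytes.subarray(0, 4).reverse();
  bytes.subarray(4, 6).reverse();
  bytes.subarray(6, 8).reverse();
  return bytes;
}

const standardBytes = csharpLegacyToStandard("oGI4z7TPyEWmvAAAG2C9oQ==");
console.log(standardBytes.toString("hex"));
// cf3862a0cfb445c8a6bc00001b60bda1
```

The printed hex matches the bytes shown above for UUID('cf3862a0-cfb4-45c8-a6bc-00001b60bda1'), so storing these bytes with binary type 4 would give the unambiguous representation.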
In C#, register the serializer to use the standard representation before generating UUIDs:
BsonSerializer.RegisterSerializer(new GuidSerializer(GuidRepresentation.Standard));