azure-data-factoryazure-purview

How does Azure Purview perform Data Lineage in Azure Data Factory when there are multiple Copy Data Activities on the same Source?


My particular scenario is this: Data Factory Pipeline

I have a .txt file in Azure Blob Storage.

  1. I copy this file in Blob to Azure SQLDB
  2. I copy the same file to archive location in the same blob container
  3. I then delete the file after it's archived

When I triggered the Pipeline in Azure Data Factory, Purview gave me a data lineage that only showed the Archive copy activity, and never showed the BLOB to Azure SQLDB activity. Refer to this screenshot for lineage: Purview Data Lineage

When I navigate to the Azure SQLDB destination in Purview, it says no data lineage is available for this asset.

Here is what I have done or thought could be the reason:

  1. Maybe the copy activities need to be in different pipelines. I tested this and same result occurs
  2. Maybe because I deleted the file it's not picking up the Blob source to Azure SQLDB copy activity. I will be testing this, but I think it's unlikely since it did pick up the Blob Source, to Blob Archive Destination copy activity
  3. Maybe it only picks up the last copy activity for a given source, and not all of them. I tested this and it did not change the data lineage. It is possible that I need to do something in Azure Purview to "reset" the data lineage, but I think that it uses the last pipeline run for the source and I noticed it did update the data lineage when I separated the pipeline into 2 pipelines (one for loading Azure SQLDB, and the other for archiving the Blob File)

I'm fairly stuck on this one... I will completely remove the archiving and see what happens, but according to all of the Microsoft Documentation, Data Lineage for Azure Blob and Azure SQLDB is supported, so this should be working. If anyone has answers or ideas, I would love to hear them.

Update** My newest theory is that there is a time lag between when you run a pipeline and the Data Lineage is refreshed in Purview... I am going to try disconnecting the Data Factory and Reconnecting.

Update #2** Deleting the Data Factory connection and reconnecting did nothing from what I can tell. I have been playing with how to delete assets via the REST API, which is apparently the only way to delete assets/relationships in Purview... I think my next step will be to delete the Purview Account and Storage.

Update #3*** Even after spinning up a new Purview account, the lineage still fails to show the Blob to Azure SQLDB. I am wondering if it's because of the for each logic I have, but doesn't make sense because the archive copy activity was in the for each as well. I'm at a loss. I have other Copy activities from Blob to Azure SQLDB that work, but not this one.

Thanks


Solution

  • After a LOT of testing. I believe the problem is Purview does not know how to handle Copy Activities that include additional columns.

    Does NOT work: With additional columns

    enter image description here

    Works: Without additional columns

    enter image description here

    The ONLY difference was the fact one had additional columns mapped, and the other did not. Slight design gap...

    I have created this Azure Purview Feature Request! https://feedback.azure.com/forums/932437-azure-purview/suggestions/42357196-allow-data-lineage-to-be-performed-on-azure-data-f

    Please vote for this so it can be implemented in a future release.