
Azure Purview - Scan file types


We are scanning an Azure Data Lake (Gen 2). The scan results include some files that we don't want to appear in the asset register, for example a configuration file (.wmk). Is there any way to hide all files of a certain type? I looked at the scanning rules to see if a custom rule would work. The file type (.wmk) is not listed as a scan target, yet it still appears in the asset register.

The same applies to data lake folders: we would like to see only resource sets in the assets, not the folders.

Is there a way to keep them from showing in the assets?



Solution

  • Before scanning, you can scope your scan to specific folders or subfolders by choosing the appropriate items in the list. Once the data source is registered and scanned, the Data Map extracts information about the structure (hierarchical namespace) of the data source. This information is used to build the browsing experience for data discovery.


    Note:

    • All future assets under a parent are selected automatically if the parent is fully or partially checked.
    • After a successful scan, there may be a delay of up to a few hours before newly scanned assets appear in the browse experience.

    When searching the catalog for assets, you can use operators to compose a search query.

    Specifically, you can use the Boolean operator NOT (in all caps) to exclude assets that contain the keyword to its right, and '*', a wildcard that matches one or more characters, so that your query does not return assets that have properties with (.wmk) in them.

    Example: Expense NOT wmk NOT *.wmk
    

    (Operators can be combined as many times as needed in a single query.)
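As a rough illustration of chaining NOT clauses, here is a minimal Python sketch that only builds the query string shown in the example above. The function name `build_exclusion_query` is hypothetical, and the sketch does not call any Purview API; the resulting string is what you would type into the catalog search box.

```python
from typing import Iterable


def build_exclusion_query(keyword: str, excluded_terms: Iterable[str]) -> str:
    """Compose a search string: the keyword followed by one NOT clause per excluded term."""
    parts = [keyword.strip()]
    for term in excluded_terms:
        term = term.strip()
        if term:  # skip empty entries so we never emit a dangling NOT
            parts.append(f"NOT {term}")
    return " ".join(parts)


# Hypothetical helper, used here only to reproduce the example query from the answer.
query = build_exclusion_query("Expense", ["wmk", "*.wmk"])
print(query)
```

Keeping the excluded terms in a list makes it easy to add further file types later without rewriting the query by hand.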

    Concept of resource sets:

    To customize or override how Azure Purview detects which assets are grouped as resource sets and how they are displayed within the catalog, you can define pattern rules in the management center.

    Create resource set pattern rules:

    1. Go to the management center. Select Pattern rules from the menu under the Resource sets heading. Select + New to create a new rule set.


    2. Enter the scope of your resource set pattern rule (the folder path).
    3. Update the fields as appropriate. In your case, the main ones are Qualified name and Do not group as resource set.


    Note: After a pattern rule is created, all new scans will apply the rule during ingestion. Existing assets in the data catalog will be updated via a background process which can take up to a few hours.

    Example: Don't group .wmk files into resource sets

    Input Files:

    https://myazureblob.blob.core.windows.net/bar/raw/Expense-7/01-01-2020/22:33:22-001.xls
    https://myazureblob.blob.core.windows.net/bar/raw/Expense-8/01-01-2020/22:33:22-002.wmk
    

    Pattern rule

    Scope: https://myazureblob.blob.core.windows.net/bar/
    
    Display name: Expense-{{Fileid}}
    
    Qualified Name: raw/Expense-{{Fileid:int}}/{{:date}}/{{:time}}-{{:int}}.wmk
    
    Resource Set: false
    

    Output: 1 individual asset (the .wmk file that matches the rule)

    Asset 1
    
    Display name: Expense-8
    
    Qualified Name: https://myazureblob.blob.core.windows.net/bar/raw/Expense-8/01-01-2020/22:33:22-002.wmk
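To check which of the input files a rule like this would pick up, the placeholder syntax can be approximated with a regular expression. The sketch below is an illustration only, not Purview's actual matching logic: the regexes assumed for `int`, `date`, and `time` are guesses based on the formats in the example paths, the helper `rule_to_regex` is hypothetical, and the pattern uses the `Expense-` prefix so that it lines up with the input file names.

```python
import re

# Assumed regexes for the placeholder types in the example rule.
# They mirror the formats seen in the sample paths, not Purview's own definitions.
TYPE_PATTERNS = {
    "int": r"\d+",
    "date": r"\d{2}-\d{2}-\d{4}",
    "time": r"\d{2}:\d{2}:\d{2}",
}

# Matches {{name:type}} and the anonymous form {{:type}}.
PLACEHOLDER = re.compile(r"\{\{(\w*):(\w+)\}\}")


def rule_to_regex(scope: str, qualified_name_pattern: str) -> re.Pattern:
    """Translate a scope plus a qualified-name pattern into a compiled regex."""
    pieces = [re.escape(scope)]
    pos = 0
    for m in PLACEHOLDER.finditer(qualified_name_pattern):
        # Literal text between placeholders is matched verbatim.
        pieces.append(re.escape(qualified_name_pattern[pos:m.start()]))
        name, type_name = m.group(1), m.group(2)
        body = TYPE_PATTERNS[type_name]
        # Named placeholders become named groups; anonymous ones are not captured.
        pieces.append(f"(?P<{name}>{body})" if name else f"(?:{body})")
        pos = m.end()
    pieces.append(re.escape(qualified_name_pattern[pos:]))
    return re.compile("".join(pieces))


SCOPE = "https://myazureblob.blob.core.windows.net/bar/"
RULE = "raw/Expense-{{Fileid:int}}/{{:date}}/{{:time}}-{{:int}}.wmk"

files = [
    SCOPE + "raw/Expense-7/01-01-2020/22:33:22-001.xls",
    SCOPE + "raw/Expense-8/01-01-2020/22:33:22-002.wmk",
]

rule = rule_to_regex(SCOPE, RULE)
for path in files:
    match = rule.fullmatch(path)
    if match:
        print(f"matched, display name Expense-{match.group('Fileid')}: {path}")
    else:
        print(f"not matched: {path}")
```

Under these assumptions only the .wmk path matches, so the .xls file is left to whatever other rules or defaults apply to it.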
    

    Additionally, if you feel this is not helpful, you can share your Feedback so the product team can look into this idea. ✌