We are scanning an Azure Data Lake (Gen 2). The scan results include some files that we don't want to appear in the asset register, for example a configuration file (.wmk) as per the below. Is there any way to hide all files of a certain type? I looked at the scanning rules to see if a custom rule would work; the file type (.wmk) is not listed as a scan target, yet it still appears in the asset register.
The same applies to data lake folders: we would like to see only resource sets in the assets, not the folders.
Is there a way to keep them from showing in the assets?
Before scanning you can scope your scan to specific folders or subfolders by choosing the appropriate items in the list. Once the data source is registered and scanned, the Data map extracts information about the structure (hierarchical namespace) of the data source. This information is used to build the browsing experience for data discovery.
Note:
- All future assets under a certain parent will be automatically selected if the parent is fully or partially checked
- After a successful scan, there may be a delay before newly scanned assets appear in the browse experience. This delay can be up to a few hours.
While searching the catalog for assets, operators can be used to compose a search query.
Specifically, you can use the Boolean operator NOT (in all caps) to specify that an asset must not contain the keyword to the right of the operator, and '*' as a wildcard that matches one or more characters, so that your query does not return assets that have properties with (.wmk) in them.
Example: Expense NOT wmk NOT *.wmk
(Operators can be combined as many times as needed in a single query.)
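If you build such queries often, the composition can be sketched in a few lines of Python. This is only an illustration of how the NOT clauses chain together; the function name `build_exclusion_query` is made up for this example and is not part of any Purview SDK. The resulting string is what you would type into the catalog search box.

```python
def build_exclusion_query(keyword, excluded_terms):
    """Compose a search string: a keyword followed by one NOT clause per excluded term.

    Hypothetical helper for illustration only; it just formats text.
    """
    parts = [keyword.strip()]
    for term in excluded_terms:
        # NOT must be upper case to be treated as an operator
        parts.append(f"NOT {term}")
    return " ".join(parts)


# Reproduces the example query above
query = build_exclusion_query("Expense", ["wmk", "*.wmk"])
print(query)
```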
Concept of resource sets:
To customize or override how Azure Purview detects which assets are grouped as resource sets and how they are displayed within the catalog, you can define pattern rules in the management center.
Create resource set pattern rules:
Note: After a pattern rule is created, all new scans will apply the rule during ingestion. Existing assets in the data catalog will be updated via a background process which can take up to a few hours.
Example: Don't group .wmk files into resource sets
Input Files:
https://myazureblob.blob.core.windows.net/bar/raw/Expense-7/01-01-2020/22:33:22-001.xls
https://myazureblob.blob.core.windows.net/bar/raw/Expense-8/01-01-2020/22:33:22-002.wmk
Pattern rule
Scope: https://myazureblob.blob.core.windows.net/bar/
Display name: Expense-{{Fileid}}
Qualified Name: raw/Expense-{{Fileid:int}}/{{:date}}/{{:time}}-{{:int}}.wmk
Resource Set: false
Output individual assets
Asset 1
Display name: Expense-7
Qualified Name: https://myazureblob.blob.core.windows.net/bar/raw/Expense-7/01-01-2020/22:33:22-001.xls
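To see which of the input files a rule like this would apply to, here is a rough Python sketch that turns the scope and qualified name into a regular expression and checks the two sample paths. It is an approximation under stated assumptions, not how Purview implements matching: the regexes for `int`, `date` and `time` are simplified stand-ins, `rule_to_regex` is a hypothetical helper, and the literal folder prefix in the pattern is taken to be `Expense-` so that it lines up with the sample paths.

```python
import re

# Simplified stand-ins for placeholder types (assumption; real matching may differ)
TYPE_PATTERNS = {
    "int": r"\d+",
    "date": r"\d{2}-\d{2}-\d{4}",
    "time": r"\d{2}:\d{2}:\d{2}",
}

# Matches {{name:type}}, {{:type}} and {{name}}
PLACEHOLDER = re.compile(r"\{\{(\w*):?(\w*)\}\}")


def rule_to_regex(scope, qualified_name):
    """Translate a pattern rule into a compiled regex (hypothetical helper)."""
    out = [re.escape(scope)]
    pos = 0
    for m in PLACEHOLDER.finditer(qualified_name):
        # Literal text between placeholders must match exactly
        out.append(re.escape(qualified_name[pos:m.start()]))
        name, type_ = m.group(1), m.group(2)
        body = TYPE_PATTERNS.get(type_, r"[^/]+")
        # Named placeholders become named groups; anonymous ones are not captured
        out.append(f"(?P<{name}>{body})" if name else f"(?:{body})")
        pos = m.end()
    out.append(re.escape(qualified_name[pos:]))
    return re.compile("".join(out))


scope = "https://myazureblob.blob.core.windows.net/bar/"
rule = rule_to_regex(scope, "raw/Expense-{{Fileid:int}}/{{:date}}/{{:time}}-{{:int}}.wmk")

paths = [
    scope + "raw/Expense-7/01-01-2020/22:33:22-001.xls",
    scope + "raw/Expense-8/01-01-2020/22:33:22-002.wmk",
]
for path in paths:
    match = rule.fullmatch(path)
    print(path, "->", f"Expense-{match.group('Fileid')}" if match else "no match")
```

In this sketch only the .wmk path matches the rule, because the qualified name ends in `.wmk`; the .xls path is not covered by it.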
Additionally, if you feel this is not helpful, you can share your Feedback so the product team can look into this idea. ✌