google-cloud-platform metadata google-data-catalog

Building Google Cloud Platform Data Catalog on unstructured data


I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?


Solution

  • This answer assumes that you are not using a tool such as BigQuery, Hive, or Presto to define schemas over your unstructured data and query it, and that you simply want to catalog your files.

    I had a similar use case. Google Data Catalog has an option to create custom entries.

    Some tips on building a data catalog over unstructured file data:

    1. Use meaningful file names for your JSON files. That makes them easier to search for.
    2. Since you are already using GCP, use its managed Data Catalog, and leverage the custom entries API to ingest the file metadata into it.
    3. If you also want to look for sensitive data in your JSON files, you could run DLP on them.
    4. Use Data Catalog Tags to enrich the file metadata. The linked tutorial shows how to do it on BigQuery tables, but you can do the same on custom entries.
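
As a rough illustration of tips 1 and 2, here is a minimal sketch that collects technical metadata for one converted JSON document and shapes it like a custom entry. It uses only the standard library and does not call Data Catalog. The function names, the system and type values, and the `gs://` URI layout are assumptions for illustration, and the entry ID rule (letters, digits, and underscores, at most 64 characters) reflects my understanding of the documented constraints.

```python
import hashlib
import json
import re


def make_entry_id(file_name: str) -> str:
    """Reduce a file name to letters, digits, and underscores (max 64 chars)."""
    entry_id = re.sub(r"[^A-Za-z0-9_]", "_", file_name)
    # Assumption: prefix an underscore when the name does not start with
    # a letter or underscore, to stay on the safe side of the ID rules.
    if not re.match(r"[A-Za-z_]", entry_id):
        entry_id = "_" + entry_id
    return entry_id[:64]


def describe_json_document(file_name: str, raw: bytes, bucket: str) -> dict:
    """Collect technical metadata for one converted JSON document."""
    document = json.loads(raw)
    top_level_keys = sorted(document) if isinstance(document, dict) else []
    return {
        "entry_id": make_entry_id(file_name),
        "display_name": file_name,
        "user_specified_system": "document_pipeline",  # hypothetical value
        "user_specified_type": "json_document",  # hypothetical value
        "gcs_uri": f"gs://{bucket}/{file_name}",  # assumed bucket layout
        "size_bytes": len(raw),
        "md5": hashlib.md5(raw).hexdigest(),
        "top_level_keys": top_level_keys,
    }
```

With the `google-cloud-datacatalog` client, the display name, system, and type would typically be set on an `Entry` and passed to `create_entry` together with the entry ID. The remaining values (size, checksum, keys) fit better as Tags, as described in tip 4.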

    I would also add information about the ETL jobs that convert these documents into JSON files as Tags, such as execution time, data quality score, user, and business owner.
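
One way to keep those ETL fields consistent is to define them once in code before attaching them as Tags. The sketch below is only an illustration: the class name, the field names, and the 0 to 1 range for the quality score are assumptions, and it builds a plain dictionary rather than calling Data Catalog.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class EtlRunTag:
    """Hypothetical set of tag fields describing one ETL conversion run."""

    execution_time: datetime
    data_quality_score: float
    user: str
    business_owner: str

    def to_fields(self) -> dict:
        """Return the fields as plain values, ready to copy into a Tag."""
        # Assumption: the quality score is expressed as a fraction in [0, 1].
        if not 0.0 <= self.data_quality_score <= 1.0:
            raise ValueError("data_quality_score must be between 0 and 1")
        return {
            "execution_time": self.execution_time.isoformat(),
            "data_quality_score": self.data_quality_score,
            "user": self.user,
            "business_owner": self.business_owner,
        }


# Example values are made up for illustration.
run = EtlRunTag(
    execution_time=datetime(2020, 5, 1, 12, 0, tzinfo=timezone.utc),
    data_quality_score=0.97,
    user="etl-service",
    business_owner="finance-team",
)
fields = run.to_fields()
```

Each key would correspond to a field in a tag template, for example a timestamp field for the execution time, a double for the score, and strings for the user and owner.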

    In case you are wondering how to do step 2, I put together a script, available on GitHub, that does it automatically. Another option is to work with Data Catalog Filesets.

    To choose between custom entries and filesets, ask yourself this: do you need information about your file names?

    If not, then filesets might be easier. At the time of this writing, a fileset does not show any information about individual file names, but filesets are good for managing file patterns in GCS buckets: a fileset "is defined by one or more file patterns that specify a set of one or more Cloud Storage files."
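
To get a feel for which files a pattern would cover before defining a fileset, you can approximate the matching locally. The sketch below is an assumption-laden approximation, not Data Catalog's own matcher: it treats `*` as any run of characters within one path segment, `**` as any run including `/`, and `?` as a single character, and it does not handle bracket classes such as `[a-m]`. The bucket and file names are made up.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a simplified GCS-style file pattern into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")  # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # any characters within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matching_files(pattern: str, uris: list) -> list:
    """Return the URIs that the pattern matches in full."""
    regex = pattern_to_regex(pattern)
    return [uri for uri in uris if regex.fullmatch(uri)]


# Hypothetical bucket contents.
uris = [
    "gs://docs/json/a.json",
    "gs://docs/json/2020/b.json",
    "gs://docs/raw/a.png",
]
flat = matching_files("gs://docs/json/*.json", uris)
nested = matching_files("gs://docs/json/**", uris)
```

Here `flat` contains only the file directly under `json/`, while `nested` also picks up the file in the `2020/` subfolder.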

    The datacatalog-util tool also has an option to enrich your filesets, in case you just want statistics about them, such as average file size and file types.
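
For a sense of what such statistics look like, here is a small standalone sketch that computes a file count, an average size, and a breakdown by extension from a list of `(name, size_bytes)` pairs. It is not how datacatalog-util works internally. The function name and the input shape are assumptions, and the listing itself would come from wherever you enumerate your bucket.

```python
from collections import Counter
from pathlib import PurePosixPath


def fileset_stats(files: list) -> dict:
    """Summarize a list of (name, size_bytes) pairs."""
    if not files:
        return {"count": 0, "average_size_bytes": 0.0, "types": {}}
    sizes = [size for _, size in files]
    # Group by file extension; files without one are counted as "(none)".
    types = Counter(PurePosixPath(name).suffix or "(none)" for name, _ in files)
    return {
        "count": len(files),
        "average_size_bytes": sum(sizes) / len(sizes),
        "types": dict(types),
    }


# Made-up listing for illustration.
stats = fileset_stats([("a.json", 100), ("b.json", 300), ("c.png", 200)])
```

The resulting values could then be attached to the fileset entry as Tag fields.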