python freebase google-cloud-vision google-knowledge-graph

Determine Categorical Hierarchy Level of Freebase MID Value


After using the Google Cloud Vision API, I received MID values in the format /m/XXXXXXX (not necessarily 7 characters at the end, though). What I would like to do is determine how specific one MID value is compared to the others, essentially how broad vs. refined a term is. For example, the term Vehicle might be level 1 while the term Van might be level 2.

I have tried to run the MID values through the Google Knowledge Graph API but unfortunately these MIDs are not in that database and return no information. For example, a few MIDs and descriptions I have are as follows:

/m/07s6nbt = text
/m/03gq5hm = font
/m/01n5jq = poster
/m/067408 = album cover

My initial thought on why these MIDs return nothing in the Knowledge Graph API is that they were not carried over after the discontinuation of Freebase. I understand that Google provides an RDF dump of Freebase, but I'm not sure how to read that data in Python and use it to determine the depth of a MID in the hierarchy.

If it's not possible to determine the category level of the MID value, the number of connections a term has would also be an appropriate proxy, assuming broader terms have more connections to other terms than more refined terms do. I found an article that discusses the number of "edges" a MID has, which I believe means the number of connections. However, it converts MID values to long values and uses various scripts that keep giving me numerous errors in Python. I was hoping for a simple table with MID values in one column and the number of connections in another, but I'm lost in their code, the value conversions, and the Python errors.

If you have any suggestions for easily determining the number of connections a MID has or its hierarchical level, it would be greatly appreciated. Thank you!


Solution

  • Those MIDs look like they're for pretty common things, so I'm surprised they're not in the Knowledge Graph. Do you prefix the MIDs to form URIs?

    "kg": "http://g.co/kg"
    "kg:/m/067408"
    

    Freebase and the Knowledge Graph aren't organized as hierarchies, so your level-finding idea doesn't really work. I'm also dubious about your idea of degree (i.e. the number of edges) being correlated with broader vs. narrower, but you should be able to use the dump that you've found to test it.

    The Freebase ExQ Data Dump that you found is super confusing because they rename Freebase types as topics (not to be confused with Freebase topics), but I think their freebase-nodes-in-out-name.tsv contains the information that you're looking for (number of edges == degree). You can use the inDegree, the outDegree, or the sum of the two.

    Their MID-to-integer conversion code doesn't look right to me (and doesn't match the comments), but you'll need to use a compatible implementation to match up with what they've done.

    Looking at

    /m/02w0000  "Clibadium subsessilifolium"@en
    

    it's encoded as

    48484848875048
    

    or

    48 48 48 48 87 50 48
     0  0  0  0  w  2  0
    

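    So, just take the ASCII values from right to left and concatenate them left to right. Note that 87 is the code for uppercase W (lowercase w would be 119), so the letters appear to be uppercased first. Confusing, inefficient, and wrong all in one! (It's actually a base 36 (or 37?) encoding.)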
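
To tie the pieces together, here is a minimal Python sketch of the encoding from the worked example above, plus a lookup that produces the MID-to-degree table the question asks for. The function names (mid_to_exq_id, load_degrees, degree_table) are made up for illustration. The column positions in freebase-nodes-in-out-name.tsv (encoded ID first, then in-degree, then out-degree) are an assumption, so check them against the actual file and adjust the parameters. The uppercasing step is inferred only from the single example, where w maps to 87.

```python
import csv
from typing import Dict, Iterable, List, Optional, Tuple


def mid_to_exq_id(mid: str) -> str:
    """Encode a MID such as '/m/02w0000' as in the worked example above."""
    if not mid.startswith("/m/"):
        raise ValueError(f"not a MID: {mid!r}")
    # Uppercasing is inferred from the example ('w' appears as 87, i.e. 'W').
    key = mid[len("/m/"):].upper()
    # Read the characters right to left and append each decimal ASCII code.
    return "".join(str(ord(ch)) for ch in reversed(key))


def load_degrees(
    rows: Iterable[str],
    id_col: int = 0,   # ASSUMPTION: encoded node ID is the first column
    in_col: int = 1,   # ASSUMPTION: in-degree is the second column
    out_col: int = 2,  # ASSUMPTION: out-degree is the third column
) -> Dict[str, Tuple[int, int]]:
    """Map encoded node ID -> (in_degree, out_degree) from tab-separated lines."""
    degrees: Dict[str, Tuple[int, int]] = {}
    # QUOTE_NONE keeps literal quotes in names like "..."@en from confusing csv.
    for fields in csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE):
        try:
            degrees[fields[id_col]] = (int(fields[in_col]), int(fields[out_col]))
        except (IndexError, ValueError):
            continue  # skip header lines or rows that do not parse
    return degrees


def degree_table(
    mids: Iterable[str], degrees: Dict[str, Tuple[int, int]]
) -> List[Tuple[str, Optional[int]]]:
    """Return (mid, in_degree + out_degree) pairs; None if the MID is absent."""
    table = []
    for mid in mids:
        found = degrees.get(mid_to_exq_id(mid))
        table.append((mid, sum(found) if found is not None else None))
    return table


print(mid_to_exq_id("/m/02w0000"))  # 48484848875048
```

With the real file, passing an open file handle (for example `open("freebase-nodes-in-out-name.tsv", encoding="utf-8")`) as `rows` should work, since the loader only iterates over lines. MIDs that come back as None are simply not present under this encoding, which is also a quick way to check whether the assumed encoding and column layout match the dump.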