I have 2 words, let's say computer and tool.
Computer is a concrete noun whereas tool is relatively abstract.
I want to get a level of abstractness for each word that reflects this.
I thought the best way to do it is by counting the number of hypernyms/hyponyms for each word.
Which computer would you refer to?
In WordNet, a word has different "concepts", aka synsets:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('computer')
[Synset('computer.n.01'), Synset('calculator.n.01')]
>>> wn.synsets('computer')[0].definition()
'a machine for performing calculations automatically'
>>> wn.synsets('computer')[1].definition()
'an expert at calculation (or at operating calculating machines)'
The hyper/hyponyms are concepts, i.e. synsets too, so they are not connected to the form/word but to the possible synsets that might be represented by the word computer, i.e.
>>> type(wn.synsets('computer')[0])
<class 'nltk.corpus.reader.wordnet.Synset'>
>>> wn.synsets('computer')[0].hypernyms()
[Synset('machine.n.01')]
>>> wn.synsets('computer')[0].hyponyms()
[Synset('analog_computer.n.01'), Synset('digital_computer.n.01'), Synset('home_computer.n.01'), Synset('node.n.08'), Synset('number_cruncher.n.02'), Synset('pari-mutuel_machine.n.01'), Synset('predictor.n.03'), Synset('server.n.03'), Synset('turing_machine.n.01'), Synset('web_site.n.01')]
According to the definition, should words have hyper/hyponyms? Or should concepts have hypo/hypernyms?
Okay, then we have to make some assumptions:
1. Let's consider all synsets of a word accessed through WordNet as a "holistic" concept of any word form.
2. We consider the sum of all DIRECT hyper-/hyponyms of all synsets of a given word.
3. Based on the number of hyper-/hyponyms of all synsets that can be represented by a certain word form, we deduce that word X is more/less abstract than word Y.
>>> hypernym_count = lambda word: sum(len(ss.hypernyms()) for ss in wn.synsets(word))
>>> hyponym_count = lambda word: sum(len(ss.hyponyms()) for ss in wn.synsets(word))
>>> hyponym_count('computer')
14
>>> hypernym_count('computer')
2
>>> hypernym_count('tool')
8
>>> hyponym_count('tool')
32
Since (3) is the hypothesis you want to test, you should be the one to decide which heuristic to use to deduce whether a word is more/less abstract, based on the hyponym_count and hypernym_count results.
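As one possible starting point, and purely as an assumption rather than a recommendation, you could treat the word with more direct hyponyms as the more abstract one. The helper name more_abstract below is hypothetical, and the numbers are simply the counts printed in the session above:

```python
def more_abstract(counts_x, counts_y):
    """Compare two words by direct hyponym count (an assumed heuristic).

    Each argument is a (hypernym_count, hyponym_count) pair.
    Returns "x", "y", or "tie".
    """
    _, hypo_x = counts_x
    _, hypo_y = counts_y
    if hypo_x == hypo_y:
        return "tie"
    return "x" if hypo_x > hypo_y else "y"


# Counts copied from the session above: (hypernyms, hyponyms)
computer = (2, 14)
tool = (8, 32)
print(more_abstract(computer, tool))  # prints "y": 32 > 14
```

Other choices (ratios, hypernym counts, depth in the hierarchy) may rank the same words differently, so whichever rule you pick is itself part of the hypothesis being tested.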
DIRECT hyper-/hyponyms?
We're only accessing the hyper-/hyponyms one level above/below the synset. That's what "direct" means here.
To get all the hyponyms below a synset, see https://stackoverflow.com/a/42012001/610569
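If you want counts over the whole hierarchy rather than only the direct neighbours, one option is a breadth-first walk that follows hypernyms() or hyponyms() repeatedly. This is a minimal sketch, not the code from the linked answer; transitive_count and word_transitive_counts are hypothetical helper names, and the second one assumes NLTK and its WordNet corpus are installed:

```python
from collections import deque


def transitive_count(start, relation):
    """Count the distinct nodes reachable from `start` via `relation`.

    `relation` is a callable returning the direct neighbours of a node.
    `start` itself is not counted unless a cycle leads back to it.
    """
    seen = set()
    queue = deque(relation(start))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # reached before via another path
        seen.add(node)
        queue.extend(relation(node))
    return len(seen)


def word_transitive_counts(word):
    """Return (all hypernyms, all hyponyms) summed over every synset of `word`."""
    # Imported lazily so the generic helper above works without NLTK.
    from nltk.corpus import wordnet as wn

    synsets = wn.synsets(word)
    up = sum(transitive_count(ss, lambda s: s.hypernyms()) for ss in synsets)
    down = sum(transitive_count(ss, lambda s: s.hyponyms()) for ss in synsets)
    return up, down
```

Calling word_transitive_counts('computer') and word_transitive_counts('tool') would give the transitive equivalents of hypernym_count and hyponym_count. Note that a synset shared by two senses of the same word is counted once per sense here, just as in the direct counts.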
That's for you to find out and tell us =) Have fun!