I want to automatically extract desirable concepts (noun phrases) from text. My plan is to extract all noun phrases, label each one with one of two classes (desirable phrase or non-desirable phrase), and then train a classifier on them. What I am trying now is to extract all possible phrases to serve as the training set. For example, take the sentence:
Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.
I want to get all phrases like "shoulder", "richer mix", "shoulder of richer mix", "junctions", "junctions of columns and beams", "columns and beams", "columns", "beams", or whatever else is possible. The desirable phrases here are "shoulder", "junctions", and "junctions of columns and beams". But I don't care about correctness at this step; I just want to get the training set first. Are there tools available for such a task?
I tried Rake from rake_nltk, but its results failed to include my desirable phrases (i.e., it did not extract all possible phrases):
from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams']
(It missed "junctions of columns and beams" here, presumably because RAKE splits candidate phrases at stopwords such as "of" and "and".)
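If that is indeed the cause, one possible workaround (a sketch, assuming Rake defaults to NLTK's English stopword list) is to pass a custom stopword set with "of" and "and" removed, so RAKE no longer splits candidates on them:
from nltk.corpus import stopwords
from rake_nltk import Rake

# drop 'of' and 'and' from the stopword list so RAKE keeps them inside phrases
custom_stopwords = set(stopwords.words('english')) - {'of', 'and'}
r = Rake(stopwords=custom_stopwords)
r.extract_keywords_from_text(data)
print(r.get_ranked_phrases())
With those two words no longer acting as phrase boundaries, longer candidates such as "junctions of columns and beams" can survive, at the cost of some noisier candidates.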
I also tried phrasemachine; its results also missed some of the desirable phrases:
import spacy
import phrasemachine

# load a spaCy model for tokenization and POS tagging
nlp = spacy.load('en_core_web_sm')
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
doc = nlp(data)
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]

# phrasemachine returns (start, end) token spans for the phrases it finds
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])
Result:
[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix']
(It missed many noun phrases here; note that the extracted "junctions of columns" stops before "and beams".)
You may wish to make use of spaCy's noun_chunks attribute:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')

phrases = set()
for nc in doc.noun_chunks:
    # the base noun chunk itself, e.g. 'a shoulder'
    phrases.add(nc.text)
    # the full subtree around the chunk's root, e.g. 'a shoulder of richer mix'
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)
print(phrases)
Result:
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
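If you also want the bare forms listed in the question ("shoulder" rather than "a shoulder", "junctions" rather than "these junctions"), one option is to trim leading determiners from each span before adding it. This is a minimal sketch building on the code above; the strip_det helper is my own, not part of spaCy's API:
def strip_det(span):
    # advance past leading determiners ('a', 'the', 'these', ...)
    start = span.start
    while start < span.end and span.doc[start].pos_ == 'DET':
        start += 1
    # fall back to the original span if it consisted only of determiners
    return span.doc[start:span.end] if start < span.end else span

phrases = set()
for nc in doc.noun_chunks:
    phrases.add(strip_det(nc).text)
    subtree = doc[nc.root.left_edge.i:nc.root.right_edge.i + 1]
    phrases.add(strip_det(subtree).text)
print(phrases)
That yields candidates such as 'shoulder', 'junctions', and 'junctions of columns and beams', which you can then label as desirable or non-desirable to build your training set.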