Can Python-NLTK recognize an input string and parse it based not only on whitespace but also on content? For example, "computer system" should be treated as a single phrase in this case. Can anyone provide sample code?
Input string: "A survey of user opinion of computer system response time"
Expected output: ["A", "survey", "of", "user", "opinion", "of", "computer system", "response", "time"]
The technology you're looking for goes by several names across different sub-fields and sub-sub-fields of linguistics and computing.
I'll give an example using the named entity (NE) chunker in NLTK:
>>> from nltk import word_tokenize, ne_chunk, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent)))
>>> for i in chunked:
...     print(i)
...
('A', 'DT')
('survey', 'NN')
('of', 'IN')
('user', 'NN')
('opinion', 'NN')
('of', 'IN')
('computer', 'NN')
('system', 'NN')
('response', 'NN')
('time', 'NN')
With named entities:
>>> sent2 = "Barack Obama meets Michael Jackson in Nihonbashi"
>>> chunked = ne_chunk(pos_tag(word_tokenize(sent2)))
>>> for i in chunked:
...     print(i)
...
(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'NNS')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)
You can see it's pretty flawed, but something is better than nothing, I guess.
Terminology Extraction
Here's a few tools
Now back to OP's question.
Q: Can NLTK extract "computer system" as a phrase?
A: Not really
As shown above, NLTK has a pre-trained chunker, but it works on named entities, and even then not all named entities are recognized well.
OP could possibly try a more radical idea: assume that a sequence of consecutive nouns always forms a phrase:
>>> from nltk import word_tokenize, pos_tag
>>> sent = "A survey of user opinion of computer system response time"
>>> tagged = pos_tag(word_tokenize(sent))
>>> chunks = []
>>> current_chunk = []
>>> for word, pos in tagged:
...     if pos.startswith('N'):
...         # Noun tag (NN, NNS, NNP, ...): extend the current run.
...         current_chunk.append((word, pos))
...     elif current_chunk:
...         # A non-noun ends the run; store it and start over.
...         chunks.append(current_chunk)
...         current_chunk = []
...
>>> if current_chunk:
...     # Flush the final run when the sentence ends on a noun.
...     chunks.append(current_chunk)
...
>>> chunks
[[('survey', 'NN')], [('user', 'NN'), ('opinion', 'NN')], [('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]]
>>> for i in chunks:
...     print(i)
...
[('survey', 'NN')]
[('user', 'NN'), ('opinion', 'NN')]
[('computer', 'NN'), ('system', 'NN'), ('response', 'NN'), ('time', 'NN')]
So even with that solution, getting 'computer system' alone seems hard. But if you think about it for a bit, 'computer system response time' seems like a more valid phrase than 'computer system'.
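To make the ambiguity concrete, here is a minimal sketch that lists every contiguous sub-phrase of that noun run. It is plain Python, not an NLTK feature; the helper name `sub_phrases` and its `min_len` parameter are made up for illustration, and the word list is typed in by hand rather than taken from the tagger.

```python
def sub_phrases(words, min_len=2):
    """Return every contiguous run of at least `min_len` words as a string."""
    n = len(words)
    return [
        " ".join(words[start:end])
        for start in range(n)
        for end in range(start + min_len, n + 1)
    ]


# The noun run found by the "consecutive nouns" heuristic (hard-coded here).
noun_run = ["computer", "system", "response", "time"]
candidates = sub_phrases(noun_run)

for phrase in candidates:
    print(phrase)
```

'computer system' is only one of six multi-word candidates, and nothing in the POS tags alone says which one to prefer.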
Do note that many different interpretations of 'computer system response time' seem valid, and there are many more possible ones. So you've got to ask what you are using the extracted phrase for, and then see how to proceed with cutting long phrases like 'computer system response time'.
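If the answer is "I already know which phrases I care about", one option is to skip the tagger and merge known multi-word phrases after whitespace tokenization. Below is a rough sketch of that idea in plain Python, assuming you supply the phrase list yourself; `merge_known_phrases` is a hypothetical helper, matching is exact and case-sensitive, and it greedily prefers the longest phrase at each position. NLTK also ships `nltk.tokenize.MWETokenizer`, which does a similar lexicon-driven merge.

```python
def merge_known_phrases(tokens, known_phrases):
    """Greedily merge the longest known multi-word phrase at each position."""
    phrase_set = {tuple(p.split()) for p in known_phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two words.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + size]) in phrase_set:
                merged.append(" ".join(tokens[i:i + size]))
                i += size
                break
        else:
            # No known phrase starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


sent = "A survey of user opinion of computer system response time"
# The phrase list is an assumption: you decide what counts as a phrase.
result = merge_known_phrases(sent.split(), ["computer system"])
print(result)
```

With only "computer system" in the list, this gives the output OP asked for. The hard part, deciding which phrases belong in the list, is still up to you.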