matlab

How to extract all the nouns from a tokenized document?


I'm trying to extract all the nouns from a tokenized document and select the top 3. It isn't working, and I suspect I'm not using the strcmp command correctly. This is my code:

sT2 = tokenizedDocument([
    "a strongly worded collection of words and letters"
    "another collection of words"]);

tD = tokenizedDocument(sT2);
tD = addPartOfSpeechDetails(tD);
tdetails = tokenDetails(tD);
td7 = table2cell(tdetails(:,7)); % PARTS OF SPEECH
siztd7 = size(td7);

cc = 1;
for ii = 1:siztd7
    if strcmp(td7(ii,1), 'noun') == 1
        tDNoun(cc) = tdetails(1,:);
        cc = cc + 1;
    end
end

bag = bagOfWords(tDNoun);
tb100 = topkwords(bag,3)

Solution

  • The variable tdetails is a MATLAB table, so you can extract the nouns directly with table indexing:

    nouns = tdetails{tdetails.PartOfSpeech == "noun", "Token"}
    

    The first subscript is a logical row mask that compares the table variable PartOfSpeech against "noun", and the second subscript selects only the table variable Token. Brace indexing ({}) extracts the underlying data rather than a sub-table, in this case a string column vector of the words.

    The result can be passed directly to bagOfWords, but the array nouns must first be transposed, because that function requires a row vector:

    bag = bagOfWords(nouns')
    topkwords(bag, 3)
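
Putting the pieces together, here is a minimal end-to-end sketch of the approach above. It assumes Text Analytics Toolbox is installed; the variable names str, documents, isNoun and top3 are only illustrative. It also tokenizes the strings once rather than calling tokenizedDocument twice as in the question. The loop in the question likely never matches because PartOfSpeech is categorical rather than character data, so strcmp returns false for every cell.

```matlab
% Sketch only: requires Text Analytics Toolbox.
str = [
    "a strongly worded collection of words and letters"
    "another collection of words"];

documents = tokenizedDocument(str);             % tokenize once, straight from the strings
documents = addPartOfSpeechDetails(documents);  % tag each token with a part of speech

tdetails = tokenDetails(documents);             % table with one row per token

% Logical mask over the rows whose part-of-speech tag is "noun".
isNoun = tdetails.PartOfSpeech == "noun";

% Brace indexing returns the Token data itself (a string column vector).
nouns = tdetails{isNoun, "Token"};

% bagOfWords treats a string row vector as one document, hence the transpose.
bag = bagOfWords(nouns');

% Table of the (up to) three most frequent nouns and their counts.
top3 = topkwords(bag, 3)
```

Referring to the column by name ("PartOfSpeech", "Token") rather than by position, as in tdetails(:,7), keeps the code working if the column order of the tokenDetails table differs.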