matlab text textscan

How can I extract only the words, without any punctuation marks, when reading a text file?


The text file abc.txt contains an arbitrary article scraped from the web. For example:

His name is "Donald" and he likes burger. On December 11, he married.

I want to extract only the words and numbers from this article, dropping periods, commas, quotes, and every other punctuation mark, with the first word of each sentence converted to lower case. For the example above, the result should be:

{his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married}

My code is as follows:

filename = 'abc.txt';
fileID = fopen(filename,'r');
C = textscan(fileID,'%s','delimiter',{',','.',':',';','"',''''});
fclose(fileID);
Cstr = C{:};
Cstr = Cstr(~cellfun('isempty',Cstr));

Is there a simpler way to extract only alphabetic words and numbers, discarding all symbols?


Solution

  • Two steps are needed, because only certain words should be converted to lower case.

    First, regexprep converts to lower case any capital letter that is at the start of the string or that follows a full stop and a space.

    In the regexprep function, we use the following pattern:

    (?<=^|\. )([A-Z])
    

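    to indicate that a single upper case letter, ([A-Z]), should be matched only when it is preceded by the start of the string (^) or by a full stop and a space (\. ). The (?<=...) construct is a lookbehind, so the preceding characters are checked but are not part of the match.

    The ${lower($0)} component of the replacement is called a dynamic expression. MATLAB evaluates lower on $0, the whole match (here just the capital letter, since the lookbehind consumes no characters), and substitutes the result. This syntax is specific to the MATLAB language.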
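As a minimal sketch of this first step on its own, the snippet below applies the same regexprep call to the example sentence from the question, held in a character vector instead of being read from abc.txt. The variable names sample and lowered are illustrative only.

```matlab
% Example sentence from the question, used in place of the file contents.
sample = 'His name is "Donald" and he likes burger. On December 11, he married.';

% Lower-case a capital letter only when it is at the start of the text
% or directly follows a full stop and a space.
lowered = regexprep(sample, '(?<=^|\. )([A-Z])', '${lower($0)}');

disp(lowered)
```

Only "His" and "On" are affected. "Donald" and "December" keep their capitals because neither is preceded by the start of the string or by a full stop and a space.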



    Second, once the lower case conversions have been made, regexp with the 'match' option returns every run of one or more letters or digits.

    The pattern [a-zA-Z0-9]+ matches one or more consecutive lower case letters, upper case letters, or digits, so punctuation and whitespace are never part of a match.
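The matching step can also be tried in isolation. This is a small sketch that assumes the text has already been lower-cased where needed. The variable names lowered and tokens are illustrative only.

```matlab
% Text after the sentence-initial capitals have been lower-cased.
lowered = 'his name is "Donald" and he likes burger. on December 11, he married.';

% Collect every run of letters and digits; quotes, commas, full stops,
% and spaces act as separators and are dropped.
tokens = regexp(lowered, '[a-zA-Z0-9]+', 'match');

% regexp returns a 1-by-N cell row; transpose it to get a column.
words = tokens';
disp(numel(words))
```

Note that any character outside the class splits a word, so a contraction such as don't would come back as the two tokens don and t. If that matters for your text, the character class would need to be extended.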


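    % Read the whole file into a single character vector.
    text = fileread('abc.txt');
    % Lower-case sentence-initial capitals, then collect runs of letters and digits.
    data = {regexp(regexprep(text,'(?<=^|\. )([A-Z])','${lower($0)}'),'[a-zA-Z0-9]+','match')'};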
    
    >> data{1}
    
    13×1 cell array
    
        {'his'     }
        {'name'    }
        {'is'      }
        {'Donald'  }
        {'and'     }
        {'he'      }
        {'likes'   }
        {'burger'  }
        {'on'      }
        {'December'}
        {'11'      }
        {'he'      }
        {'married' }