The text file abc.txt
is an arbitrary article that has been scraped from the web. For example, it is as follows:
His name is "Donald" and he likes burger. On December 11, he married.
I want to extract only words in lower case and numbers except for all kinds of periods and quotes in the above article. In the case of the above example:
{his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married}
My code is as follows:
filename = 'abc.txt';
fileID = fopen(filename,'r');
C = textscan(fileID,'%s','delimiter',{',','.',':',';','"','''});
fclose(fileID);
Cstr = C{:};
Cstr = Cstr(~cellfun('isempty',Cstr));
Is there any simple code to extract only alphabet words and numbers except all symbols?
Two steps are necessary as you want to convert certain words to lowercase.
regexprep converts words, which are either at the start of the string or follow a full stop and whitespace, to lower case.
In the regexprep
function, we use the following pattern:
(?<=^|\. )([A-Z])
to indicate that:
(?<=^|\. )
We want to assert that before the word of interest either the start of string (^
), or (|
) a full stop (.
) followed by whitespace are found. This type of construct is called a lookbehind.([A-Z])
This part of the expression matches and captures (stores the match) a upper case letter (A-Z
).The ${lower($0)}
component in the regex is called a dynamic expression, and replaces the contents of the captured group (([A-Z])
) to lower case. This syntax is specific to the MATLAB language.
You can check the behaviour of the above expression here.
Once the lower case conversions have occurred, regexp finds all occurrences of one or more digits, lower case and upper case letters.
The pattern [a-zA-Z0-9]+
matches lower case letters, upper case letters and digits.
You can check the behavior of this regex here.
text = fileread('abc.txt')
data = {regexp(regexprep(text,'(?<=^|\. )([A-Z])','${lower($0)}'),'[a-zA-Z0-9]+','match')'}
>>data{1}
13×1 cell array
{'his' }
{'name' }
{'is' }
{'Donald' }
{'and' }
{'he' }
{'likes' }
{'burger' }
{'on' }
{'December'}
{'11' }
{'he' }
{'married' }