I want to extract the words that do not repeat in the text below, using only regex. I have seen similar questions, such as "Only extract those words from a list that include no repeating letters, using regex" (which is about non-repeating letters) and "Regular Expression :match string containing only non repeating words". The result should be a list of the non-repeating words, in the natural order in which they occur in the text.
My text in common format:
Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied.
My text as a vertical list, one word per line (in case that is easier to work with), produced using the answer to this question
If you need a pure regex solution, you can only do that with .NET or with the PyPI regex module for Python, because you need two things that regex libraries do not usually feature: 1) right-to-left input string parsing and 2) infinite-width lookbehind.
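To illustrate the second requirement, here is a minimal sketch showing that Python's built-in re module refuses a lookbehind of unbounded width. The toy pattern is hypothetical and chosen only for the demonstration; it is not the pattern used in the solution.

```python
import re

# Hypothetical toy pattern: a "b" with no "a" anywhere before it.
# The lookbehind body `a.*` has no fixed width, so the stdlib `re`
# module refuses to compile it.
try:
    re.compile(r'(?<!a.*)b')
    compiled = True
except re.error as exc:
    compiled = False
    print("re rejected the pattern:", exc)
```

This is why the solution imports regex rather than re.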
Here is a Python solution:
import regex
text="Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied."
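# (?r) scans right to left, so the lookbehind rejects any word that already occurs earlier in the text
rx = r'(?rus)(?<!\b\1\b.*?)\b(\w+)\b'
# Matches come back last-to-first; reverse them to restore reading order
print(list(reversed(regex.findall(rx, text))))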
See an online demo.
Details
- (?rus) - inline flags: r enables right-to-left input string parsing (all patterns in the regular expression still match left to right as usual, so the matched texts are not reversed); u makes \w Unicode-aware in Python 2 and is the default in Python 3; s is the DOTALL modifier, making . match line breaks
- (?<!\b\1\b.*?) - no match if, somewhere to the left of the current location, the same text as captured in Group 1 (defined later in the expression) occurs as a whole word, followed by any 0+ chars
- \b(\w+)\b - a whole word, 1+ word chars within word boundaries.
The reversed call is used to print the words in the original order, as the right-to-left regex matched them from end to start.
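As a cross-check of what the pattern computes, here is a rough standard-library sketch. It assumes, as the explanation above suggests, that the regex keeps the first occurrence of every distinct word (case-sensitive). The helper names first_occurrences and occurring_once are made up for this illustration, and the second helper covers the stricter reading of "words that don't repeat" (words that appear exactly once).

```python
import re
from collections import Counter

def first_occurrences(text):
    """Each distinct word once, in order of first appearance (case-sensitive)."""
    # dict.fromkeys keeps insertion order, so later duplicates are dropped
    return list(dict.fromkeys(re.findall(r'\w+', text)))

def occurring_once(text):
    """Only the words that appear exactly once in the whole text."""
    words = re.findall(r'\w+', text)
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]

sample = "to be or not to be, that is the question"
print(first_occurrences(sample))
# ['to', 'be', 'or', 'not', 'that', 'is', 'the', 'question']
print(occurring_once(sample))
# ['or', 'not', 'that', 'is', 'the', 'question']
```

If the output of the regex solution differs from first_occurrences on your text, the assumption above does not hold for your input and is worth checking by hand.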