regex, cpu-word, non-repetitive

Extract only non-repeating words from text using only regex in a shell terminal


I want to extract only the words that do not repeat in the text below, and I want to do it with regex alone. I have seen similar questions, such as "Only extract those words from a list that include no repeating letters, using regex" (which is about repeated letters, not words) and "Regular Expression :match string containing only non repeating words". The result should be a list of the non-repeating words, in the natural order in which they occur in the text.

My text in its normal format:

Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied.

My text as a vertical list, one word per line (in case that is easier to work with), produced using the answer to this question


Solution

  • If you need a pure regex solution, you can only do that with .NET or the Python PyPI regex module, because it requires two features that regex libraries do not usually offer: 1) right-to-left parsing of the input string and 2) infinite-width lookbehind.

    Here is a Python solution:

    import regex  # third-party module from PyPI (pip install regex), not the built-in re

    text = "Teaching psychology is the part of educational psychology that refers to school education. As will be seen later, both have the same objective: to study, explain and understand the processes of behavioral change that are produce in people as a consequence of their participation in activities educational What gives an entity proper to teaching psychology is the nature and the characteristics of the educational activities that exist at the base of the of behavioral change studied."
    # (?r) scans right to left, so each word is captured before the lookbehind is checked;
    # the lookbehind then rejects any word that already occurs earlier in the text.
    rx = r'(?rus)(?<!\b\1\b.*?)\b(\w+)\b'
    print(list(reversed(regex.findall(rx, text))))
    

    See an online demo.

    Details

    reversed() is used to print the words in their original order, because the right-to-left regex matches them from the end of the text to the start.
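
As a cross-check for the pattern above, or for environments where the PyPI regex module cannot be installed, here is a minimal sketch that uses only Python's standard library. It is not a pure-regex solution and is not part of the original answer; the function names first_occurrences and words_occurring_once and the sample sentence are hypothetical, chosen only for illustration. The sketch also separates two possible readings of "non-repeating": keeping the first occurrence of every word (which is what the right-to-left pattern appears to do), or keeping only words that occur exactly once.

```python
import re
from collections import Counter


def first_occurrences(text):
    """Return each distinct word once, in order of first appearance (case-sensitive)."""
    words = re.findall(r'\w+', text)
    # dict keys keep insertion order (Python 3.7+), so later duplicates are dropped.
    return list(dict.fromkeys(words))


def words_occurring_once(text):
    """Return only the words that appear exactly once, in their original order."""
    words = re.findall(r'\w+', text)
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


sample = "the cat saw the dog and the cat ran"
print(first_occurrences(sample))     # ['the', 'cat', 'saw', 'dog', 'and', 'ran']
print(words_occurring_once(sample))  # ['saw', 'dog', 'and', 'ran']
```

Both helpers are case-sensitive, so "Teaching" and "teaching" in the question's text count as different words, just as they do for the regex shown above.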