regex  word-boundary  keyboard-maestro

How Can I Create a RegEx Pattern that will Get N Words Using a Custom Word Boundary?


I need a RegEx pattern that will return the first N words using a custom word boundary that is the normal RegEx white space (\s) plus punctuation like .,;:!?-*_

EDIT #1: Thanks for all your comments.

To be clear:

  1. I'd like to set the characters that would be the word delimiters
  2. Let's call this the "Delimiter Set", or strDelimiters
  3. strDelimiters = ".,;:!?-*_"
  4. nNumWordsToFind = 5
  5. A word is defined as any contiguous text that does NOT contain any character in strDelimiters
  6. The RegEx word boundary is any contiguous text that consists of one or more of the characters in strDelimiters
  7. I'd like to build the RegEx pattern to get/return the first nNumWordsToFind using the strDelimiters.

EDIT #2: Sat, Aug 8, 2015 at 12:49 AM US CT

@maraca definitely answered my question as originally stated. But what I actually need is to return up to nNumWordsToFind words. So if the source text has only 3 words but my RegEx asks for 4, I need it to return the 3 words. The answer provided by maraca fails if nNumWordsToFind is greater than the number of actual words in the source text.

For example:

one,two;three-four_five.six:seven eight    nine! ten

It would see this as 10 words. If I want the first 5 words, it would return:

one,two;three-four_five.

I have this pattern using the normal \s whitespace, which works but is NOT exactly what I need:

([\w]+\s+){<NumWordsOut>}

where <NumWordsOut> is the number of words to return.
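As a rough illustration of why the whitespace-only pattern falls short, here is a minimal sketch using Python's re module. Python is an assumption made for the demo (Keyboard Maestro's own regex engine may differ in details), and match_first_words and sample are hypothetical names. Because \w+ must be followed by whitespace, punctuation-separated words never count.

```python
import re

def match_first_words(text, n):
    # The pattern from the question: n runs of word characters,
    # each of which must be followed by at least one whitespace character.
    pattern = r"([\w]+\s+){%d}" % n
    m = re.search(pattern, text)
    return m.group(0) if m else None

sample = "one,two;three-four_five.six:seven eight    nine! ten"

# Only "seven" and "eight" are followed by whitespace back to back,
# so asking for 2 words skips everything before them.
two = match_first_words(sample, 2)

# There is no run of 5 whitespace-terminated words, so this finds nothing.
five = match_first_words(sample, 5)
```

With this sample, asking for 2 words yields "seven eight    " rather than "one,two;", and asking for 5 words yields no match at all.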

I have also found this word boundary pattern, but I don't know how to use it:

a "real word boundary" that detects the edge between an ASCII letter and a non-letter.

(?i)(?<=^|[^a-z])(?=[a-z])|(?<=[a-z])(?=$|[^a-z])

However, I would want my words to allow numbers as well.

In any case, I have not been able to work out how to use the above custom word boundary pattern to return the first N words of my text.

BTW, I will be using this in a Keyboard Maestro macro.

Can anyone help? TIA.


Solution

  • All you have to do is adapt your pattern ([\w]+\s+){<NumWordsOut>} to the following, which also handles some special cases:

    ^[\s.,;:!?*_-]*([^\s.,;:!?*_-]+([\s.,;:!?*_-]+|$)){<NumWordsOut>}
    1.             2.              3.             4.  5.
    
    1. Match any amount of delimiters before the first word
    2. Match a word (= at least one non-delimiter)
    3. The word has to be followed by at least one delimiter
    4. Or it can be at the end of the string (in case no delimiter follows at the end)
    5. Repeat 2. to 4. <NumWordsOut> times

    Note that I moved the - to the end of the character class: it has to be at the start or end, otherwise it needs to be escaped as \-.
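
The pattern above can be tried out with a short sketch in Python's re module. Python is an assumption made for the demo (Keyboard Maestro's regex engine may behave differently in details), and build_pattern, first_words and the at_most flag are hypothetical names. The at_most option is not part of the answer above. It is one possible adaptation to the "up to N words" requirement from EDIT #2, swapping the quantifier {N} for {1,N}.

```python
import re

DELIMITERS = ".,;:!?-*_"  # strDelimiters from the question

def build_pattern(n, delimiters=DELIMITERS, at_most=False):
    # re.escape makes "-" and other metacharacters safe inside [...],
    # so the order of the delimiters no longer matters.
    cls = r"\s" + re.escape(delimiters)
    # {n} requires exactly n words; {1,n} accepts fewer (assumed variant).
    count = f"1,{n}" if at_most else str(n)
    return rf"^[{cls}]*(?:[^{cls}]+(?:[{cls}]+|$)){{{count}}}"

def first_words(text, n, at_most=False):
    m = re.match(build_pattern(n, at_most=at_most), text)
    return m.group(0) if m else None

sample = "one,two;three-four_five.six:seven eight    nine! ten"
exact = first_words(sample, 5)
too_many = first_words("one two three", 4)
up_to = first_words("one two three", 4, at_most=True)
```

Here exact is "one,two;three-four_five.", matching the expected output in the question. too_many is None because the text has only three words, while up_to returns "one two three". Note that the match keeps the delimiters between words and after the last matched word.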