Tags: regex, replace, sas, pcre

Regex: substitute spaces after specific word


I’m trying (and failing) to write a regular expression (PCRE2) that replaces every space with a dash (-) after the first occurrence of a particular word (namely VAN, VON or DE). The word itself must be surrounded by spaces, shown here as •VAN•, •VON• and •DE•.

For example:

HENRIETTA VON DER GRAAF
CAROLINE VAN OOSTEN DE WINKEL
MARC DE VRIES VAN JONG
ANNEKA VANHOVEN BAKKER
JOHN WILKINSON SMITH

would translate to:

HENRIETTA VON-DER-GRAAF
CAROLINE VAN-OOSTEN-DE-WINKEL
MARC DE-VRIES-VAN-JONG
ANNEKA VANHOVEN BAKKER (NB: Does not match VAN as not surrounded by spaces)
JOHN WILKINSON SMITH (NB: No substitution here as pattern not matched)

This is as far as I’ve got, but it’s not substituting all of the spaces following the match:

\b( VON| VAN| DE)+\s

https://regex101.com/r/s6BC1y/1

Any advice most appreciated!


Solution

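  • You can do the substitution itself without a regular expression: use PRXMATCH only to locate the first matching word, then use TRANSLATE to turn the remaining spaces into dashes.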

    data have;
    input text $CHAR50.;
    datalines;
    HENRIETTA VON DER GRAAF
    CAROLINE VAN OOSTEN DE WINKEL
    MARC DE VRIES VAN JONG
    ANNEKA VANHOVEN BAKKER
    JOHN WILKINSON SMITH
    ;
    
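    data want;
      set have;
      /* p = position of the space before the first standalone VAN, VON or DE (0 if none) */
      p = prxmatch('/ (VAN|VON|DE) /', text);
      /* From the matched word onward, turn spaces into dashes; trim() leaves the trailing padding blank */
      if 0 < p < length(text) then
        substr(text, p+1) = translate(substr(trim(text), p+1), '-', ' ');
    run;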
    

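For readers without SAS, the same two-step idea (find the first standalone word, then replace the spaces that follow it) can be sketched in Python. This is only an illustration, not part of the original answer: the helper name `hyphenate_after_particle` is made up here, and the sketch assumes the input has no trailing padding, since any trailing spaces would also become dashes.

```python
import re

# First VAN, VON or DE with a space on both sides (lookarounds keep the spaces out of the match).
PARTICLE = re.compile(r"(?<= )(?:VAN|VON|DE)(?= )")


def hyphenate_after_particle(name: str) -> str:
    """Replace every space after the first standalone VAN/VON/DE with a dash."""
    match = PARTICLE.search(name)
    if match is None:
        # No standalone particle: leave the name untouched.
        return name
    # Keep everything up to the end of the particle, dash the rest.
    head, tail = name[:match.end()], name[match.end():]
    return head + tail.replace(" ", "-")


if __name__ == "__main__":
    for line in [
        "HENRIETTA VON DER GRAAF",
        "CAROLINE VAN OOSTEN DE WINKEL",
        "MARC DE VRIES VAN JONG",
        "ANNEKA VANHOVEN BAKKER",
        "JOHN WILKINSON SMITH",
    ]:
        print(hyphenate_after_particle(line))
```

Splitting the work this way avoids needing a single pattern that matches "every space after the first match", which is the part the original attempt struggles with.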