r, regex, gsub, string-substitution

Regex to replace a specific character only outside parentheses


I am looking for a regex (preferably in R) that replaces every occurrence of a specific character, say ;, with another string, say ;;, but only when that character is not inside parentheses () in the text string.

Notes:

1. The character to be replaced may also occur more than once inside the parentheses.

2. There are no nested parentheses in the data/vector.

Example

If anything is still unclear, I will try to clarify.

in_vec <- c(
  "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag",
  "zvc;dfasdf;asdga;asd(asd;hsfd)",
  "adsg;(asdg;ASF;DFG;ASDF;);sdafdf",
  "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"
)

in_vec
#> [1] "abcd;ghi;dfsF(adffg;adfsasdf);dfg;(asd;fdsg);ag"
#> [2] "zvc;dfasdf;asdga;asd(asd;hsfd)"             
#> [3] "adsg;(asdg;ASF;DFG;ASDF;);sdafdf"           
#> [4] "asagf;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;sdfa"

Expected output (calculated manually)

[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag" 
[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"             
[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"            
[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"

Solution

  • You can use gsub with the pattern ;(?![^(]*\\)) and perl=TRUE:

    gsub(";(?![^(]*\\))", ";;", in_vec, perl=TRUE)
    #[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
    #[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
    #[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
    #[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       
    

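    ; matches a literal ;. (?!...) is a negative lookahead: the replacement is made only when the enclosed pattern does not match at that point. [^(] matches any character except (, * repeats the previous item 0 to n times, and \\) requires this to be followed by ). In other words, a ; is left alone when a ) comes before the next (. Lookaheads need perl=TRUE.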

    Or

    gsub(";(?=[^)]*($|\\())", ";;", in_vec, perl=TRUE)
    #[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
    #[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
    #[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
    #[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       
    

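    ; matches a literal ;. (?=...) is a positive lookahead: the replacement is made only when the enclosed pattern does match at that point. [^)] matches any character except ), * repeats the previous item 0 to n times, and ($|\\() matches either the end of the string $ or a (. In other words, a ; is replaced when the next ( or the end of the string comes before any ).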

    Or use gregexpr and regmatches to extract the parts between ( and ), and make the replacement only in the non-matched substrings:

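    x <- gregexpr("\\(.*?\\)", in_vec)  # Find the parts between ( and )
    # a: the (...) parts, b: the text around them (invert = TRUE)
    mapply(function(a, b) {
      # Interleave b (with ; doubled) and a column by column, then glue together
      paste(matrix(c(gsub(";", ";;", b), a, ""), 2, byrow=TRUE), collapse = "")
    }, regmatches(in_vec, x), regmatches(in_vec, x, TRUE))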
    #[1] "abcd;;ghi;;dfsF(adffg;adfsasdf);;dfg;;(asd;fdsg);;ag"
    #[2] "zvc;;dfasdf;;asdga;;asd(asd;hsfd)"                   
    #[3] "adsg;;(asdg;ASF;DFG;ASDF;);;sdafdf"                  
    #[4] "asagf;;(fafgf;sadg;sdag;a;gddfg;fd)gsfg;;sdfa"       
    

    Note that all three approaches work only for simple, non-nested pairs of an opening ( and a closing ).
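
To illustrate the limitation to simple parentheses, here is a minimal sketch that is not part of the original answer. It compares the first lookahead pattern with a hypothetical helper, replace_outside_parens, on a made-up nested input. The helper tracks parenthesis depth character by character and assumes a single-character target and balanced parentheses.

```r
# Hypothetical helper (not from the answer above): replace `target` only
# where the parenthesis depth is 0, so nested (...) groups are handled too.
# Assumes `target` is a single character and parentheses are balanced.
replace_outside_parens <- function(x, target = ";", replacement = ";;") {
  vapply(x, function(s) {
    chars <- strsplit(s, "")[[1]]
    # Depth after each character: +1 for "(", -1 for ")"
    depth <- cumsum((chars == "(") - (chars == ")"))
    # A target character at depth 0 lies outside every pair of parentheses
    outside <- chars == target & depth == 0
    chars[outside] <- replacement
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

nested <- "a;(b;(c;d);e);f"  # made-up input with nested parentheses

# The lookahead pattern also doubles the ; after b, which is inside the outer (...)
gsub(";(?![^(]*\\))", ";;", nested, perl = TRUE)
#> [1] "a;;(b;;(c;d);e);;f"

# The depth counter leaves everything inside the outer (...) untouched
replace_outside_parens(nested)
#> [1] "a;;(b;(c;d);e);;f"
```

For the non-nested data in the question the helper should give the same result as the gsub calls, so it is only worth the extra code when nesting can occur.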