regex tcl

Regular expression with capturing groups at each end but treated as 'normal' in the middle of a string?


There are a number of scenarios involving text within []s. Each of the words 'front', 'middle', 'end', and 'original' can stand for one or more words, although every phrase is relatively short: fewer than ten words in total.

  1. '[front] original'
  2. 'original [end]'
  3. '[front] original [end]'
  4. '[front] orig [middle] inal [end]'
  5. '[front] orig [middle] inal'
  6. 'orig [middle] inal [end]'

The desired results, written to separate variables, are:

  1. [front], original, {}
  2. {}, original, [end]
  3. [front], original, [end]
  4. [front], orig [middle] inal, [end]
  5. [front], orig [middle] inal, {}
  6. {}, orig [middle] inal, [end]

Thus far, I can only get it to split for a [front] only (1) and an [end] only (2), but not for any of the other scenarios. (I'm always quite dense with regular expressions.)

# Attempt 1: bracketed text at the front only.
regexp {^(\[.*\])(.*)} $english {\1\2} addedEng origEng
set addition 1
set addedEng [string trim $addedEng]
set origEng [string trim $origEng]
# input: {[and now] is the time}
# output:
# addition 1
# addedEng [and now]
# origEng is the time

# Attempt 2: bracketed text at the end only.
regexp {(.*)(\[.*\])$} $english {\1\2} origEng addedEng
set addition 2
set addedEng [string trim $addedEng]
set origEng [string trim $origEng]
# input: {and the time [now is]}
# output:
# addition 2
# addedEng [now is]
# origEng and the time

Both of these fail to separate the text if there are brackets at both ends or in the middle.
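To make the failure concrete, here is a minimal sketch that runs the first pattern against scenario 4. The sample string and the variable names `front` and `rest` are only for illustration. Because `.*` is greedy, the first group runs all the way to the last `]` in the string:

```tcl
# Illustrative only: the first attempt applied to scenario 4.
set english {[front] orig [middle] inal [end]}
regexp {^(\[.*\])(.*)} $english - front rest
puts "front=<$front>"
puts "rest=<$rest>"
# front=<[front] orig [middle] inal [end]>
# rest=<>
```

The whole phrase lands in the first group and nothing is left for the second, which is why the bracketed text in the middle and at the end is never split off.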

I'd like to have three variables, such as addedFront, origEng, and addedEnd, each populated with its text or left empty if there are no []s at that position.

Thank you for any guidance you may be able to provide.


Here are examples of each of the above scenarios (apologies for the sample phrases) using the expression by @WiktorStribiżew. It appears to work exactly as desired. The commented row shows how far off I've been in my feeble attempts. Thank you.

proc parse {} {
    upvar 1 english english
    # regexp {^(\[.*\])?(.*)(\[.*\])?$} $english {\1\2\3} addedFront origEng addedEnd
    regexp {^(\[[^][]*\])?(.*?)(\[[^][]*\])?$} $english - addedFront origEng addedEnd
    chan puts "$addedFront -- $origEng -- $addedEnd"
}
set phrases [list\
{[and now] is the time}\
{and the time [now is]}\
{[now is] the time [age]}\
{[now is] the time [age] of trouble [burdens]}\
{the time [age] of trouble [burdens]}\
{[now is] the time [age] of trouble}\
]
foreach english $phrases { parse }
# [and now] --  is the time -- 
#  -- and the time  -- [now is]
# [now is] --  the time  -- [age]
# [now is] --  the time [age] of trouble  -- [burdens]
#  -- the time [age] of trouble  -- [burdens]
# [now is] --  the time [age] of trouble -- 
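For what it's worth, the same expression can be wrapped in a proc that takes the phrase as an argument and returns the three parts already trimmed, instead of reaching into the caller with `upvar`. This is only a sketch: the name `splitBracketed` and the fallback branch are my own assumptions, not part of the suggested answer.

```tcl
# Hypothetical helper: returns a three-element list {front original end},
# each element trimmed; a missing part comes back as an empty string.
proc splitBracketed {phrase} {
    set re {^(\[[^][]*\])?(.*?)(\[[^][]*\])?$}
    if {![regexp $re $phrase - front orig end]} {
        # Defensive fallback (assumption): treat the whole phrase as original text.
        return [list {} [string trim $phrase] {}]
    }
    return [list [string trim $front] [string trim $orig] [string trim $end]]
}

lassign [splitBracketed {[now is] the time [age] of trouble [burdens]}] \
    addedFront origEng addedEnd
puts "<$addedFront> <$origEng> <$addedEnd>"
# <[now is]> <the time [age] of trouble> <[burdens]>
```

`lassign` needs Tcl 8.5 or later, which the use of `chan puts` above already implies.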

Something that I had not anticipated is punctuation at the []s. I'm not sure where it should really go, with the unbracketed text or with the bracketed text it immediately follows, but I chose the latter, and this appears to work.

proc parse {} {
    upvar 1 english english
    # regexp {^(\[.*\])?(.*)(\[.*\])?$} $english {\1\2\3} addedFront origEng addedEnd
    # regexp {^(\[[^][]*\])?(.*?)(\[[^][]*\])?$} $english {\1\2\3} addedFront origEng addedEnd
    # regexp {^(\[[^][]*\])?(.*?)(\[[^][]*\])?$} $english - addedFront origEng addedEnd
    regexp {^(\[[^][]*\][\.|\?|,|!|;|:]?)?(.*?)(\[[^][]*\][\.|\?|,|!|;|:]?)?$} $english {\1\2\3} addedFront origEng addedEnd
    chan puts -nonewline "$english  =>   "
    chan puts "$addedFront -- $origEng -- $addedEnd"
}
set phrases [list\
{[and now] is the time}\
{and the time [now is]}\
{[now is] the time [age]}\
{[now is] the time [age] of trouble [burdens]}\
{the time [age] of trouble [burdens]}\
{[now is] the time [age] of trouble}\
{[added word]}\
{}\
{[and now], is the time}\
{and the time [now is];}\
{[now is]; the time [age]}\
{[now is]; the time [age] of trouble [burdens]?}\
{the time [age]: of trouble [burdens]?}\
{[now is]! the time [age]? of trouble!}\
{[added word,]!}\
{}
]
foreach english $phrases { parse }

# [and now] is the time  =>   [and now] --  is the time -- 
# and the time [now is]  =>    -- and the time  -- [now is]
# [now is] the time [age]  =>   [now is] --  the time  -- [age]
# [now is] the time [age] of trouble [burdens]  =>   [now is] --  the time [age] of trouble  -- [burdens]
# the time [age] of trouble [burdens]  =>    -- the time [age] of trouble  -- [burdens]
# [now is] the time [age] of trouble  =>   [now is] --  the time [age] of trouble -- 
# [added word]  =>   [added word] --  -- 
#   =>    --  -- 
# [and now], is the time  =>   [and now], --  is the time -- 
# and the time [now is];  =>    -- and the time  -- [now is];
# [now is]; the time [age]  =>   [now is]; --  the time  -- [age]
# [now is]; the time [age] of trouble [burdens]?  =>   [now is]; --  the time [age] of trouble  -- [burdens]?
# the time [age]: of trouble [burdens]?  =>    -- the time [age]: of trouble  -- [burdens]?
# [now is]! the time [age]? of trouble!  =>   [now is]! --  the time [age]? of trouble! -- 
# [added word,]!  =>   [added word,]! --  -- 
#   =>    --  -- 
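One caveat about the punctuation class, offered as a hedged observation rather than a correction of the results above: inside a bracket expression `|` is an ordinary character, not alternation, so a class written as `[\.|\?|,|!|;|:]` also accepts a literal pipe. The sketch below compares it with a plainer class; the variable names `withPipes` and `plain` are made up for the comparison.

```tcl
# Inside a bracket expression, | is just another member of the set.
set withPipes {^[\.|\?|,|!|;|:]$}
set plain     {^[.,;:!?]$}
foreach ch {. ? , ! ; : |} {
    puts "$ch  withPipes=[regexp $withPipes $ch]  plain=[regexp $plain $ch]"
}
# Only the last line differs:
# |  withPipes=1  plain=0
```

For the sample phrases this makes no difference, since none of them has a `|` right after a closing bracket.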


Solution

  • You can use

    ^(\[[^][]*\])?(.*?)(\[[^][]*\])?$
    

    See the Tcl demo

    The regex matches