We have data from a REST API call stored in an output file that looks as follows:
Sample Input File:
test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))
Expected Output:
bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh
Conditions to Derive the Expected Output:
In the line test test123 - test (bla bla1 (On chutti)), the last parenthesis starts at (bla and runs to chutti)), so I need bla bla1, since it comes before the inner (On chutti). So: look for the last parenthesis, and however many pairs of parentheses appear inside it, take the text that comes before them. E.g., in the line testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan)), the values needed are Ranbir kapoor and Milkha Singh.
Attempted Regex: I tried using the following regular expression (working demo of the regex):
Regex:
^(?:^[^(]+\([^)]+\) \(([^(]+)\([^)]+\)\))|[^(]+\(([^(]+)\([^)]+\),\s([^\(]+)\([^)]+\)\s\([^\)]+\)\)|(?:(?:.*?)\((.*?)\(.*?\)\))|(?:[^(]+\(([^)]+)\))$
The regex I tried works fine, but I would like to improve it with advice from the experts here.
Preferred Languages: I am looking to improve this regex, but a Python or awk answer is also OK. I will also try to add an awk answer myself.
Based purely on the input you have shown and your comments indicating that you need to capture 1 or 2 values per line, here is an optimized regex solution:
^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?
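A minimal Python sketch for trying this pattern on the sample lines from the question. The helper name extract_names is hypothetical, and the strip() call is an assumption added here because a capture can keep a trailing space (for example "bla bla1 ").

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(
    r'^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?'
)

def extract_names(line):
    """Return the 1 or 2 captured values of a line, joined by ', '."""
    m = PATTERN.match(line)
    if m is None:
        return ""  # no parenthesized group on this line
    # Drop the unmatched optional group and trim surrounding spaces.
    return ", ".join(g.strip() for g in m.groups() if g)

sample = [
    "test test123 - test (bla bla1 (On chutti))",
    "test test123 bla12 teeee (Rinku Singh)",
    "balle balle (testagain) (Rohit Sharma)",
    "test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), "
    "Milkha Singh (On chutti) (Lagaan))",
]

for line in sample:
    print(extract_names(line))
```

Each line is matched on its own here, so the pattern cannot run across a line break.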
RegEx Details:
This regex solution does the following:
Further Details:
^ : Start
(?: : Start non-capture group
\([^\n)(]*\) : Match any pair of (...) text
| : OR
[^()\n] : Match any character that is not (, ) or \n
)* : End non-capture group. Repeat this 0 or more times
\( : Match last (
([^)(\n]+) : 1st capture group that matches text with 1+ characters that are not (, ) or \n
(?: : Start non-capture group 1
\([^\n)(]*\) : Match any pair of (...) text
[, ]* : Match 0 or more of space or comma characters
(?: : Start non-capture group 2
([^)(\n]+) : 2nd capture group that matches text with 1+ characters that are not (, ) or \n
)? : End non-capture group 2. ? makes this an optional match
)? : End non-capture group 1. ? makes this an optional match
The \n shown inside each negated class is not part of the pattern as written above; adding it keeps a match from running past the end of a line when the whole multi-line input is tested at once.
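Since the regex is limited to 1 or 2 values per line, here is a rough non-regex sketch in Python of the rule described in the question. It is not part of the regex solution. The function name names_in_last_group is hypothetical, and the sketch assumes that values inside the last top-level parenthesis are separated by commas.

```python
def names_in_last_group(line):
    """Collect the text that sits directly inside the last top-level (...) group."""
    depth = 0
    current = []  # depth-1 characters of the group being scanned
    last = ""     # depth-1 text of the most recently closed top-level group
    for ch in line:
        if ch == "(":
            depth += 1
            if depth == 1:
                current = []  # a new top-level group starts
        elif ch == ")":
            if depth == 1:
                last = "".join(current)  # a top-level group just closed
            depth = max(depth - 1, 0)    # tolerate a stray ')'
        elif depth == 1:
            current.append(ch)  # keep text outside any nested (...)
    # Assumption: a comma separates the values.
    parts = [p.strip() for p in last.split(",")]
    return ", ".join(p for p in parts if p)

print(names_in_last_group(
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), "
    "Milkha Singh (On chutti) (Lagaan))"
))
```

Because it counts nesting depth instead of matching a fixed shape, this version would also return three or more values, such as A, B, C for x (A (1), B (2), C (3)).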