python regex awk

Handling complex parentheses structures to get the expected data


We have data from a REST API call stored in an output file that looks as follows:

Sample Input File:

test test123 - test (bla bla1 (On chutti))
test test123 bla12 teeee (Rinku Singh)
balle balle (testagain) (Rohit Sharma)
test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))
testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), Milkha Singh (On chutti) (Lagaan))

Expected Output:

bla bla1
Rinku Singh
Rohit Sharma
Virat kohli
Ranbir kapoor, Milkha Singh

Conditions to Derive the Expected Output:

Attempted Regex: I tried the following regular expression:

Regex:

^(?:^[^(]+\([^)]+\) \(([^(]+)\([^)]+\)\))|[^(]+\(([^(]+)\([^)]+\),\s([^\(]+)\([^)]+\)\s\([^\)]+\)\)|(?:(?:.*?)\((.*?)\(.*?\)\))|(?:[^(]+\(([^)]+)\))$

The regex I tried works on these samples, but I would like to improve it with advice from the experts here.

Preferred Languages: I am mainly looking to improve this regex, but a Python or awk answer is also fine. I will try to add an awk answer myself as well.


Solution

  • Based purely on the input you have shown and on your comments indicating that each line yields 1 or 2 values, here is an optimized regex solution:

    ^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?
    


    RegEx Details:

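    This regex solution does the following:

      • ^ anchors the match at the start of the line.
      • (?:\([^)(]*\)|[^()])* greedily skips any mix of non-parenthesis characters and complete, non-nested (...) groups.
      • \( matches the opening parenthesis of the group that holds the wanted value.
      • ([^)(]+) is capture group 1: the text up to the next parenthesis, i.e. the first value. It may carry a trailing space.
      • (?:\([^)(]*\)[, ]*(?:([^)(]+))?)? is optional: it matches one non-nested (...) qualifier, then any commas or spaces, and then, if present, capture group 2 (the second value).

    The pattern has no $ anchor, so anything after the captured values is ignored.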

    Further Details:
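    To check the regex against the sample input from the question, here is a minimal Python sketch. The names PATTERN, SAMPLE_LINES and extract_names are made up for this illustration; stripping whitespace and joining the values with ", " are assumptions based on the expected output shown in the question, not something the regex does by itself.

```python
import re

# The regex from the answer, unchanged. Group 1 is the first value,
# group 2 (optional) is the second value.
PATTERN = re.compile(
    r'^(?:\([^)(]*\)|[^()])*\(([^)(]+)(?:\([^)(]*\)[, ]*(?:([^)(]+))?)?'
)

# Sample lines copied from the question.
SAMPLE_LINES = [
    "test test123 - test (bla bla1 (On chutti))",
    "test test123 bla12 teeee (Rinku Singh)",
    "balle balle (testagain) (Rohit Sharma)",
    "test test123 test1111 test45345 (Surya) (Virat kohli (Lagaan))",
    "testagain blae kaun hai ye banda (Ranbir kapoor (Lagaan), "
    "Milkha Singh (On chutti) (Lagaan))",
]


def extract_names(line):
    """Return the captured value(s) of one line joined by ', ' (hypothetical helper)."""
    match = PATTERN.search(line)
    if match is None:
        return ""
    # Captured text can end with a space (e.g. "bla bla1 "), so strip it.
    # Group 2 is None when the line holds only one value.
    values = [g.strip() for g in match.groups() if g and g.strip()]
    return ", ".join(values)


if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        print(extract_names(sample))
```

    Run as a script, this prints the five lines of the expected output from the question. Lines without any parenthesized group produce an empty string, which is a choice made for this sketch only.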