android regex quotes subtitle

RegEx fails - Looking for subtitle lines with a missing quote


I have this block of text. It comes from a subtitle file.

1[p]00:06:48,564 --> 00:06:50,814[p]Chúng ta đâu cần bận tâm vì bị đuổi khỏi trường.[pp]2[p]00:06:50,864 --> 00:06:53,914[p]Chiến tranh có thể xảy ra bất cứ lúc nào. Và rồi chúng ta cũng sẽ" phải rời trường thôi.[pp]3[p]00:06:53,954 --> 00:06:55,794[p]Chiến tranh'! Không tuyệt sao, Scarlett?[pp]4[p]00:06:55,844 --> 00:06:57,764[p]Cậu biết không bọn miền Bắc thực sự muốn chiến tranh?[pp]5[p]00:06:57,824 --> 00:07:00,104[p]- Ta sẽ cho b'ọn chúng biết tay.[n][-] Fiddle-dee-dee![pp]6[p]00:07:00,134 --> 00:07:01,544[p]Chiến tranh, "lúc nào" cũng chiến tranh![pp]7[p]00:07:01,584 --> 00:07:04,524[p]Chuyện chiến" tranh "vớ vẩn làm hỏng hết các cuộc vui trong suốt mùa xuân này.[pp]

In the text above, the text between [p] and [pp] is a subtitle line. I want to use a regex to match the text between a [p] and a [pp] that contains exactly one quote character; in other words, I want to find subtitle lines that have a missing quote. I built the regex below and used it with the search function in the QuickEdit app for Android, but it has a problem.

(?<=\[p\])(?!(?:\d{2}\:\d{2}\:\d{2},\d{3} --> \d{2}\:\d{2}\:\d{2},\d{3}))([^\"]+?\"[^\"]+?)(?=\[pp\])

My question is: why does my regex select not only the correct section containing one quote character, but also the [pp] string and the text of the previous subtitle line? How can I fix this? Thank you.
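
For reference, the behaviour can be reproduced outside the app. The following is a minimal sketch, assuming Python's re module handles these constructs the same way as QuickEdit's search engine (an assumption, not something confirmed here). The sample is shortened to entries 1, 2 and 6 of the text above, and the variable names are only illustrative.

```python
import re

# Shortened sample: subtitle entries 1, 2 and 6 from the text above.
text = (
    '1[p]00:06:48,564 --> 00:06:50,814[p]Chúng ta đâu cần bận tâm vì bị đuổi khỏi trường.[pp]'
    '2[p]00:06:50,864 --> 00:06:53,914[p]Chiến tranh có thể xảy ra bất cứ lúc nào. '
    'Và rồi chúng ta cũng sẽ" phải rời trường thôi.[pp]'
    '6[p]00:07:00,134 --> 00:07:01,544[p]Chiến tranh, "lúc nào" cũng chiến tranh![pp]'
)

# The pattern from the question, copied unchanged.
original = re.compile(
    r'(?<=\[p\])(?!(?:\d{2}\:\d{2}\:\d{2},\d{3} --> \d{2}\:\d{2}\:\d{2},\d{3}))'
    r'([^\"]+?\"[^\"]+?)(?=\[pp\])'
)

# Collect the captured group of every match and show it.
matches = [m.group(1) for m in original.finditer(text)]
for m in matches:
    print(repr(m))
```

On this sample the script reports a single match. It starts at the text of entry 1, which has no quote at all, and runs through "[pp]2[p]00:06:50,864 --> 00:06:53,914[p]" into entry 2, which is the unwanted behaviour described above.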


Solution

  • The pattern [^\"]+? does not exclude [p] and [pp], so patterns like

    (?<=\[p\])[^\"]+?(?=\[pp\])
              ^^^^^^^ <- potentially captures [p] or [pp]
    

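    are not guaranteed to capture only one [p]...[pp] section, because the lazy quantifier keeps expanding across section boundaries until the rest of the pattern can match.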

    To fix it, replace each [^\"]+? with a version that excludes [p] and [pp] by means of a negative lookahead:

    (?:(?!\[pp?\])[^\"])+?
    

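    Before it matches each [^\"] character, the lookahead checks that the text at that position does not begin with [p] or [pp] (the pattern \[pp?\]), so the match cannot cross a section boundary.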

    Here's the full regex:

    (?<=\[p\])(?!(?:\d{2}\:\d{2}\:\d{2},\d{3} --> \d{2}\:\d{2}\:\d{2},\d{3}))((?:(?!\[pp?\])[^\"])+?\"(?:(?!\[pp?\])[^\"])+?)(?=\[pp\])
    

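
The fix can be checked with a short script. This is a minimal sketch, assuming Python's re module behaves like the app's regex engine for these constructs (not verified against QuickEdit itself). The sample is shortened to entries 1, 2 and 6 of the question's text, and the variable names are illustrative.

```python
import re

# Shortened sample: subtitle entries 1, 2 and 6 from the question.
text = (
    '1[p]00:06:48,564 --> 00:06:50,814[p]Chúng ta đâu cần bận tâm vì bị đuổi khỏi trường.[pp]'
    '2[p]00:06:50,864 --> 00:06:53,914[p]Chiến tranh có thể xảy ra bất cứ lúc nào. '
    'Và rồi chúng ta cũng sẽ" phải rời trường thôi.[pp]'
    '6[p]00:07:00,134 --> 00:07:01,544[p]Chiến tranh, "lúc nào" cũng chiến tranh![pp]'
)

# Each character is consumed only if [p] or [pp] does not start at that position.
fixed = re.compile(
    r'(?<=\[p\])(?!(?:\d{2}\:\d{2}\:\d{2},\d{3} --> \d{2}\:\d{2}\:\d{2},\d{3}))'
    r'((?:(?!\[pp?\])[^\"])+?\"(?:(?!\[pp?\])[^\"])+?)(?=\[pp\])'
)

# Collect the captured group of every match and show it.
matches = [m.group(1) for m in fixed.finditer(text)]
for m in matches:
    print(repr(m))
```

On this sample the only match is the text of entry 2, the line with one stray quote. Entry 1 (no quote) and entry 6 (a balanced pair of quotes) are skipped, and the match no longer contains [p] or [pp].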