I have this block of text. It comes from a subtitle file.
1[p]00:06:48,564 --> 00:06:50,814[p]Chúng ta đâu cần bận tâm vì bị đuổi khỏi trường.[pp]2[p]00:06:50,864 --> 00:06:53,914[p]Chiến tranh có thể xảy ra bất cứ lúc nào. Và rồi chúng ta cũng sẽ" phải rời trường thôi.[pp]3[p]00:06:53,954 --> 00:06:55,794[p]Chiến tranh'! Không tuyệt sao, Scarlett?[pp]4[p]00:06:55,844 --> 00:06:57,764[p]Cậu biết không bọn miền Bắc thực sự muốn chiến tranh?[pp]5[p]00:06:57,824 --> 00:07:00,104[p]- Ta sẽ cho b'ọn chúng biết tay.[n][-] Fiddle-dee-dee![pp]6[p]00:07:00,134 --> 00:07:01,544[p]Chiến tranh, "lúc nào" cũng chiến tranh![pp]7[p]00:07:01,584 --> 00:07:04,524[p]Chuyện chiến" tranh "vớ vẩn làm hỏng hết các cuộc vui trong suốt mùa xuân này.[pp]
In the text above, the text between [p] and [pp] is a subtitle line of the file. I want to use a regex to match the text between a [p] and a [pp] that contains one quote character; in other words, I want to find subtitle lines that have a missing quote. I built the RegEx below and used it with the search function in the QuickEdit app for Android, but it has a problem.
(?<=\[p\])(?!(?:\d{2}\:\d{2}\:\d{2},\d{3} --> \d{2}\:\d{2}\:\d{2},\d{3}))([^\"]+?\"[^\"]+?)(?=\[pp\])
My question is: why does my RegEx above select not only the correct section of text that contains one quote character, but also the [pp] string and the text of the previous line? Do you know how to fix the problem? Thank you.
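The pattern [^\"]+? does not exclude [p] and [pp], so a pattern like
(?<=\[p\])[^\"]+?(?=\[pp\])
          ^^^^^^^ <- potentially captures [p] or [pp]
is not guaranteed to capture only one [p]...[pp] section. The negated class accepts every character except a quote, so a match can begin in an earlier line that has no quote and run across its [pp] into the next line.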
To fix it, replace each [^\"]+? with a version that uses a negative lookahead to exclude the markers:
(?:(?!\[pp?\])[^\"])+?
Before it matches each [^\"] character, it first makes sure the text at that position is not the start of [p] or [pp] (which is what \[pp?\] matches).
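To see the difference in isolation, here is a small sketch using Python's re module as a stand-in (an assumption; QuickEdit's regex engine may behave differently in details). The names plain, tempered and text are made up for the illustration, and the greedy + is used instead of +? only so the full run is visible.

```python
import re

# Plain negated class: only a quote can stop it, so it runs past the markers.
plain = re.compile(r'[^"]+')
# Guarded version: before each character, the lookahead rejects a position
# where a [p] or [pp] marker begins, so the run stops right before the marker.
tempered = re.compile(r'(?:(?!\[pp?\])[^"])+')

# Hypothetical two-line sample in the same [p]/[pp] layout.
text = 'first line[pp]2[p]second" line[pp]'

plain_run = plain.match(text).group()
tempered_run = tempered.match(text).group()

print(repr(plain_run))     # crosses [pp] and [p], stops only at the quote
print(repr(tempered_run))  # stops before the first [pp]
```

The plain class swallows the end of one line and the start of the next, which is exactly how the original match picks up the previous line.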
Here's the full regex:
(?<=\[p\])(?!(?:\d{2}\:\d{2}\:\d{2},\d{3} --> \d{2}\:\d{2}\:\d{2},\d{3}))((?:(?!\[pp?\])[^\"])+?\"(?:(?!\[pp?\])[^\"])+?)(?=\[pp\])
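As a rough check, the sketch below runs the original and the fixed pattern against three of the subtitle entries from the question. It assumes Python's re module as a stand-in for the QuickEdit search (the two engines may differ in details), and the names SAMPLE, TIMESTAMP, SAFE_CHAR, ORIGINAL and FIXED are only illustrative. The escapes before the colons and quotes are dropped because they are not needed.

```python
import re

# Entries 1, 2 and 6 from the question: no quote, one quote, two quotes.
SAMPLE = (
    '1[p]00:06:48,564 --> 00:06:50,814[p]'
    'Chúng ta đâu cần bận tâm vì bị đuổi khỏi trường.[pp]'
    '2[p]00:06:50,864 --> 00:06:53,914[p]'
    'Chiến tranh có thể xảy ra bất cứ lúc nào. '
    'Và rồi chúng ta cũng sẽ" phải rời trường thôi.[pp]'
    '6[p]00:07:00,134 --> 00:07:01,544[p]'
    'Chiến tranh, "lúc nào" cũng chiến tranh![pp]'
)

TIMESTAMP = r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}'
# One non-quote character that does not begin a [p] or [pp] marker.
SAFE_CHAR = r'(?:(?!\[pp?\])[^"])'

ORIGINAL = re.compile(
    r'(?<=\[p\])(?!' + TIMESTAMP + r')([^"]+?"[^"]+?)(?=\[pp\])'
)
FIXED = re.compile(
    r'(?<=\[p\])(?!' + TIMESTAMP + r')('
    + SAFE_CHAR + r'+?"' + SAFE_CHAR + r'+?)(?=\[pp\])'
)

# With a single capturing group, findall returns the captured text.
original_matches = ORIGINAL.findall(SAMPLE)
fixed_matches = FIXED.findall(SAMPLE)

for label, matches in (('original', original_matches), ('fixed', fixed_matches)):
    print(label)
    for m in matches:
        print('   ', m)
```

On this sample the original pattern returns one match that starts in entry 1 and runs across [pp] into entry 2, while the fixed pattern returns only the text of entry 2. Note that both patterns need at least one character on each side of the quote, so a line that begins or ends with its only quote would not be found.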
Check the test case