python regex text-segmentation

How do I split a paragraph between customer and customer service agent based on rules?


I have a paragraph that records a conversation between a customer and a customer service agent. How do I split the conversation into two lists (or another format, such as a dictionary), one containing only the customer's text and the other containing only the agent's text?

Example paragraph:
Agent Name: Hello! My name is X. How can I help you today? ( 4m 46s ) Customer: My name is Y. Here is my issue ( 4m 57s ) Agent Name: Here's the solution ( 5m 40s ) Agent Name: Are you there? ( 6m 30s ) Customer: Yes I'm still here. I still don't understand... ( 6m 40s ) Agent Name: Ok. Let's try another way. ( 6m 50s ) Agent Name: Does that solve the problem? ( 7m 40s ) Agent Name: Thank you for contacting the customer service.

Expected Output:
List that only contains agent's text: ['Agent Name: Hello! My name is X. How can I help you today? ( 4m 46s )', 'Agent Name: Are you there? ( 6m 30s )', 'Agent Name: Ok. Let's try another way. ( 6m 50s )', 'Agent Name: Does that solve the problem? (7m 40s) Agent Name: Thank you for contacting the customer service.']

List that only contains customer's text: ['Customer: My name is Y. Here is my issue ( 4m 57s )', 'Customer: Yes I'm still here. I still don't understand... ( 6m 40s )'].

Thank you!


Solution

  • given:

    txt='''\
    Agent Name: Hello! My name is X. How can I help you today? ( 4m 46s ) Customer: My name is Y. Here is my issue ( 4m 57s ) Agent Name: Here's the solution ( 5m 40s ) Agent Name: Are you there? ( 6m 30s ) Customer: Yes I'm still here. I still don't understand... ( 6m 40s ) Agent Name: Ok. Let's try another way. ( 6m 50s ) Agent Name: Does that solve the problem? (7m 40s) Agent Name: Thank you for contacting the customer service.'''
    

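    You can use re.findall with a lazy match and a lookahead. Each pattern matches from one speaker's label up to the next occurrence of the other speaker's label (or the end of the string), so consecutive turns by the same speaker end up in a single item: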

    >>> import re
    >>> s1 = 'Agent Name:'
    >>> s2 = 'Customer:'
    >>> re.findall(rf'({s1}.*?(?={s2}|\Z))', txt)
    ['Agent Name: Hello! My name is X. How can I help you today? ( 4m 46s ) ', "Agent Name: Here's the solution ( 5m 40s ) Agent Name: Are you there? ( 6m 30s ) ", "Agent Name: Ok. Let's try another way. ( 6m 50s ) Agent Name: Does that solve the problem? (7m 40s) Agent Name: Thank you for contacting the customer service."]
    
    >>> re.findall(rf'({s2}.*?(?={s1}|\Z))', txt)
    ['Customer: My name is Y. Here is my issue ( 4m 57s ) ', "Customer: Yes I'm still here. I still don't understand... ( 6m 40s ) "]
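
If one list item per turn is preferred over grouping consecutive turns, the same lookahead idea can stop at either speaker label. The following is only a sketch under a few assumptions: the helper name `split_turns` is hypothetical, the speaker labels are exactly `Agent Name:` and `Customer:`, and neither label ever appears inside a message body.

```python
import re

txt = (
    "Agent Name: Hello! My name is X. How can I help you today? ( 4m 46s ) "
    "Customer: My name is Y. Here is my issue ( 4m 57s ) "
    "Agent Name: Here's the solution ( 5m 40s ) "
    "Agent Name: Are you there? ( 6m 30s ) "
    "Customer: Yes I'm still here. I still don't understand... ( 6m 40s ) "
    "Agent Name: Ok. Let's try another way. ( 6m 50s ) "
    "Agent Name: Does that solve the problem? (7m 40s) "
    "Agent Name: Thank you for contacting the customer service."
)

def split_turns(text, speakers=("Agent Name:", "Customer:")):
    """Return a dict mapping each speaker label to a list of that speaker's turns.

    Hypothetical helper: assumes the labels never occur inside a message.
    """
    # Escape the labels so any regex metacharacters in them are taken literally.
    labels = "|".join(re.escape(s) for s in speakers)
    # Capture a label, then lazily capture text up to the next label of
    # either speaker, or up to the end of the string.
    pattern = rf"({labels})(.*?)(?=(?:{labels})|\Z)"
    turns = {s: [] for s in speakers}
    for label, body in re.findall(pattern, text, flags=re.DOTALL):
        turns[label].append(f"{label} {body.strip()}")
    return turns

turns = split_turns(txt)
agent_turns = turns["Agent Name:"]
customer_turns = turns["Customer:"]
```

Here `agent_turns` holds one string per agent message and `customer_turns` one per customer message, each with surrounding whitespace stripped. `re.DOTALL` is included only so that a transcript containing line breaks would still be matched.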