python string python-3.7 text-segmentation

String Segmentation


Solved

I have a string containing a conversation between two people, with each turn labelled by a speaker tag.

I want to split the string into two substrings, one containing only speaker 1's words and the other only speaker 2's.

This is the code I am using to obtain the transcript:

operation = client.long_running_recognize(config, audio)
response = operation.result(timeout=10000)
result = response.results[-1]
words_info = result.alternatives[0].words
transcript = ''
tag=1
speaker=""
for word_info in words_info:
    if word_info.speaker_tag==tag:
        speaker=speaker+" "+word_info.word
    else:
        transcript += "speaker {}: {}".format(tag,speaker) + '\n'
        tag=word_info.speaker_tag
        speaker=""+word_info.word
transcript += "speaker {}: {}".format(tag,speaker)

This transcribes both speaker 1 and speaker 2 in the same file.

Solved: the solution was much simpler. Thanks for the help.

transcript_1 = ''
transcript_2 = ''

# append each word to the transcript of the speaker it is tagged with
for word_info in words_info:
    if word_info.speaker_tag == 1:
        transcript_1 += " " + word_info.word
    elif word_info.speaker_tag == 2:
        transcript_2 += " " + word_info.word
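
If a recording can contain more than two speakers, the same loop can collect words per tag in a dictionary instead of one variable per speaker. This is a minimal sketch, assuming each item exposes `speaker_tag` and `word` as in the code above. The name `split_by_speaker` is illustrative, and the `SimpleNamespace` objects are hypothetical stand-ins for the API response.

```python
from collections import defaultdict
from types import SimpleNamespace


def split_by_speaker(words_info):
    """Group words by speaker tag and join each group into one string."""
    words_by_speaker = defaultdict(list)
    for word_info in words_info:
        words_by_speaker[word_info.speaker_tag].append(word_info.word)
    return {tag: " ".join(words) for tag, words in words_by_speaker.items()}


# Hypothetical stand-in data; real items would come from
# result.alternatives[0].words.
sample_words = [
    SimpleNamespace(speaker_tag=1, word="hello"),
    SimpleNamespace(speaker_tag=2, word="hi"),
    SimpleNamespace(speaker_tag=1, word="there"),
]

transcripts = split_by_speaker(sample_words)
print(transcripts[1])  # hello there
print(transcripts[2])  # hi
```

Joining a list once at the end also avoids the leading space that repeated `+= " " + word` leaves at the start of each transcript.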

Solution

  • It depends on how you get the data: either as a single raw string with all the messages from both speakers, or as separate messages for each speaker.

    A basic approach would be to use the string "speaker N:" (where N is the speaker number) as the tag for each speaker. You could then extract each speaker's messages using tools like NLTK and/or built-in string methods like find().

    Note: By "tag" I mean any expression that lets us determine whether a message comes from a certain speaker.

    Example: You get the whole text, which includes all the interventions of the speakers.

    1) Define a tag for every speaker so their interventions can be told apart in the whole text. Example: The speaker tag for the first speaker could be "speaker 1:"

    2) Find all the interventions of a speaker using str.find(speaker_tag)

    3) Add the interventions of each speaker to a separate data structure. A list of interventions per speaker is a useful choice; if you later want all of a speaker's interventions as one text again, you can combine them with a built-in method like str.join().

    Another option would be to use a tool like NLTK (I think it is a great tool for classifying text).

    It has very useful features, such as tokenization, that I think would help with your problem.

    In the following example, I use find() and slicing for a basic form of text tokenization:

    Text data:

    text = "speaker 1: hello everyone, I am Thomas speaker 2: Hello friends, I am John speaker 1: How are you? I am great being here speaker 2: It's the same for me"
    

    Code example:

    from itertools import islice, tee
    
    FIRST_SPEAKER_TAG = "speaker 1:"
    SECOND_SPEAKER_TAG = "speaker 2:"
    
    def get_speaker_positions(text, speaker_tag):
    
        total_interventions = text.count(speaker_tag)
        positions = []
        position = 0
        for _ in range(total_interventions):
            found = text.find(speaker_tag, position)
            positions.append(found)
            # resume the search just past this occurrence so that the next 
            # call to find() returns the following one
            position = found + len(speaker_tag)
    
        return positions
    
    def slices(iterable, n):
        return zip(*(islice(it, i, None) for i, it in enumerate(tee(iterable, n))))
    
    def get_text_interventions(text, speaker_tags):
    
        # speakers' interventions of the text
        interventions = { speaker_tag: "" for speaker_tag in speaker_tags }
    
        # positions where each intervention starts in the text, sorted so 
        # that the interventions come out in the correct order
        # (len(text) is appended as the end position of the last 
        # intervention)
        speaker_positions = [
            get_speaker_positions(text, speaker) for speaker in speaker_tags
        ]
        all_positions = [
            position for sublist in speaker_positions for position in sublist
        ]
        all_positions.append(len(text))
        all_positions.sort()
    
        # generate the list of pairs that match a certain intervention
        # the pairs are formed by the initial and the end position of the 
        # intervention
        text_chunks = list(slices(all_positions, 2))
    
        for chunk in text_chunks:
    
            # assign the intervention to the speaker whose list of 
            # positions contains the chunk's start position; when slicing, 
            # skip the tag's length so that the tag itself is excluded 
            # from the intervention
            if chunk[0] in speaker_positions[0]:
                intervention = text[chunk[0]+len(speaker_tags[0]):chunk[1]]
                interventions[speaker_tags[0]] += intervention
    
            elif chunk[0] in speaker_positions[1]:
                intervention = text[chunk[0]+len(speaker_tags[1]):chunk[1]]
                interventions[speaker_tags[1]] += intervention
    
        return interventions
    
    text_interventions = get_text_interventions(text, [ FIRST_SPEAKER_TAG, SECOND_SPEAKER_TAG ])
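
For comparison, a more compact variant is possible with the standard `re` module instead of find() and slicing. This is only a sketch, not the approach used above: it assumes every tag has exactly the form "speaker N:" and that this phrase never appears inside the spoken words themselves. The name `split_on_tags` is illustrative.

```python
import re
from collections import defaultdict

text = ("speaker 1: hello everyone, I am Thomas speaker 2: Hello friends, "
        "I am John speaker 1: How are you? I am great being here "
        "speaker 2: It's the same for me")


def split_on_tags(text):
    """Return a dict mapping speaker number to that speaker's joined text."""
    # The capturing group makes re.split keep each speaker number, so
    # parts looks like [preamble, number, message, number, message, ...].
    parts = re.split(r"speaker (\d+):", text)
    messages = defaultdict(list)
    for number, message in zip(parts[1::2], parts[2::2]):
        messages[int(number)].append(message.strip())
    return {number: " ".join(chunks) for number, chunks in messages.items()}


by_speaker = split_on_tags(text)
print(by_speaker[1])  # hello everyone, I am Thomas How are you? I am great being here
print(by_speaker[2])  # Hello friends, I am John It's the same for me
```

Because the speaker number is captured by the pattern, this version handles any number of speakers without listing the tags in advance.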
    

    Notes:

    If anything is unclear, the itertools documentation describes islice() and tee() in more detail.

    Feel free to ask me anything you didn't understand about the example. I hope you find it useful! =)