I'm working on an audio project. My goal is to count the number of people who speak in an audio file. We can assume the noise has already been removed from the audio. For example, if two people are talking in the audio, the program should return 2; if three people are talking, it should return 3, and so on. I don't need speech recognition; I just want to know how many people talk. What is the best way to solve this problem?
If I understand correctly, you are looking for speaker diarization, i.e. working out who spoke when; the number of distinct speakers falls out of that. In this thread, "Python Speaker Recognition", someone listed a few options for Python.
Otherwise, if you want to take the easier route, you can let Google do it for you with their Cloud Speech-to-Text API. Not free, but also really cool.
More about that right here:
https://cloud.google.com/speech-to-text/docs/multiple-voices
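
Whichever diarization tool you pick, the last step is the same: count the distinct speaker labels it assigned. Below is a minimal sketch of that step only. It assumes the output has already been flattened into (word, speaker_tag) pairs, which is roughly the shape of the per-word speaker tags that Google's API returns. The function name `count_speakers`, the `min_words` threshold and the sample data are made up for illustration and are not part of any library.

```python
from collections import Counter


def count_speakers(tagged_words, min_words=1):
    """Count distinct speaker labels in diarization output.

    tagged_words: iterable of (word, speaker_tag) pairs (assumed format).
    min_words: labels attached to fewer words than this are ignored,
               which helps against spurious one-word "speakers".
    """
    # Skip falsy tags (0 or None), assumed here to mean "no speaker assigned".
    words_per_speaker = Counter(tag for _, tag in tagged_words if tag)
    return sum(1 for n in words_per_speaker.values() if n >= min_words)


# Hypothetical diarization output: three labels, one of them seen only once.
sample = [
    ("hello", 1), ("there", 1),
    ("hi", 2), ("how", 2), ("are", 2), ("you", 2),
    ("fine", 1),
    ("um", 3),
]

print(count_speakers(sample))               # 3
print(count_speakers(sample, min_words=2))  # 2
```

Diarizers sometimes split one voice into two labels or invent a label for a short noise, so a threshold like `min_words` is a cheap sanity filter; the right value depends on your recordings.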