I'm creating an audio classification model for animal sounds. It's a hobby project, just to familiarize myself with the techniques. The thing I'm struggling with is that my audio clips differ in duration, and I'm not sure how to cut them to similar lengths. My question is not so much about how to split them (I found many examples of that) but about which duration to choose.
My files contain some silences, but mostly a lot of repetitive sounds, as the dataset is mainly insects. An insect such as a cricket will make a similar, repetitive sound for a long time. So my idea was: if there is a way to detect repetitions in audio files, use that to split each file. Then find the duration of the longest resulting clip and use that as the duration to cut all the audio files to.
But maybe I'm thinking about it all wrong. Does anybody have any suggestions or nice literature for me?
As I have recently done a classification of insect sounds myself (grasshoppers, cicadas, etc.), I can tell you that you will probably need audio chunks of various sizes. I experimented with sizes between 0.5 and 60 seconds, and they all showed specific patterns that carry valuable information.
To get better results I did two things. First, I combined a long time window with a short focus window. Example 1 shows the spectrogram of a long time window of 60 seconds (upper part) with a focus window of 0.6 seconds. In Example 2 I combined a long time window of 40 seconds with four focus windows of 2 seconds each.
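To make the long-plus-focus idea concrete, here is a minimal sketch using only NumPy. It is not the exact pipeline behind the examples above: the function names `spectrogram` and `long_and_focus`, the FFT size, the hop length, the 8 kHz sample rate, and the random test signal are all assumptions chosen for illustration. It cuts one long window and one short focus window out of the same clip and computes a magnitude spectrogram for each.

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=256):
    """Magnitude spectrogram from a Hann-windowed short-time FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Rows are frequency bins, columns are time frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T

def long_and_focus(signal, sr, long_s, focus_s, focus_start_s=0.0):
    """Return spectrograms of a long window and of a focus window inside it."""
    long_part = signal[: int(round(long_s * sr))]
    start = int(round(focus_start_s * sr))
    focus_part = long_part[start : start + int(round(focus_s * sr))]
    return spectrogram(long_part), spectrogram(focus_part)

# Hypothetical input: 60 s of noise at 8 kHz standing in for a real recording.
sr = 8000
clip = np.random.default_rng(0).standard_normal(60 * sr)
long_spec, focus_spec = long_and_focus(
    clip, sr, long_s=60, focus_s=0.6, focus_start_s=10.0
)
print(long_spec.shape, focus_spec.shape)
```

The two spectrograms have very different widths, so they would need to be resized or fed to separate model inputs before being combined; how to do that depends on the model you use.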
Second, as a final step across all of the different time windows, you can use an ensemble method, such as voting, to improve the results.
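As a rough sketch of what voting over several time windows could look like: suppose each window size has its own classifier that outputs either a label or a vector of class probabilities for the same recording. The helper names `majority_vote` and `soft_vote`, the class names, and the numbers below are made up for illustration.

```python
from collections import Counter

import numpy as np

def majority_vote(labels):
    """Hard voting: return the most frequent label (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities, class_names):
    """Soft voting: average the class probabilities, then pick the best class."""
    mean_probs = np.mean(np.asarray(probabilities, dtype=float), axis=0)
    return class_names[int(np.argmax(mean_probs))]

# Hypothetical predictions from three models trained on different window sizes.
labels_per_window = ["cricket", "cicada", "cricket"]
probs_per_window = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]

hard_result = majority_vote(labels_per_window)
soft_result = soft_vote(probs_per_window, ["cricket", "cicada"])
```

Hard and soft voting can disagree, as they do on these made-up numbers, because soft voting also takes into account how confident each model is.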