
Syncing different audio streams for one video


So I wish to have a Raspberry Pi broadcasting Wi-Fi and running a web server with PHP/Python cgi-bin. The objective is to receive up to 5 different short streams (up to 3 minutes each) from people and sync them (to the millisecond). Here is the plan:

  1. Use an HTML page with the getUserMedia method to get video/audio from the user.
  2. Upload the resulting blob to the Raspberry Pi, along with an argument containing the Date.now() value captured when the user started recording (thus giving me the difference in ms between all the media; see the sketch after this list).
  3. On the server side, all files are trimmed (using ffmpeg or something similar) to the same length, with their starting points trimmed according to the Date.now() data received.
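
For reference, here's a minimal sketch of steps 1 and 2 in browser JavaScript. The /upload endpoint and the form field names are placeholders I made up, not something the plan above specifies:

    // Minimal sketch of steps 1 and 2: capture, record, and upload the blob
    // together with a wall-clock start timestamp. "/upload" and the form field
    // names are placeholders.
    async function recordAndUpload() {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

      const chunks = [];
      const recorder = new MediaRecorder(stream);
      recorder.ondataavailable = (e) => chunks.push(e.data);

      const startedAt = Date.now();           // wall-clock start time, sent with the blob
      recorder.start();

      // Stop after at most 3 minutes, the maximum clip length in the plan.
      setTimeout(() => recorder.stop(), 3 * 60 * 1000);

      recorder.onstop = async () => {
        const blob = new Blob(chunks, { type: recorder.mimeType });
        const form = new FormData();
        form.append('media', blob, 'clip.webm');
        form.append('startedAt', String(startedAt));
        await fetch('/upload', { method: 'POST', body: form });
      };
    }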

Now, I foresee a possible issue: what if not all users have the same time on their devices (e.g. because they use different cell providers)? Is that a real possibility?

Is there a way for me to sync the browsers' local clocks while using the... um... let's call it an app? And could connection speed or lag caused by system resources introduce millisecond deltas between the users?


Solution

  • …and sync them (to the millisecond).

    Not physically possible.

    Even if you assume that all devices had exactly the same time at one moment, down to the nanosecond or even better, you cannot assume that the next sample or frame will be in sync. Each device has its own clock and will drift from the others. What may be 48 kHz to your device's audio interface might be 48.001 kHz to mine. You can certainly expect millisecond-scale drift over the course of a few minutes between arbitrary consumer devices.
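
    To put a number on it (a back-of-the-envelope illustration using the 48 kHz vs. 48.001 kHz figures above, not a measurement of any real device):

        // Back-of-the-envelope drift estimate for the 48 kHz vs. 48.001 kHz
        // example above: how far apart are two such clocks after 3 minutes?
        const nominal = 48000;      // Hz, what both devices think they're running at
        const actual  = 48001;      // Hz, what one of them is really running at
        const seconds = 3 * 60;     // a 3-minute clip

        const driftMs = seconds * ((actual - nominal) / nominal) * 1000;
        console.log(`${driftMs.toFixed(2)} ms of drift`);   // ≈ 3.75 ms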

    You need to completely re-think your approach. What do you actually want your videos to be relative to?

    For example, it was popular a while back for several people to dub over a musician's video, adding their own music track. Another musician would add another part, and another. Even after dozens did this, they all remained in sync. The reason is that the dubbing musician's device was playing and recording at the same time, so the pre-recorded video was being played back with the same clock that the derivative video was being recorded with.

    If your videos must be live, then you need some common clock source, and I don't mean clock-on-the-wall time. In the professional world, it's common to use GNSS receivers, which provide highly reliable clock references.

    Only once you've decided how to solve this at this finer scale can you begin to figure out how you'll handle actual time offsets.


    Update

    With more information known from the discussion in comments, let me clarify a bit more.

    The scenario as I understand it is two to five devices in a room, all recording simultaneously. One device will serve as the camera and will record video. The other devices will record audio, as a sort of lavalier microphone.

    Now, back to the original goal of strict synchronization... no, it is still not possible. As stated previously, these devices are going to drift on their own because there is no common clock reference. That includes clock-on-the-wall time and the audio sample clock. They'll be close, but they will drift.

    So, what to do? Change your requirements. This is why I was insisting on knowing what we're actually doing here. We can't cheat physics, so we need to figure out what tradeoffs we need to make to end up with something functional. Here are some of the possibilities.

    Record live like a WebRTC conference call

    The whole WebRTC stack is set up for low latency, and it has a lot of stuff for resampling audio to keep it as realtime as possible. Additionally, the usual implementations of the media capture side have built-in noise cancellation and automatic gain control. These are brutal for music, but suitable for your use case of people speaking.

    Basically, you'd create a WebRTC call between all the devices in the session and a "server" that would do the recording. This could be your Pi, or even an app on one of the phones. You'll undoubtedly get a little echo when voices are picked up by multiple microphones, but many mobile devices are pretty good with noise cancellation these days thanks to phased directional microphones. Is it good enough? You'll have to experiment to find out. The good news is that you can test without building anything: start a Google Meet or similar, join a few "microphone" devices, record, and see how it sounds.
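
    As a rough sketch of what the capture side of that might look like (signaling between the devices and the recording end is omitted, and the constraint names are the standard getUserMedia options; whether a given browser honors them is device-dependent):

        // Rough sketch of the sending side of a WebRTC approach. How the offer
        // reaches the recording "server" (the Pi, or an app on one phone) is up
        // to your signaling layer and is not shown here.
        async function startSending(sendOfferToRecorder) {
          const stream = await navigator.mediaDevices.getUserMedia({
            audio: {
              echoCancellation: true,   // the capture-side processing mentioned
              noiseSuppression: true,   // above: rough on music, fine for speech
              autoGainControl: true,
            },
          });

          const pc = new RTCPeerConnection();
          stream.getTracks().forEach((track) => pc.addTrack(track, stream));

          const offer = await pc.createOffer();
          await pc.setLocalDescription(offer);
          await sendOfferToRecorder(pc.localDescription);   // your signaling here
          return pc;
        }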

    Ignore the consequences, fix it in post

    You could simply record on all the devices and then assemble the audio manually later. Just remember... streams will drift over time, so it's not enough to just set initial offsets.
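
    If you go this route and end up scripting it, it might look something like the sketch below (Node calling ffmpeg; the offset and the tempo factor are illustrative numbers you'd have to find yourself, e.g. by lining the waveforms up on a clap):

        // Illustrative post-production alignment of one clip against a reference,
        // by scripting ffmpeg from Node. Both numbers below are made up: you'd
        // measure the start offset and the drift yourself.
        const { execFile } = require('node:child_process');

        const offsetSeconds = 1.234;     // hypothetical: how late this clip started
        const driftFactor   = 1.00002;   // hypothetical: tiny speed-up to counter drift

        execFile('ffmpeg', [
          '-ss', String(offsetSeconds),          // trim the late start
          '-i', 'phone2.webm',                   // hypothetical input file
          '-filter:a', `atempo=${driftFactor}`,  // compensate for clock drift
          '-t', '180',                           // cut everything to the same 3-minute length
          'phone2-aligned.wav',
        ], (err) => {
          if (err) throw err;
          console.log('aligned');
        });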

    Pick one mic at a time

    This can be applied to any method... if someone is speaking, mute or attenuate the other mics. A WebRTC MCU will do this for you.

    Don't. Do something different instead.

    If the imperfect solutions aren't of good enough quality, consider doing something different. While having a mic on each person is one of the better ways to pick up those voices, it's not the only way. You could simply have a better-placed omnidirectional mic in the room and connect it to the video recording device. There are also conference room array microphones that use beamforming phased arrays to "zoom in" on a voice.

    Whatever solution you choose, just remember the physics of the problem you're trying to solve, and the tradeoffs you're willing to make. No solution is perfect in every way.