Tags: ios, video, synchronization, avfoundation, core-motion

iOS: Synchronizing frames from camera and motion data


I'm trying to capture frames from the camera together with the associated motion data. For synchronization I'm using timestamps. Video and motion are written to a file and then processed. In that process I can calculate the motion-to-frame offset for every video.

It turns out that motion data and video data for the same timestamp are offset from each other by a different amount each time, anywhere from 0.2 s up to 0.3 s. The offset is constant within one video but varies from video to video. If it were the same offset every time I would be able to subtract some calibrated value, but it's not.

Is there a good way to synchronize timestamps? Maybe I'm not recording them correctly? Is there a better way to bring them to the same frame of reference?

CoreMotion returns timestamps relative to system uptime, so I add an offset to get Unix time:

uptimeOffset = [[NSDate date] timeIntervalSince1970] - 
                   [NSProcessInfo processInfo].systemUptime;

CMDeviceMotionHandler blk =
    ^(CMDeviceMotion * _Nullable motion, NSError * _Nullable error){
        if(!error){
            motionTimestamp = motion.timestamp + uptimeOffset;
            ...
        }
    };

[motionManager startDeviceMotionUpdatesUsingReferenceFrame:CMAttitudeReferenceFrameXTrueNorthZVertical
                                                   toQueue:[NSOperationQueue currentQueue]
                                               withHandler:blk];

To get frame timestamps with high precision I'm using AVCaptureVideoDataOutputSampleBufferDelegate. The timestamps are offset to Unix time as well:

-(void)captureOutput:(AVCaptureOutput *)captureOutput
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection
{
    CMTime frameTime = CMSampleBufferGetOutputPresentationTimeStamp(sampleBuffer);

    if(firstFrame)
    {
        firstFrameTime = CMTimeMake(frameTime.value, frameTime.timescale);
        startOfRecording = [[NSDate date] timeIntervalSince1970];
    }

    CMTime presentationTime = CMTimeSubtract(frameTime, firstFrameTime);
    float seconds = CMTimeGetSeconds(presentationTime);

    frameTimestamp = seconds + startOfRecording;
    ...
}

Solution

  • It is actually pretty simple to correlate these timestamps - although it's not clearly documented, both camera frame and motion data timestamps are based on the mach_absolute_time() timebase.

    This is a monotonic timer that is reset at boot, but importantly also stops counting when the device is asleep. So there's no easy way to convert it to a standard "wall clock" time.

    Thankfully you don't need to, as the timestamps are directly comparable: motion.timestamp is in seconds, and you can log mach_absolute_time() in the callback to see that it is the same timebase. My quick test shows the motion timestamp is typically about 2 ms before mach_absolute_time() in the handler, which seems about right for how long it might take for the data to be reported to the app.

    Note that mach_absolute_time() is in tick units that need conversion to nanoseconds; on iOS 10 and later you can just use the equivalent clock_gettime_nsec_np(CLOCK_UPTIME_RAW), which does the same thing.

        // Requires #import <mach/mach_time.h> for mach_absolute_time()/mach_timebase_info()
        [_motionManager
         startDeviceMotionUpdatesUsingReferenceFrame:CMAttitudeReferenceFrameXArbitraryZVertical
         toQueue:[NSOperationQueue currentQueue]
         withHandler:^(CMDeviceMotion * _Nullable motion, NSError * _Nullable error) {
            // motion.timestamp is in seconds; convert to nanoseconds
            uint64_t motionTimestampNs = (uint64_t)(motion.timestamp * 1e9);
            
            // Get conversion factors from ticks to nanoseconds
            struct mach_timebase_info timebase;
            mach_timebase_info(&timebase);
            
            // mach_absolute_time in nanoseconds
            uint64_t ticks = mach_absolute_time();
            uint64_t machTimeNs = (ticks * timebase.numer) / timebase.denom;
            
            int64_t difference = machTimeNs - motionTimestampNs;
            
            NSLog(@"Motion timestamp: %llu, machTime: %llu, difference %lli", motionTimestampNs, machTimeNs, difference);
        }];
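
    On iOS 10 and later the same check can be written without the manual tick conversion. A minimal sketch, assuming it runs inside the same motion handler as above:

        // Requires #include <time.h>; CLOCK_UPTIME_RAW uses the same timebase as mach_absolute_time()
        uint64_t machTimeNs = clock_gettime_nsec_np(CLOCK_UPTIME_RAW); // already in nanoseconds
        uint64_t motionTimestampNs = (uint64_t)(motion.timestamp * 1e9);
        NSLog(@"Motion timestamp: %llu, machTime: %llu, difference %lli",
              motionTimestampNs, machTimeNs, (int64_t)(machTimeNs - motionTimestampNs));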
    

    For the camera, the timebase is also the same:

    // In practice gives the same value as the CMSampleBufferGetOutputPresentationTimeStamp
    // but this is the media's "source" timestamp which feels more correct
    CMTime frameTime = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
    uint64_t frameTimestampNs = (uint64_t)(CMTimeGetSeconds(frameTime) * 1e9);
    

    The delay between the timestamp and the handler being called is a bit larger here, usually in the tens of milliseconds.
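
    One way to see that delay is to log the gap between the frame's timestamp and the current uptime inside the delegate callback. A quick sketch, assuming iOS 10+ for clock_gettime_nsec_np (otherwise use the mach_absolute_time() conversion above):

    // Inside captureOutput:didOutputSampleBuffer:fromConnection:
    CMTime frameTime = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
    uint64_t frameTimestampNs = (uint64_t)(CMTimeGetSeconds(frameTime) * 1e9);
    uint64_t nowNs = clock_gettime_nsec_np(CLOCK_UPTIME_RAW); // same timebase as the frame timestamp
    NSLog(@"Frame delivered %.1f ms after its timestamp", (nowNs - frameTimestampNs) / 1e6);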

    We now need to consider what a timestamp on a camera frame actually means - there are two issues here: finite exposure time, and rolling shutter.

    Rolling shutter means that not all scanlines of the image are actually captured at the same time - the top row is captured first and the bottom row last. This rolling readout of the data is spread over the entire frame time, so in 30 FPS camera mode the final scanline's exposure start/end time is almost exactly 1/30 second after the respective start/end time of the first scanline.

    My tests indicate the presentation timestamp in the AVFoundation frames is the start of the readout of the frame - ie the end of the exposure of the first scanline. So the end of the exposure of the final scanline is frameDuration seconds after this, and the start of the exposure of the first scanline was exposureTime seconds before this. So a timestamp right in the centre of the frame exposure (the midpoint of the exposure of the middle scanline of the image) can be calculated as:

    const double frameDuration = 1.0/30; // rolling shutter effect, depends on camera mode
    const double exposure = CMTimeGetSeconds(avCaptureDevice.exposureDuration); // exposureDuration is a CMTime
    CMTime frameTime = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
    double midFrameTime = CMTimeGetSeconds(frameTime) - exposure * 0.5 + frameDuration * 0.5;
    

    In indoor settings, the exposure usually ends up being the full frame time anyway, so the midFrameTime above ends up identical to the frameTime. The difference becomes noticeable (under extremely fast motion) with the short exposures you typically get in brightly lit outdoor scenes.
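
    Rather than hard-coding 1/30, both values can be read from the capture device at runtime. A sketch, with the assumption that the device's activeVideoMinFrameDuration matches the rate at which frames are actually delivered:

    // avCaptureDevice is the AVCaptureDevice feeding the session
    const double frameDuration = CMTimeGetSeconds(avCaptureDevice.activeVideoMinFrameDuration); // e.g. 1/30 s at 30 FPS
    const double exposure = CMTimeGetSeconds(avCaptureDevice.exposureDuration);
    CMTime frameTime = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
    double midFrameTime = CMTimeGetSeconds(frameTime) - exposure * 0.5 + frameDuration * 0.5;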

    Why the original approach had different offsets

    I think the main cause of your offset is that you assume the timestamp of the first frame is the time that the handler runs - ie it doesn't account for any delay between capturing the data and delivering it to your app. Especially if you're using the main queue for these handlers, I can imagine the callback for that first frame being delayed by the 0.2-0.3 s you mention.
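
    Putting the pieces together, a minimal sketch of the corrected recording: keep the raw timestamps on the shared timebase and match them in post-processing, instead of anchoring either stream to [NSDate date]:

    // In the motion handler: keep motion.timestamp as-is (seconds, mach timebase)
    double motionTimestamp = motion.timestamp;            // no uptimeOffset

    // In the capture delegate: keep the frame's own timestamp
    CMTime frameTime = CMSampleBufferGetPresentationTimeStamp(sampleBuffer);
    double frameTimestamp = CMTimeGetSeconds(frameTime);  // no firstFrameTime / NSDate anchoring

    // motionTimestamp and frameTimestamp now share one timebase, so frames can be
    // matched to the nearest motion sample directly (optionally shifting the frame
    // timestamp to mid-exposure as above), with no per-video calibration offset.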