I'm trying to wrap my head around this task and wondering if there is a standard way of doing this or some libraries that would be useful.
Certain events are tracked and timed at several data sources S1 ... SN. The recorded information is the event type and the timestamp. There may be several events of the same type in sequence, or they may be intermittent. There could be "missing" events, i.e. events that one of the sources fails to record, and, vice versa, "false positives" that a source reports although no event occurred. There is typically a time difference between observations of the same event at different sources. This time difference has a constant component due to the physical location of the sources, but may also have a varying component introduced by network latency and other factors.
I need an algorithm that determines the optimal maximum time interval within which the observations from all sources should be grouped into a single "observed event", so that missing events and false positives can be detected.
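To make the grouping step concrete, here is a minimal Python sketch of it. It assumes the constant per-source offsets are already known and that the window is simply picked by hand, so it does not answer the question of how to choose the optimal window. The names Observation, group_observations, missing_sources, offsets and window are made up for illustration.

```python
from collections import namedtuple

# Hypothetical record layout: which source saw which event type, and when.
Observation = namedtuple("Observation", "source event_type timestamp")


def group_observations(observations, offsets, window):
    """Greedily group observations of the same event type.

    offsets: assumed-known constant delay per source, in seconds.
    window:  maximum spread (seconds) allowed inside one group,
             measured from the group's first corrected timestamp.
    """
    def corrected(obs):
        # Remove the constant component of the delay for this source.
        return obs.timestamp - offsets.get(obs.source, 0.0)

    groups = []
    open_groups = {}  # event_type -> (start_time, observations in the group)
    for obs in sorted(observations, key=corrected):
        t = corrected(obs)
        current = open_groups.get(obs.event_type)
        seen = {o.source for o in current[1]} if current else set()
        # Start a new group when the window is exceeded, or when the same
        # source reports again (a second event of the same type).
        if current is None or t - current[0] > window or obs.source in seen:
            current = (t, [])
            open_groups[obs.event_type] = current
            groups.append(current[1])
        current[1].append(obs)
    return groups


def missing_sources(group, all_sources):
    """Sources that did not contribute an observation to this group."""
    return sorted(set(all_sources) - {o.source for o in group})
```

A group that lacks some sources points at a missed event there, and a group containing a single source is a candidate false positive. Running this over historical data with different window values and counting incomplete groups would be one way to compare candidate intervals.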
I am wondering if the solution is really somewhere in the field of statistics rather than algorithms. Any input would be much appreciated.
Sounds like you're building an attendance system :-) In the system I'm currently building, this kind of grouping of observations is also necessary. In my case there are employees who have a pass that they hold in front of a pass reader to register their attendance. First the system selects all attendance registrations for one employee. Then it puts them in boxes of one day, ordered by registration time. Every registration is assessed on whether it is a start or a stop. If the first registration is a start registration, the system searches for a stop registration up to 12 hours later. If no stop arrives, a stop is inserted. Additional intelligence can be added when the planned schedule is known. Perhaps you could use statistics, but in my case it was a question of algorithms, combined with knowledge of the organisation.
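A rough Python sketch of the start/stop pairing described above, for one employee. It makes several assumptions that are not part of the real system: registrations simply alternate between start and stop, an inserted stop is placed exactly 12 hours after the start, and the per-day boxing is left out. The names pair_registrations and MAX_SHIFT are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed maximum distance between a start and its stop registration.
MAX_SHIFT = timedelta(hours=12)


def pair_registrations(times, max_shift=MAX_SHIFT):
    """Pair one employee's registrations into (start, stop, inserted) tuples.

    inserted is True when no stop arrived within max_shift and a
    synthetic stop had to be added.
    """
    shifts = []
    start = None
    for t in sorted(times):
        if start is None:
            # No open shift: this registration is a start.
            start = t
        elif t - start <= max_shift:
            # Registration within the window closes the open shift.
            shifts.append((start, t, False))
            start = None
        else:
            # The stop never came: insert one (assumption: at start + max_shift)
            # and treat the current registration as a new start.
            shifts.append((start, start + max_shift, True))
            start = t
    if start is not None:
        # Trailing start without any later registration.
        shifts.append((start, start + max_shift, True))
    return shifts
```

The inserted flag keeps the synthetic stops distinguishable from real registrations, so they can be reviewed or corrected later, for example against the planning.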