sqlgraph-databasesorientdb

How to write a select query or server-side function that will generate a neat time-flow graph from many data points?


Notes:

I have data that is represented in a time-flow manner; i.e. EventC occurs after EventB which occurs after EventA, etc. This data is coming from multiple sources, so it is not completely linear. It needs to be congregated together, which is where I'm having the issue.

Currently, the data looks something like this:

#     |  event   |  next
--------------------------
12:0  |  EventA  |  12:1
12:1  |  EventB  |  12:2
12:2  |  EventC  |  
12:3  |  EventA  |  12:4
12:4  |  EventD  |  

Where "next" is the out() edge to the event that comes next in the time-flow. On a graph, this comes out to look like:

EventA-->EventB-->EventC
EventA-->EventD

Since this data needs to be congregated together, I need to merge duplicate events but preserve their edges. In other words, I need a select query that will result in:

        -->EventB-->EventC
EventA--|
        -->EventD

In this example, since EventB and EventD both occurred after EventA (just at different times), the select query will show two branches off EventA as opposed to two separate time-flows.


If an additional set of data were to be added to the data above, with EventB->EventE, the resulting data/graph would look like:

#     |  event   |  next
--------------------------
12:0  |  EventA  |  12:1
12:1  |  EventB  |  12:2
12:2  |  EventC  |  
12:3  |  EventA  |  12:4
12:4  |  EventD  |  
12:5  |  EventB  |  12:6
12:6  |  EventE  |  

EventA-->EventB-->EventC
EventA-->EventD
EventB-->EventE

I need a query to produce a tree like:

                    -->EventC
        -->EventB--|
        |           -->EventE
EventA--|
        -->EventD

Here is the data with edges shown as opposed to the "next" column above. I also added a couple additional columns here to hopefully clear up any confusion about the data:

#     |  event   |    ip_address    |       timestamp     |   in   |  out  |
----------------------------------------------------------------------------
12:0  |  EventA  |  123.156.189.18  | 2015-04-17 12:48:01 |        |  13:0 |
12:1  |  EventB  |  123.156.189.18  | 2015-04-17 12:48:32 |  13:0  |  13:1 |
12:2  |  EventC  |  123.156.189.18  | 2015-04-17 12:48:49 |  13:1  |       |
12:3  |  EventA  |  103.145.187.22  | 2015-04-17 14:03:08 |        |  13:2 |
12:4  |  EventD  |  103.145.187.22  | 2015-04-17 14:05:23 |  13:2  |       |
12:5  |  EventB  |  96.109.199.184  | 2015-04-17 21:53:00 |        |  13:3 |
12:6  |  EventE  |  96.109.199.184  | 2015-04-17 21:53:07 |  13:3  |       |

The data is saved like this to preserve each individual event and the flow of a session (labeled by the ip address).

TL;DR

Got lots of events, some duplicates, and need them all organized into one neat time-flow graph.


Solution

  • This isn't optimized for performance (oh the loops!), but gets the job done for the time being while I can work on performance. The resulting OrientDB server-side function (written in javascript):

    The function:

    // Clear previous runs
    db.command("truncate class tmp_Then");
    db.command("truncate class tmp_Events");
    
    // Get all distinct events
    var distinctEvents = db.query("select from Events group by event");
    
    // Send 404 if null, otherwise proceed
    if (distinctEvents == null) {
      response.send(404, "Events not found", "text/plain", "Error: events not found" );
    } else {
      var edges = [];
    
      // Loop through all distinct events
      distinctEvents.forEach(function(distinctEvent) {
        var newEvent = [];
        var rid = distinctEvent.field("@rid");
        var eventType = distinctEvent.field("event");
        
        // The main query that finds all *direct* descendents of the distinct event
        var result = db.query("select from (traverse * from (select from Events where event = ?) where $depth <= 2) where @class = 'Events' and $depth > 1 and @rid in (select from Events group by event)", [eventType]);
        
        // Save the distinct event in a temp table to create temp edges
        db.command("create vertex tmp_Events set rid = ?, event = ?", [rid, event]);
          edges.push(result);
        });
    
      // The edges array defines which edges should exist for a given event
      edges.forEach(function(edge, index) {
        edge.forEach(function(e) {
          // Create the temp edge that corresponds to its distinct event
          db.command("create edge tmp_Then from (select from tmp_Events where rid = " + distinctEvents[index].field("@rid") + ") to (select from tmp_Events where rid = " + e.field("@rid") + ")");
        });
      });
    
      var result = db.query("select from tmp_Events");
      return result;
    }
    

    Takeaways: