algorithmoptimizationtreepipelinedependency-tree

How to sort and chunk a dependency tree of actions, so you can batch as many actions as possible together at each step?


Say you have a bunch of actions for creating/inserting records into a bunch of different database tables. You have some records which can be inserted without any dependency on the output of any other insert. You have some which need to wait for one other thing to finish. And you have others that need to wait for many things to finish, which might finish at different times in the flow.

How can you write an algorithm which would sort and chunk the actions in the dependency tree so the inserts / database actions can be optimally batched? By optimally batched, I mean if you can insert 10 records into the same table at once, then do that. Any time you can batch insert, you should, to minimize the number of database calls/inserts.

Here is a snippet from the example code I whipped together with a fake sequence of actions using a simple data structure to capture all the required information.

...
{ action: 'create', table: 'tb1', set: 'key12', input: {
  p: { type: 'binding', path: ['key10', 'z'] },
  q: { type: 'binding', path: ['key11', 'a'] }
} },
{ action: 'create', table: 'tb4', set: 'key13' },
{ action: 'create', table: 'tb3', set: 'key14' },
{ action: 'create', table: 'tb4', set: 'key15', input: {
  a: { type: 'binding', path: ['key8', 'z'] },
} },
...

Note that there are 4 possible properties on our "dependency node item":

Given that, it should be possible somehow to construct a simple algorithm that chunks the actions into subsets which can be run in parallel AND batched. For example, our code below would end up something like this structure (I manually created this output, so there could be mistakes though I think I got it right. Oh and note, while the tables are numbered, it doesn't imply an order to them necessarily, just were simple names to pick):

// this is "close to" the desired result, because
// there might be a slightly different sort applied
// to this when implemented, but the chunking pattern
// and grouping of elements should be exactly like This
// (I am pretty sure, I manually did this 
// so there could be small mistakes, but I doubt it)
const closeToDesiredResult = [
  [
    [
      { action: 'create', table: 'tb1', set: 'key1' },
      { action: 'create', table: 'tb1', set: 'key21' },
    ],
    [
      { action: 'create', table: 'tb2', set: 'key2' },
      { action: 'create', table: 'tb2', set: 'key3' },
      { action: 'create', table: 'tb2', set: 'key23' },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key6' },
      { action: 'create', table: 'tb4', set: 'key8' },
      { action: 'create', table: 'tb4', set: 'key13' },
    ],
    [
      { action: 'create', table: 'tb3', set: 'key5' },
      { action: 'create', table: 'tb3', set: 'key7' },
      { action: 'create', table: 'tb3', set: 'key9' },
      { action: 'create', table: 'tb3', set: 'key14' },
      { action: 'create', table: 'tb3', set: 'key24' },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key17' },
    ],
    [
      { action: 'create', table: 'tb5', set: 'key16' },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb1', set: 'key4', input: {
        x: { type: 'binding', path: ['key2', 'baz'] }
      } },
    ],
    [
      { action: 'create', table: 'tb3', set: 'key10', input: {
        y: { type: 'binding', path: ['key6', 'foo'] },
        z: { type: 'binding', path: ['key1', 'bar'] }
      } },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key15', input: {
        a: { type: 'binding', path: ['key8', 'z'] },
      } },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb1', set: 'key12', input: {
        p: { type: 'binding', path: ['key10', 'z'] },
        q: { type: 'binding', path: ['key11', 'a'] }
      } },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key11', input: {
        a: { type: 'binding', path: ['key10', 'z'] },
        b: { type: 'binding', path: ['key1', 'bar'] }
      } },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key18', input: {
        m: { type: 'binding', path: ['key4', 'x'] },
      } },
      { action: 'create', table: 'tb6', set: 'key19', input: {
        m: { type: 'binding', path: ['key4', 'x'] },
        n: { type: 'binding', path: ['key13', 'a'] },
      } },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb2', set: 'key22', input: {
        w: { type: 'binding', path: ['key18', 'm'] },
        x: { type: 'binding', path: ['key17', 'm'] },
      } },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key20', input: {
        m: { type: 'binding', path: ['key18', 'm'] },
        n: { type: 'binding', path: ['key17', 'm'] },
      } },
    ]
  ]
]

Notice how there are 4 top-level chunks in the resulting array. These are the main steps. Then within each step, everything is grouped by table, so they can all be run in parallel, and within each table group, they can all be batch inserted. Boom.

How would you implement this, it seems quite tricky for my mind to grasp?

const actionTree = generateActionTree()
const chunkedActionTree = chunkDependencyTree(actionTree)

function chunkDependencyTree(list) {
  const independentOnesMapByTableName = {}
  list.forEach(node => {
    // easy case
    if (!node.input) {
      const group = independentOnesMapByTableName[node.table]
        = independentOnesMapByTableName[node.table] ?? []
      group.push(node)
    } else {
      // I am at a loss for words...
    }
  })
}

function generateActionTree() {
  // this would be constructed through a bunch of real-world
  // functions, queuing up all the actions
  // and pointing outputs to inputs.
  return [
    { action: 'create', table: 'tb1', set: 'key1' },
    { action: 'create', table: 'tb2', set: 'key2' },
    { action: 'create', table: 'tb2', set: 'key3' },
    { action: 'create', table: 'tb3', set: 'key5' },
    { action: 'create', table: 'tb4', set: 'key6' },
    { action: 'create', table: 'tb3', set: 'key7' },
    { action: 'create', table: 'tb4', set: 'key8' },
    { action: 'create', table: 'tb3', set: 'key9' },
    { action: 'create', table: 'tb3', set: 'key10', input: {
      y: { type: 'binding', path: ['key6', 'foo'] },
      z: { type: 'binding', path: ['key1', 'bar'] }
    } },
    { action: 'create', table: 'tb1', set: 'key4', input: {
      x: { type: 'binding', path: ['key2', 'baz'] }
    } },
    { action: 'create', table: 'tb4', set: 'key11', input: {
      a: { type: 'binding', path: ['key10', 'z'] },
      b: { type: 'binding', path: ['key1', 'bar'] }
    } },
    { action: 'create', table: 'tb1', set: 'key12', input: {
      p: { type: 'binding', path: ['key10', 'z'] },
      q: { type: 'binding', path: ['key11', 'a'] }
    } },
    { action: 'create', table: 'tb4', set: 'key13' },
    { action: 'create', table: 'tb3', set: 'key14' },
    { action: 'create', table: 'tb4', set: 'key15', input: {
      a: { type: 'binding', path: ['key8', 'z'] },
    } },
    { action: 'create', table: 'tb5', set: 'key16' },
    { action: 'create', table: 'tb6', set: 'key17' },
    { action: 'create', table: 'tb6', set: 'key18', input: {
      m: { type: 'binding', path: ['key4', 'x'] },
    } },
    { action: 'create', table: 'tb6', set: 'key19', input: {
      m: { type: 'binding', path: ['key4', 'x'] },
      n: { type: 'binding', path: ['key13', 'a'] },
    } },
    { action: 'create', table: 'tb6', set: 'key20', input: {
      m: { type: 'binding', path: ['key18', 'm'] },
      n: { type: 'binding', path: ['key17', 'm'] },
    } },
    { action: 'create', table: 'tb1', set: 'key21' },
    { action: 'create', table: 'tb2', set: 'key22', input: {
      w: { type: 'binding', path: ['key18', 'm'] },
      x: { type: 'binding', path: ['key17', 'm'] },
    } },
    { action: 'create', table: 'tb2', set: 'key23' },
    { action: 'create', table: 'tb3', set: 'key24' },
  ]
}

I think this is roughly topological sorting, but not quite sure how to apply it to this specific situation.


Solution

  • Your data structure isn't clear to me. What are the single letter ids p, q, etc.? I also don't understand the role of tables. You can insert multiple items in the same table in one write, no? I'm assuming these tihngs don't matter in the root sequencing problem.

    I'm treating the set field as a "job" and the corresponding keys mentioned in the inputs as dependencies: jobs that must be completed before it.

    I don't have time to be thorough here, and I don't have a javascript environment handy, so this is in Python.

    Let's extract a dependency graph from the the verbose data structure. Then look for "levels." The first level is all nodes with no dependencies. The second is all nodes with dependencies met in any previous level, etc. Rinse and repeate.

    Note unlike I was thinking in my note in comments, this is not a level graph by the traditional definition.

    Also, I'm not going to bother with data structures to make this efficient. You could do it in O(n log n) time. My code is O(n^2).

    Sorry if I'm misinterpreting your question. Also sorry for untested, possibly buggy implementation here.

    from collections import defaultdict
    
    def ExtractGraph(cmds):
      """Gets a dependency graph from verbose command input data."""
      graph = defaultdict(set)
      for cmd in cmds:
        node = cmd['set']
        graph[node].update(set())
        inputs = cmd.get('input')
        if inputs:
          for _, val in inputs.items():
            graph[node].add(val['path'][0])
      return graph
    
    def FindSources(graph):
      """Returns the nodes of the given graph having no dependencies."""
      sources = set()
      for node, edges in graph.items():
        if not edges:
          sources.add(node)
      return sources
    
    def GetLevels(dependencies):
      """Returns sequence levels satisfying given dependency graph."""
      sources = FindSources(dependencies)
      level = set(sources)
      done = set(level)
      todos = dependencies.keys() - done
      levels = []
      while level:
        levels.append(level)
        # Next level is jobs that have all dependencies done
        new_level = set()
        # A clever data structure could find the next level in O(k log n)
        # for a level size of k and n jobs. This needs O(n).
        for todo in todos:
          if dependencies[todo].issubset(done):
            new_level.add(todo)
        todos.difference_update(new_level)
        done.update(new_level)
        level = new_level
      return levels
    
    cmds = [
        { 'action' : 'create', 'table' : 'tb1', 'set' : 'key1' },
        { 'action' : 'create', 'table' : 'tb2', 'set' : 'key2' },
        { 'action' : 'create', 'table' : 'tb2', 'set' : 'key3' },
        { 'action' : 'create', 'table' : 'tb3', 'set' : 'key5' },
        { 'action' : 'create', 'table' : 'tb4', 'set' : 'key6' },
        { 'action' : 'create', 'table' : 'tb3', 'set' : 'key7' },
        { 'action' : 'create', 'table' : 'tb4', 'set' : 'key8' },
        { 'action' : 'create', 'table' : 'tb3', 'set' : 'key9' },
        { 'action' : 'create', 'table' : 'tb3', 'set' : 'key10', 'input' : {
          'y' : { 'type' : 'binding', 'path' : ['key6', 'foo'] },
          'z' : { 'type' : 'binding', 'path' : ['key1', 'bar'] }
        } },
        { 'action' : 'create', 'table' : 'tb1', 'set' : 'key4', 'input' : {
          'x' : { 'type' : 'binding', 'path' : ['key2', 'baz'] }
        } },
        { 'action' : 'create', 'table' : 'tb4', 'set' : 'key11', 'input' : {
          'a' : { 'type' : 'binding', 'path' : ['key10', 'z'] },
          'b' : { 'type' : 'binding', 'path' : ['key1', 'bar'] }
        } },
        { 'action' : 'create', 'table' : 'tb1', 'set' : 'key12', 'input' : {
          'p' : { 'type' : 'binding', 'path' : ['key10', 'z'] },
          'q' : { 'type' : 'binding', 'path' : ['key11', 'a'] }
        } },
        { 'action' : 'create', 'table' : 'tb4', 'set' : 'key13' },
        { 'action' : 'create', 'table' : 'tb3', 'set' : 'key14' },
        { 'action' : 'create', 'table' : 'tb4', 'set' : 'key15', 'input' : {
          'a' : { 'type' : 'binding', 'path' : ['key8', 'z'] },
        } },
        { 'action' : 'create', 'table' : 'tb5', 'set' : 'key16' },
        { 'action' : 'create', 'table' : 'tb6', 'set' : 'key17' },
        { 'action' : 'create', 'table' : 'tb6', 'set' : 'key18', 'input' : {
          'm' : { 'type' : 'binding', 'path' : ['key4', 'x'] },
        } },
        { 'action' : 'create', 'table' : 'tb6', 'set' : 'key19', 'input' : {
          'm' : { 'type' : 'binding', 'path' : ['key4', 'x'] },
          'n' : { 'type' : 'binding', 'path' : ['key13', 'a'] },
        } },
        { 'action' : 'create', 'table' : 'tb6', 'set' : 'key20', 'input' : {
          'm' : { 'type' : 'binding', 'path' : ['key18', 'm'] },
          'n' : { 'type' : 'binding', 'path' : ['key17', 'm'] },
        } },
        { 'action' : 'create', 'table' : 'tb1', 'set' : 'key21' },
        { 'action' : 'create', 'table' : 'tb2', 'set' : 'key22', 'input' : {
          'w' : { 'type' : 'binding', 'path' : ['key18', 'm'] },
          'x' : { 'type' : 'binding', 'path' : ['key17', 'm'] },
        } },
        { 'action' : 'create', 'table' : 'tb2', 'set' : 'key23' },
        { 'action' : 'create', 'table' : 'tb3', 'set' : 'key24' },
    ]
        
    dependencies = ExtractGraph(cmds)
    levels = GetLevels(dependencies)
    print(levels)
    

    When run, this finds a few levels:

    [
    {'key9', 'key3', 'key24', 'key7', 'key17', 'key8', 'key21', 'key1',
         'key5', 'key2', 'key16', 'key6', 'key23', 'key13', 'key14'}, 
    {'key15', 'key4', 'key10'}, 
    {'key19', 'key18', 'key11'}, 
    {'key22', 'key12', 'key20'}
    ]
    

    For a spot check, let's look at key12. It has 10 and 11 as dependencies. key10 has 6 and 1. key11 has 10 and 1. keys 1 and 6 have none. We find

    Each job is being done as soon as its dependencies are met. So that's encouraging. More thorough testing is a must, however.

    If you need to group these levels further into separate inserts per table, that's a post-processing step. The resulting inserts within a level can be done in parallel.