Say you have a bunch of actions for creating/inserting records into a bunch of different database tables. You have some records which can be inserted without any dependency on the output of any other insert. You have some which need to wait for one other thing to finish. And you have others that need to wait for many things to finish, which might finish at different times in the flow.
How can you write an algorithm which would sort and chunk the actions in the dependency tree so the inserts / database actions can be optimally batched? By optimally batched, I mean if you can insert 10 records into the same table at once, then do that. Any time you can batch insert, you should, to minimize the number of database calls/inserts.
Here is a snippet from the example code I whipped together with a fake sequence of actions using a simple data structure to capture all the required information.
...
{ action: 'create', table: 'tb1', set: 'key12', input: {
  p: { type: 'binding', path: ['key10', 'z'] },
  q: { type: 'binding', path: ['key11', 'a'] }
} },
{ action: 'create', table: 'tb4', set: 'key13' },
{ action: 'create', table: 'tb3', set: 'key14' },
{ action: 'create', table: 'tb4', set: 'key15', input: {
  a: { type: 'binding', path: ['key8', 'z'] },
} },
...
Note that there are 4 possible properties on our "dependency node item":

- action: This is always "create" in our situation, but potentially could be other things in the future.
- table: The table name to insert into.
- set: The name of the variable to add to the shared global "scope" in the action dependency tree, so other actions can read this as input.
- input: Inputs to the action, which in our case are all "binding" inputs (but could just as well be literal values, though that would be too easy). For binding inputs, it reads some property/column value from a record stored in the shared scope of the dependency tree.

Given that, it should be possible somehow to construct a simple algorithm that chunks the actions into subsets which can be run in parallel AND batched. For example, our code below would end up with something like this structure (I manually created this output, so there could be mistakes, though I think I got it right. Oh, and note: while the tables are numbered, that doesn't necessarily imply an order to them; they were just simple names to pick):
// this is "close to" the desired result, because
// there might be a slightly different sort applied
// to this when implemented, but the chunking pattern
// and grouping of elements should be exactly like this
// (I am pretty sure, I manually did this
// so there could be small mistakes, but I doubt it)
const closeToDesiredResult = [
  [
    [
      { action: 'create', table: 'tb1', set: 'key1' },
      { action: 'create', table: 'tb1', set: 'key21' },
    ],
    [
      { action: 'create', table: 'tb2', set: 'key2' },
      { action: 'create', table: 'tb2', set: 'key3' },
      { action: 'create', table: 'tb2', set: 'key23' },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key6' },
      { action: 'create', table: 'tb4', set: 'key8' },
      { action: 'create', table: 'tb4', set: 'key13' },
    ],
    [
      { action: 'create', table: 'tb3', set: 'key5' },
      { action: 'create', table: 'tb3', set: 'key7' },
      { action: 'create', table: 'tb3', set: 'key9' },
      { action: 'create', table: 'tb3', set: 'key14' },
      { action: 'create', table: 'tb3', set: 'key24' },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key17' },
    ],
    [
      { action: 'create', table: 'tb5', set: 'key16' },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb1', set: 'key4', input: {
        x: { type: 'binding', path: ['key2', 'baz'] }
      } },
    ],
    [
      { action: 'create', table: 'tb3', set: 'key10', input: {
        y: { type: 'binding', path: ['key6', 'foo'] },
        z: { type: 'binding', path: ['key1', 'bar'] }
      } },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key15', input: {
        a: { type: 'binding', path: ['key8', 'z'] },
      } },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb1', set: 'key12', input: {
        p: { type: 'binding', path: ['key10', 'z'] },
        q: { type: 'binding', path: ['key11', 'a'] }
      } },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key11', input: {
        a: { type: 'binding', path: ['key10', 'z'] },
        b: { type: 'binding', path: ['key1', 'bar'] }
      } },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key18', input: {
        m: { type: 'binding', path: ['key4', 'x'] },
      } },
      { action: 'create', table: 'tb6', set: 'key19', input: {
        m: { type: 'binding', path: ['key4', 'x'] },
        n: { type: 'binding', path: ['key13', 'a'] },
      } },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb2', set: 'key22', input: {
        w: { type: 'binding', path: ['key18', 'm'] },
        x: { type: 'binding', path: ['key17', 'm'] },
      } },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key20', input: {
        m: { type: 'binding', path: ['key18', 'm'] },
        n: { type: 'binding', path: ['key17', 'm'] },
      } },
    ]
  ]
]
Notice how there are 4 top-level chunks in the resulting array. These are the main steps. Then within each step, everything is grouped by table, so they can all be run in parallel, and within each table group, they can all be batch inserted. Boom.
How would you implement this? It seems quite tricky for my mind to grasp.
const actionTree = generateActionTree()
const chunkedActionTree = chunkDependencyTree(actionTree)

function chunkDependencyTree(list) {
  const independentOnesMapByTableName = {}
  list.forEach(node => {
    // easy case
    if (!node.input) {
      const group = independentOnesMapByTableName[node.table] =
        independentOnesMapByTableName[node.table] ?? []
      group.push(node)
    } else {
      // I am at a loss for words...
    }
  })
}
function generateActionTree() {
  // this would be constructed through a bunch of real-world
  // functions, queuing up all the actions
  // and pointing outputs to inputs.
  return [
    { action: 'create', table: 'tb1', set: 'key1' },
    { action: 'create', table: 'tb2', set: 'key2' },
    { action: 'create', table: 'tb2', set: 'key3' },
    { action: 'create', table: 'tb3', set: 'key5' },
    { action: 'create', table: 'tb4', set: 'key6' },
    { action: 'create', table: 'tb3', set: 'key7' },
    { action: 'create', table: 'tb4', set: 'key8' },
    { action: 'create', table: 'tb3', set: 'key9' },
    { action: 'create', table: 'tb3', set: 'key10', input: {
      y: { type: 'binding', path: ['key6', 'foo'] },
      z: { type: 'binding', path: ['key1', 'bar'] }
    } },
    { action: 'create', table: 'tb1', set: 'key4', input: {
      x: { type: 'binding', path: ['key2', 'baz'] }
    } },
    { action: 'create', table: 'tb4', set: 'key11', input: {
      a: { type: 'binding', path: ['key10', 'z'] },
      b: { type: 'binding', path: ['key1', 'bar'] }
    } },
    { action: 'create', table: 'tb1', set: 'key12', input: {
      p: { type: 'binding', path: ['key10', 'z'] },
      q: { type: 'binding', path: ['key11', 'a'] }
    } },
    { action: 'create', table: 'tb4', set: 'key13' },
    { action: 'create', table: 'tb3', set: 'key14' },
    { action: 'create', table: 'tb4', set: 'key15', input: {
      a: { type: 'binding', path: ['key8', 'z'] },
    } },
    { action: 'create', table: 'tb5', set: 'key16' },
    { action: 'create', table: 'tb6', set: 'key17' },
    { action: 'create', table: 'tb6', set: 'key18', input: {
      m: { type: 'binding', path: ['key4', 'x'] },
    } },
    { action: 'create', table: 'tb6', set: 'key19', input: {
      m: { type: 'binding', path: ['key4', 'x'] },
      n: { type: 'binding', path: ['key13', 'a'] },
    } },
    { action: 'create', table: 'tb6', set: 'key20', input: {
      m: { type: 'binding', path: ['key18', 'm'] },
      n: { type: 'binding', path: ['key17', 'm'] },
    } },
    { action: 'create', table: 'tb1', set: 'key21' },
    { action: 'create', table: 'tb2', set: 'key22', input: {
      w: { type: 'binding', path: ['key18', 'm'] },
      x: { type: 'binding', path: ['key17', 'm'] },
    } },
    { action: 'create', table: 'tb2', set: 'key23' },
    { action: 'create', table: 'tb3', set: 'key24' },
  ]
}
I think this is roughly topological sorting, but I'm not quite sure how to apply it to this specific situation.
Your data structure isn't clear to me. What are the single-letter ids p, q, etc.? I also don't understand the role of tables. You can insert multiple items into the same table in one write, no? I'm assuming these things don't matter in the root sequencing problem.
I'm treating the set field as a "job" and the corresponding keys mentioned in its input bindings as dependencies: jobs that must be completed before it.
I don't have time to be thorough here, and I don't have a javascript environment handy, so this is in Python.
Let's extract a dependency graph from the verbose data structure, then look for "levels." The first level is all nodes with no dependencies. The second is all nodes whose dependencies are met by any previous level, and so on. Rinse and repeat.
Note that, unlike what I said in the comments, this is not a level graph by the traditional definition.
Also, I'm not going to bother with data structures to make this efficient. You could do it in O(n log n) time. My code is O(n^2).
Sorry if I'm misinterpreting your question. Also sorry for untested, possibly buggy implementation here.
from collections import defaultdict

def ExtractGraph(cmds):
    """Gets a dependency graph from verbose command input data."""
    graph = defaultdict(set)
    for cmd in cmds:
        node = cmd['set']
        graph[node]  # touch the key so jobs with no dependencies still appear
        inputs = cmd.get('input')
        if inputs:
            for val in inputs.values():
                graph[node].add(val['path'][0])
    return graph

def FindSources(graph):
    """Returns the nodes of the given graph having no dependencies."""
    sources = set()
    for node, edges in graph.items():
        if not edges:
            sources.add(node)
    return sources

def GetLevels(dependencies):
    """Returns sequence levels satisfying the given dependency graph."""
    sources = FindSources(dependencies)
    level = set(sources)
    done = set(level)
    todos = dependencies.keys() - done
    levels = []
    while level:
        levels.append(level)
        # Next level is jobs that have all dependencies done.
        new_level = set()
        # A clever data structure could find the next level in O(k log n)
        # for a level size of k and n jobs. This needs O(n).
        for todo in todos:
            if dependencies[todo].issubset(done):
                new_level.add(todo)
        todos.difference_update(new_level)
        done.update(new_level)
        level = new_level
    return levels
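As an aside on the complexity comment above: the per-level rescan of all remaining jobs can be replaced by counting unmet dependencies over reverse edges, Kahn-style, which brings the whole thing down to roughly O(nodes + edges). A rough sketch, not integrated with the code above (the function name is mine):

```python
from collections import defaultdict

def get_levels_fast(dependencies):
    """Same output as GetLevels, but counts unmet dependencies instead of
    rescanning every remaining job on each pass."""
    dependents = defaultdict(set)  # reverse edges: dep -> jobs waiting on it
    missing = {}                   # job -> number of not-yet-done dependencies
    for job, deps in dependencies.items():
        missing[job] = len(deps)
        for dep in deps:
            dependents[dep].add(job)
    level = {job for job, n in missing.items() if n == 0}
    levels = []
    while level:
        levels.append(level)
        new_level = set()
        for finished in level:
            for waiter in dependents[finished]:
                missing[waiter] -= 1
                if missing[waiter] == 0:
                    new_level.add(waiter)
        level = new_level
    return levels

# key3 waits on key1; key1 and key2 are free, so they form the first level
# and key3 the second.
levels = get_levels_fast({'key1': set(), 'key2': set(), 'key3': {'key1'}})
```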
cmds = [
    { 'action': 'create', 'table': 'tb1', 'set': 'key1' },
    { 'action': 'create', 'table': 'tb2', 'set': 'key2' },
    { 'action': 'create', 'table': 'tb2', 'set': 'key3' },
    { 'action': 'create', 'table': 'tb3', 'set': 'key5' },
    { 'action': 'create', 'table': 'tb4', 'set': 'key6' },
    { 'action': 'create', 'table': 'tb3', 'set': 'key7' },
    { 'action': 'create', 'table': 'tb4', 'set': 'key8' },
    { 'action': 'create', 'table': 'tb3', 'set': 'key9' },
    { 'action': 'create', 'table': 'tb3', 'set': 'key10', 'input': {
        'y': { 'type': 'binding', 'path': ['key6', 'foo'] },
        'z': { 'type': 'binding', 'path': ['key1', 'bar'] }
    } },
    { 'action': 'create', 'table': 'tb1', 'set': 'key4', 'input': {
        'x': { 'type': 'binding', 'path': ['key2', 'baz'] }
    } },
    { 'action': 'create', 'table': 'tb4', 'set': 'key11', 'input': {
        'a': { 'type': 'binding', 'path': ['key10', 'z'] },
        'b': { 'type': 'binding', 'path': ['key1', 'bar'] }
    } },
    { 'action': 'create', 'table': 'tb1', 'set': 'key12', 'input': {
        'p': { 'type': 'binding', 'path': ['key10', 'z'] },
        'q': { 'type': 'binding', 'path': ['key11', 'a'] }
    } },
    { 'action': 'create', 'table': 'tb4', 'set': 'key13' },
    { 'action': 'create', 'table': 'tb3', 'set': 'key14' },
    { 'action': 'create', 'table': 'tb4', 'set': 'key15', 'input': {
        'a': { 'type': 'binding', 'path': ['key8', 'z'] },
    } },
    { 'action': 'create', 'table': 'tb5', 'set': 'key16' },
    { 'action': 'create', 'table': 'tb6', 'set': 'key17' },
    { 'action': 'create', 'table': 'tb6', 'set': 'key18', 'input': {
        'm': { 'type': 'binding', 'path': ['key4', 'x'] },
    } },
    { 'action': 'create', 'table': 'tb6', 'set': 'key19', 'input': {
        'm': { 'type': 'binding', 'path': ['key4', 'x'] },
        'n': { 'type': 'binding', 'path': ['key13', 'a'] },
    } },
    { 'action': 'create', 'table': 'tb6', 'set': 'key20', 'input': {
        'm': { 'type': 'binding', 'path': ['key18', 'm'] },
        'n': { 'type': 'binding', 'path': ['key17', 'm'] },
    } },
    { 'action': 'create', 'table': 'tb1', 'set': 'key21' },
    { 'action': 'create', 'table': 'tb2', 'set': 'key22', 'input': {
        'w': { 'type': 'binding', 'path': ['key18', 'm'] },
        'x': { 'type': 'binding', 'path': ['key17', 'm'] },
    } },
    { 'action': 'create', 'table': 'tb2', 'set': 'key23' },
    { 'action': 'create', 'table': 'tb3', 'set': 'key24' },
]
dependencies = ExtractGraph(cmds)
levels = GetLevels(dependencies)
print(levels)
When run, this finds a few levels:
[
  {'key9', 'key3', 'key24', 'key7', 'key17', 'key8', 'key21', 'key1',
   'key5', 'key2', 'key16', 'key6', 'key23', 'key13', 'key14'},
  {'key15', 'key4', 'key10'},
  {'key19', 'key18', 'key11'},
  {'key22', 'key12', 'key20'}
]
For a spot check, let's look at key12. It has key10 and key11 as dependencies. key10 depends on key6 and key1. key11 depends on key10 and key1. key1 and key6 have no dependencies. We find key1 and key6 in the first level, key10 in the second, key11 in the third, and key12 in the fourth.
Each job is being done as soon as its dependencies are met. So that's encouraging. More thorough testing is a must, however.
If you need to group these levels further into separate inserts per table, that's a post-processing step. The resulting inserts within a level can be done in parallel.
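That grouping step could look something like this (a hedged sketch; `group_levels_by_table` and the toy input are my own names, not from the code above):

```python
from collections import defaultdict

def group_levels_by_table(levels, cmds):
    """Split each level into per-table batches; every batch can be one
    batch insert, since all its jobs have their dependencies met."""
    table_of = {cmd['set']: cmd['table'] for cmd in cmds}
    chunked = []
    for level in levels:
        batches = defaultdict(list)
        for job in sorted(level):  # sorted only to make output deterministic
            batches[table_of[job]].append(job)
        chunked.append(dict(batches))
    return chunked

# Toy input: key3 depends on key1, so it lands in the second chunk.
cmds = [
    {'action': 'create', 'table': 'tb1', 'set': 'key1'},
    {'action': 'create', 'table': 'tb1', 'set': 'key2'},
    {'action': 'create', 'table': 'tb2', 'set': 'key3',
     'input': {'x': {'type': 'binding', 'path': ['key1', 'col']}}},
]
chunks = group_levels_by_table([{'key1', 'key2'}, {'key3'}], cmds)
# chunks[0] batches key1 and key2 together for a single tb1 insert
```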