MLCP can be used to export and import documents between the file system and a MarkLogic database.
However, it is inefficient to import and export everything; ideally only the delta changes would be synced. How can that be done?
The first question is how to detect a delta change (new, modified, or deleted) on both the MarkLogic side and the file system side.
On the file system I could use a file checksum (if timestamps are not reliable enough) to detect changes. But how do I do that with the MarkLogic database? Do I need to maintain some metadata property like [last change time] or [checksum] to detect changes there?
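For the filesystem side, something like the following rough sketch is what I have in mind (purely illustrative: it assumes SHA-256 checksums and a manifest saved from the previous run, and the paths are made up):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Rough sketch: build a checksum manifest of a directory tree and diff it against the
// manifest from the previous run to classify files as new, modified, or deleted.
public class FileDeltaDetector {

    // Map of relative path -> SHA-256 hex checksum for every regular file under root.
    static Map<String, String> buildManifest(Path root) throws Exception {
        Map<String, String> manifest = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path file : files) {
                manifest.put(root.relativize(file).toString(), sha256(file));
            }
        }
        return manifest;
    }

    static String sha256(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(Files.readAllBytes(file))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Compare the manifest from the previous run with the current one.
    static void diff(Map<String, String> previous, Map<String, String> current) {
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String oldChecksum = previous.get(entry.getKey());
            if (oldChecksum == null) {
                System.out.println("NEW      " + entry.getKey());
            } else if (!oldChecksum.equals(entry.getValue())) {
                System.out.println("MODIFIED " + entry.getKey());
            }
        }
        for (String path : previous.keySet()) {
            if (!current.containsKey(path)) {
                System.out.println("DELETED  " + path);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // The directory and the previous manifest are placeholders; in practice the
        // previous manifest would be loaded from wherever the last run persisted it.
        Map<String, String> current = buildManifest(Paths.get("/data/export"));
        Map<String, String> previous = Map.of();
        diff(previous, current);
    }
}
```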
The next question is how to orchestrate the sync. Should I explore Apache NiFi? (The assumption is a bi-directional sync running about once a day; real-time sync is not required.)
What tools should I use? I believe I need MLCP, as it is more efficient at handling bulk document import and export.
You might not realize it, but you are describing an event-driven architecture, and perhaps more specifically event sourcing.
Essentially, you can shift your thinking to see each change on the filesystem or to a MarkLogic document as an "event". You need a way to capture when these events occur, a log that maintains the events in the order they occurred, and finally something to process and react to the events that have already occurred.
In this case you need something to capture changes to MarkLogic documents (events), something to store those events, and something to modify the filesystem to apply them. You also need something to capture filesystem changes (events), something to store a record of those events, and something to modify MarkLogic to apply those captured events.
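To make that concrete, here is a minimal sketch of what a captured change event could carry, whichever side it came from (field names are purely illustrative; the essentials are the document identifier, the kind of change, which system it originated in, and an ordering key such as a timestamp):

```java
import java.time.Instant;

// Illustrative change event (Java 16+ record). One of these would be captured for every
// create, update, or delete, on either the filesystem or MarkLogic.
public record ChangeEvent(
        String documentUri,   // URI in MarkLogic, or relative path on the filesystem
        ChangeType type,      // what happened to the document
        Source origin,        // which system the change originated in
        Instant occurredAt,   // used to keep events in the order they occurred
        String checksum) {    // optional content checksum, handy for collision detection

    public enum ChangeType { CREATE, UPDATE, DELETE }
    public enum Source { FILESYSTEM, MARKLOGIC }
}
```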
You will need to plan how to handle the case where the filesystem and MarkLogic make different changes to the same document(s) at the same time. You also need to ensure that changing data in one system to sync with the other doesn't create an infinite loop of a change bouncing back and forth between the two systems.
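A common way to handle both problems is to tag every event with the system it originated in: the consumer that writes into a given system skips events that originated there (which kills the bounce-back loop), and when both sides changed the same document in the same window you fall back to an explicit policy such as last-writer-wins or flagging for manual review. A hedged sketch of that idea, reusing the illustrative ChangeEvent above:

```java
// Sketch of loop prevention plus a naive collision policy. Assumes the illustrative
// ChangeEvent record from the previous sketch.
public class SyncProcessor {

    private final ChangeEvent.Source targetSystem; // the system this processor writes into

    public SyncProcessor(ChangeEvent.Source targetSystem) {
        this.targetSystem = targetSystem;
    }

    // Skip events that originated in the system we are about to write to, so a change
    // applied during sync does not come straight back as a new event.
    public boolean shouldApply(ChangeEvent event) {
        return event.origin() != targetSystem;
    }

    // Naive last-writer-wins when both sides changed the same document in the same
    // sync window; a real system might instead flag these for manual review.
    public ChangeEvent resolveCollision(ChangeEvent fromFilesystem, ChangeEvent fromMarkLogic) {
        return fromFilesystem.occurredAt().isAfter(fromMarkLogic.occurredAt())
                ? fromFilesystem
                : fromMarkLogic;
    }
}
```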
On its own, the most MarkLogic can do is capture a log of changes, using something like a trigger, a transform, or an API call that reacts when documents are modified and writes a document recording the change. But you'll be on the hook to figure out the other components you need.
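For illustration only, this is roughly what the "write a document capturing the change" step could look like if done from a client-side process with the MarkLogic Java Client API; a server-side trigger or transform would do the equivalent in XQuery or Server-Side JavaScript. Host, credentials, collection name, and the event payload are all placeholders:

```java
import java.util.UUID;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

// Sketch: record a change event as a small JSON document in a "change-log" collection,
// which another process can later read and replay against the filesystem.
public class ChangeLogWriter {

    public static void main(String[] args) {
        // Connection details are placeholders.
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
                new DatabaseClientFactory.DigestAuthContext("admin", "admin"));

        JSONDocumentManager docMgr = client.newJSONDocumentManager();

        // The change record: which document changed, how, where, and when.
        String event = "{ \"uri\": \"/docs/example.json\", \"type\": \"UPDATE\", "
                + "\"origin\": \"MARKLOGIC\", \"occurredAt\": \"2024-01-01T00:00:00Z\" }";

        DocumentMetadataHandle metadata = new DocumentMetadataHandle()
                .withCollections("change-log");

        docMgr.write("/change-log/" + UUID.randomUUID() + ".json",
                metadata,
                new StringHandle(event).withFormat(Format.JSON));

        client.release();
    }
}
```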
NiFi could get you further, but NiFi leans more toward batch processing than real-time, which could cause collisions of changes in your scenario, and you're at the mercy of whether its plugins do exactly what you need, as they are typically not very customizable. NiFi will also have a hard time maintaining the order of changes that occurred independently. NiFi is also not very scalable in my experience with it, but YMMV.
I would instead recommend that you look at adopting a message queue for your use case; in particular, I'd suggest taking a look at Kafka. A message queue would allow you to capture events in a central location and process them immediately (reducing the likelihood of change collisions) or at your convenience. The Kafka ecosystem makes it one of the most mature message queues as well as one of the simplest to integrate: it already has connectors to capture filesystem changes and the ability to create data pipelines.
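If you do go the Kafka route, the producer side of such a pipeline is small. A hedged sketch using the standard kafka-clients library (broker address, topic name, and the JSON payload are illustrative); keying by document URI keeps all changes to the same document in one partition, which preserves their order:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: publish one change event to a Kafka topic, keyed by document URI so that all
// changes to the same document land in the same partition and keep their relative order.
public class ChangeEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        String docUri = "/docs/example.json";                    // illustrative key
        String eventJson = "{ \"uri\": \"/docs/example.json\", \"type\": \"UPDATE\", "
                + "\"origin\": \"FILESYSTEM\" }";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("document-changes", docUri, eventJson));
            producer.flush();
        }
    }
}
```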