I am working on a Mac, using bash commands in the terminal.
I am running a DNA sequencer that generates ~3-5 million files over the course of 48 hours. For speed reasons these files are saved to the computer's SSD. I would like to use fswatch and rsync commands to monitor the directory and transfer these files to a server as they are being generated to reduce the long transfer times post sequencing.
Here is the command I have come up with.
fswatch -o ./ | (while read; do rsync -r -t /Source/Directory /Destination/Directory; done)
But I am worried that, given the large number of files (> 3 million) and the large total size (> 100 GB), these tools might struggle to keep up. Is there a better strategy?
Thanks for your help!
Your command might work, but it has some performance issues that I would want to avoid.
"fswatch -o" prints one line per batch of change events, and your loop starts one "rsync" run for each of those lines. Every run has to walk the whole source tree and compare it with the destination, so each run takes longer and longer as the number of files grows.
48 hours is a lot of time, and copying the files (~100 GB) wouldn't take that long anyway (disk to disk is very fast, and a gigabit network is also very fast).
Instead I would propose running rsync -a --delete /source /destination
at regular intervals (e.g. every 30 minutes) during the generation process and once more at the end, to be sure nothing is missed. A short script could contain:
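#!/bin/bash
# Sync every 30 minutes while the generating process is alive, then once more at the end.
# pgrep -f matches against the full command line; unlike "ps -ef | grep", it never matches itself.
while pgrep -f "process that generates files" > /dev/null; do
    echo "Running rsync..."
    rsync -a --delete /source /destination
    echo "...waiting 30 minutes"
    sleep 1800 # seconds
done
# One last pass to pick up anything written after the previous run started.
echo "Running final rsync..."
rsync -a --delete /source /destination
echo "...done."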
...just replace "process that generates files" with the name of the generating process as it appears in the "ps -ef" output while it is running. Adjust the interval as you see fit; I assumed that roughly 1-2 GB of data is created in 30 minutes (100 GB spread over 48 hours is about 1 GB per half hour), which can be copied in a couple of minutes.
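As a rough check on the interval, the figures from the question can be turned into average rates. The sketch below assumes exactly 100 GB and 4 million files (the midpoint of the 3-5 million quoted), spread evenly over the 48 hours. The variable names are only illustrative, and a real run is likely to be burstier than this average.

```bash
#!/bin/bash
# Back-of-the-envelope rates; the inputs are assumptions taken from the question.
total_gb=100          # the question says "> 100 GB", so treat this as a lower bound
total_files=4000000   # assumed midpoint of the 3-5 million files quoted
hours=48
interval=1800         # seconds between rsync runs (30 minutes)

seconds=$(( hours * 3600 ))
files_per_sec=$(( total_files / seconds ))                  # integer division
files_per_interval=$(( total_files * interval / seconds ))  # integer division
gb_per_interval=$(awk -v gb="$total_gb" -v i="$interval" -v s="$seconds" \
    'BEGIN { printf "%.1f", gb * i / s }')

echo "about $files_per_sec files/s on average"
echo "about $files_per_interval new files and $gb_per_interval GB per interval"
```

Under these assumptions each 30-minute pass has tens of thousands of new small files to send, so the per-file overhead is likely to matter more than the raw bandwidth when choosing the interval.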
The script ensures that "rsync" doesn't run more often than it should, so it spends its time copying files instead of comparing the source and destination too often.
The option "-a" (archive) implies the options you already use and more (it is equivalent to -rlptgoD). The "--delete" option removes any file that exists under "/destination" but no longer exists under "/source" (handy for temporary files that were copied but are not needed in the final structure).