I've been asked to design a batch application that would retrieve data (specifically, a detailed list of transactions) from an external vendor on a periodic basis. We have agreed to use XML for the data exchange, but we are investigating different methods/protocols to facilitate the actual data transfer. The vendor suggested email or FTP as a means to transfer the data, but we rejected the first option outright due to logistics and reliability concerns.
As for the second, FTP, I have always been hesitant to use it in a production environment where reliability is a concern. A design whereby a vendor publishes files to an FTP server to be periodically pulled down seems unreliable and error-prone. My initial reaction would be to gravitate towards something like a web service (which this particular vendor may or may not be able or willing to provide), where the data could be queried, as needed, for a specific time period.
In general, what is the best approach to use in a situation such as this? Is FTP (or SFTP) generally considered to be an acceptable option, or is there something better? Is a web service overkill for such a simple exchange of data? Are there other viable options that I am completely overlooking?
File transfer presents a number of complications.
I would prefer a web service, or just HTTPS access to the file with digest/basic authentication, but for very large files, that may not be practical for them.
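As a rough illustration of the HTTPS option, here is a minimal Python sketch that builds an authenticated request for one date range. The URL, the `from`/`to` query parameters, and the credentials are all hypothetical; the vendor's actual interface would dictate them, and digest authentication would need a different handler.

```python
import base64
import urllib.request
from datetime import date
from urllib.parse import urlencode


def build_request(base_url: str, start: date, end: date,
                  user: str, password: str) -> urllib.request.Request:
    """Build an HTTPS GET request for a date range with HTTP Basic auth.

    The "from" and "to" parameter names are assumptions for illustration.
    """
    query = urlencode({"from": start.isoformat(), "to": end.isoformat()})
    # Basic auth is just base64("user:password"); only send it over HTTPS.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}?{query}",
        headers={"Authorization": f"Basic {token}"},
    )
```

The request could then be sent with `urllib.request.urlopen(req, timeout=60)` and the response body written to disk before parsing, so a failed parse does not require a second download.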
Another answer could be to use a shared bucket on Amazon S3, where you have read access, and they have write access. I have used that a couple of times as a poor man's secure file transfer.
I have used flavors of FTP in this way, and here are some tips if you do:
Use a secure version like SFTP - FTP is just not secure for the credentials or data.
Use a semaphore file to indicate when the latest file is complete and available, or make sure that when they write the file to the FTP directory, they move it in whole, so you do not access incomplete files.
Make sure each file has a unique file name (timestamp, sequence number, etc.) so you can keep track of which files you have processed and which you haven't. Do not reuse a file name, as you would not know which version you have already processed, and you could hit a race condition if the file is updated while you are accessing it.
Use a hash value to check for successful transfer. They could provide an MD5 hash for the file, and then you could check this against your version once you have completed copying it. I have often used the MD5 file as a semaphore as well, to both indicate a file is available, and provide a means to check the transfer was complete and correct.
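The last three tips can be sketched together in Python. This is only an illustration under assumptions: the files have already been copied into a local directory, each data file `name.xml` has a sidecar `name.xml.md5` whose first token is the hex digest, and `processed` is a set of file names you persist yourself. The naming scheme and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ready_files(directory: Path, processed: set[str]) -> list[Path]:
    """Return unprocessed data files whose .md5 sidecar exists and matches."""
    ready = []
    for sidecar in sorted(directory.glob("*.xml.md5")):
        data_file = sidecar.with_suffix("")  # "tx.xml.md5" -> "tx.xml"
        if data_file.name in processed or not data_file.exists():
            continue
        parts = sidecar.read_text().split()
        if not parts:
            continue  # empty sidecar: treat the file as not yet available
        # A mismatch means the copy is incomplete or corrupt; skip it for now.
        if md5_of(data_file) == parts[0].lower():
            ready.append(data_file)
    return ready
```

A file with no sidecar is ignored entirely, so the MD5 file doubles as the semaphore. MD5 is adequate here for detecting transfer errors, though it is not a defense against deliberate tampering.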