apache-spark hadoop flume asn.1

Binary file conversion in a distributed manner - Spark, Flume or any other option?


We have a scenario with a continuous stream of incoming binary files (ASN.1-encoded, to be exact). We want to convert these files to a different format, say XML or JSON, and write the results to a different location. What would be the best architectural design for this kind of problem? I know we could use a Spark cluster for CSV, JSON, or Parquet files, but I'm not sure whether it can be used for binary file processing. Alternatively, we could use Apache Flume to move files from one place to another and use an interceptor to convert the contents.

Ideally, we could swap the ASN.1 decoder (e.g., for a C++-, Python-, or Java-based decoder library) whenever performance considerations require it, without changing the underlying distributed processing framework.


Solution

  • In terms of scalability, reliability, and future-proofing your solution, I'd look at Apache NiFi rather than Flume. You can start by developing your own ASN.1 Processor, or try the patch that's already available but not yet part of a released version.
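
Whichever framework is chosen, the requirement to swap decoders without touching the pipeline can be met by hiding the decoder behind a small interface. The following is only a minimal sketch of that idea, not NiFi or Spark code: the names `Decoder`, `register_decoder`, `convert_file`, and `placeholder_decoder` are all hypothetical, and the placeholder does not parse ASN.1 at all (it just dumps the bytes as hex) so the example stays runnable without a third-party library.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Assumption: a "decoder" is any callable that turns raw bytes into a
# JSON-serializable dict. A real ASN.1 library (Java, C++ via a wrapper,
# Python) would be adapted to this signature.
Decoder = Callable[[bytes], dict]

_DECODERS: Dict[str, Decoder] = {}


def register_decoder(name: str, decoder: Decoder) -> None:
    """Make a decoder selectable by name (e.g. from configuration)."""
    _DECODERS[name] = decoder


def placeholder_decoder(raw: bytes) -> dict:
    """Hypothetical stand-in; it does NOT decode ASN.1."""
    return {"length": len(raw), "hex": raw.hex()}


register_decoder("placeholder", placeholder_decoder)


def convert_file(src: Path, dst_dir: Path, decoder_name: str) -> Path:
    """Read one binary file, decode it, and write the result as JSON."""
    decoder = _DECODERS[decoder_name]  # swapping decoders changes only this lookup
    record = decoder(src.read_bytes())
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / (src.stem + ".json")
    dst.write_text(json.dumps(record))
    return dst
```

The surrounding framework would only ever call something like `convert_file(path, out_dir, "placeholder")`; replacing the decoder then means registering a different callable under a new name, while the file-moving and scheduling logic stays untouched.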