
How do file converters work in general (e.g. Word to PDF, XML to JSON, Word to TXT)?


I've used many kinds of file converters, such as Word to PDF, XML to JSON, and Word to TXT. How do they work on the backend? Are there specific guidelines that each of them follows? Are there similarities in the way they are implemented?

I tried searching for this, but most of the results lead to web apps that convert documents, and none of them explains how the conversion is actually done.


Solution

  • All of them work by parsing the source document into a data structure, then generating a document in the target format from that data structure, typically by walking it recursively.

    Parsing itself is a giant topic that people take computer science courses on. But long story short, it proceeds by breaking the document into tokens, and then fitting the tokens into a parse tree using one of a standard set of methods. These methods have all sorts of fancy names, like recursive descent and LALR(1). That's where most of the theory you'd want to learn is.

    For example, if you're writing a JSON to XML converter, you'd first need to parse that JSON. A JSON Parser shows how you could write that, from scratch, using recursive descent. Once the parser is written, you just need to write a recursive function that takes each data type and does something appropriate with it to generate text in the format that you want.

    Incidentally, you can also write a "document converter" that converts from a document format to the same document format. Why would someone want to do that? The two most common use cases are to prettify or minify code. Despite the fact that only one format is being dealt with, the principles of how you do it are exactly the same.
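
To make the tokenize, parse, generate pipeline described above concrete, here is a minimal sketch in Python. It is not how any particular converter is implemented. It assumes a simplified JSON subset (no string escapes, no exponents), does little error checking, and all the names in it (`tokenize`, `parse_value`, `parse`, `to_xml`, `to_json`) are made up for illustration. Using object keys as XML tag names and `item` for list elements is also just one possible mapping.

```python
import re
from xml.sax.saxutils import escape

# Simplified JSON subset (assumption: no string escapes, no exponents).
TOKEN_RE = re.compile(
    r'\s*(?:(?P<punct>[{}\[\],:])|(?P<string>"[^"\\]*")'
    r'|(?P<number>-?\d+(?:\.\d+)?)|(?P<word>true|false|null))'
)

def tokenize(text):
    """Step 1: break the input into (kind, text) tokens."""
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens

def parse_value(tokens, i):
    """Step 2: recursive descent. Returns (value, index of the next token).
    Lenient sketch: it does not verify ':' or reject stray commas."""
    kind, text = tokens[i]
    if kind == "string":
        return text[1:-1], i + 1
    if kind == "number":
        return (float(text) if "." in text else int(text)), i + 1
    if kind == "word":
        return {"true": True, "false": False, "null": None}[text], i + 1
    if text == "[":
        items, i = [], i + 1
        while tokens[i][1] != "]":
            item, i = parse_value(tokens, i)       # recurse for each element
            items.append(item)
            if tokens[i][1] == ",":
                i += 1
        return items, i + 1
    if text == "{":
        obj, i = {}, i + 1
        while tokens[i][1] != "}":
            key, i = parse_value(tokens, i)        # the key string
            value, i = parse_value(tokens, i + 1)  # skip ':' and recurse
            obj[key] = value
            if tokens[i][1] == ",":
                i += 1
        return obj, i + 1
    raise ValueError(f"unexpected token {text!r}")

def parse(text):
    """Tokenize, then parse exactly one top-level value."""
    tokens = tokenize(text)
    value, end = parse_value(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the top-level value")
    return value

def to_xml(value, tag="root"):
    """Step 3: walk the tree recursively and emit the target format."""
    if isinstance(value, dict):
        inner = "".join(to_xml(v, k) for k, v in value.items())
    elif isinstance(value, list):
        inner = "".join(to_xml(v, "item") for v in value)
    elif value is None:
        inner = ""
    elif isinstance(value, bool):
        inner = "true" if value else "false"
    else:
        inner = escape(str(value))
    return f"<{tag}>{inner}</{tag}>"

def to_json(value, indent=None, depth=0):
    """Same walk, same format out: indent=None minifies, indent=N prettifies."""
    if isinstance(value, dict):
        colon = ":" if indent is None else ": "
        parts = ['"' + k + '"' + colon + to_json(v, indent, depth + 1)
                 for k, v in value.items()]
        brackets = "{}"
    elif isinstance(value, list):
        parts = [to_json(v, indent, depth + 1) for v in value]
        brackets = "[]"
    elif isinstance(value, str):
        return '"' + value + '"'
    elif value is None or isinstance(value, bool):
        return {None: "null", True: "true", False: "false"}[value]
    else:
        return str(value)
    if indent is None or not parts:
        return brackets[0] + ",".join(parts) + brackets[1]
    pad = "\n" + " " * indent * (depth + 1)
    end = "\n" + " " * indent * depth
    return brackets[0] + pad + ("," + pad).join(parts) + end + brackets[1]
```

With these definitions, `to_xml(parse(text))` is the JSON to XML conversion, while `to_json(parse(text))` and `to_json(parse(text), indent=2)` are the minify and prettify cases. All three reuse the same parser and differ only in the recursive function that generates the output. In practice you would normally reach for an existing, well-tested parser for the source format rather than writing your own.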