parsing dom schema context-free-grammar concrete-syntax-tree

What is the difference between parsing into a DOM tree and parsing into a syntax tree?


After parsing an HTML or XML file, we get a DOM tree.

After parsing a C, C++, or JavaScript program, we get a syntax tree.

Note that the syntax tree is constructed according to the context-free grammar that defines a valid C/C++/JS program.

The DOM tree, however, seems to be a pure hierarchical structure determined only by the HTML/XML file itself. Is that true? Is that why schema validation is done after parsing? What is the fundamental difference between these two kinds of parse trees?


Solution

  • Like any other language, XML is described by a grammar. XML's grammar is rather simple (start-tags, end-tags, correct nesting), so the resulting syntax tree looks simple as well (just a hierarchy of elements). An XML schema is a second grammar, one that describes the content of a particular kind of XML file.

    So basically two parsers are invoked one after the other. The first one verifies that the file is well-formed: every start-tag has a matching end-tag and the nesting is correct.

    The second parser verifies that the XML file's content is structured according to the schema (grammar), for example that an element named "B" can only be contained within an element named "A".

    This shouldn't be compared directly to parsing a programming language like C, because you cannot change a programming language's syntax. In C, if-statements can only appear within function bodies, not outside them, and you cannot change that. In XML, however, you can specify that "B"-elements can only appear within "A"-elements, or that "A"-elements can only appear within "B"-elements, simply by describing the grammar of your XML file's content in the form of a schema.
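
The two stages described above can be sketched roughly as follows. This is a minimal illustration, not how a real validator works: the first stage uses Python's standard `xml.etree.ElementTree` to check well-formedness, while the second stage is a hand-written stand-in for schema validation. The helper names (`parse_well_formed`, `conforms`) and the dictionary-based "schemas" are assumptions made up for this example; a real XML Schema or DTD is far more expressive, and here the root element itself is left unconstrained.

```python
import xml.etree.ElementTree as ET


def parse_well_formed(text):
    """Stage 1: XML's own grammar. Return the root element, or None
    if the text is not well-formed (e.g. mismatched tags)."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None


def conforms(element, allowed_children):
    """Stage 2: a toy 'schema' check (hypothetical format).
    allowed_children maps a tag name to the set of tags it may contain."""
    allowed = allowed_children.get(element.tag, set())
    for child in element:
        if child.tag not in allowed or not conforms(child, allowed_children):
            return False
    return True


# Two hypothetical schemas for the same XML syntax:
B_INSIDE_A = {"A": {"B"}, "B": set()}  # "B" may only appear within "A"
A_INSIDE_B = {"B": {"A"}, "A": set()}  # "A" may only appear within "B"

root = parse_well_formed("<A><B/><B/></A>")

print(root is not None)                         # True: well-formed
print(conforms(root, B_INSIDE_A))               # True: valid under the first schema
print(conforms(root, A_INSIDE_B))               # False: invalid under the second
print(parse_well_formed("<A><B></A>") is None)  # True: fails stage 1 already
```

The same document passes the first stage in both cases and only its validity changes, because only the second grammar was swapped. That is the part you cannot do with a C compiler.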