I am currently developing an application in JavaScript in which users can add "rules" to the application by means of a small custom programming language. To achieve this, I need to be able to parse strings and extract the necessary information. Here are some examples of my language:
Example 1:
SET backgroundColor
MODULE someModule
TO red
WHEN someVariable == 1
Example 2:
SET textSize
TO someObject.size
WHEN 5 + 5 == 10
Example 3:
REMOVE backgroundColor
MODULE someModule
ID 5
Note that although I am making use of newlines in my examples, these rules can also just be formatted to one long string without any newlines.
As you can observe, it is an SQL-like language in which I make use of capitalized keywords. Several combinations of keywords are possible, just like in SQL, but it is definitely not a huge language. After each keyword, the user can write any simple JavaScript expression. This is important. I know one should usually write a parser, but in this case I do not think it is appropriate to reinvent the wheel and write a parser that can parse JavaScript. Because the language is rather simple and limited apart from these JavaScript expressions, a simpler approach to this problem would be ideal.
I have already implemented functions that will take the necessary information as their parameters and add the rule to my system. What is left is to fill the gap. How do I efficiently verify a valid syntax and extract all the required information from the string I receive such that I can fill in a function like this:
addRuleToTheSystem('backgroundColor', 'someModule', 'red', 'someVariable == 1')
First, I'm legally required to warn you about the security dangers of using eval. With that out of the way:
If you can force users to never use your language keywords in their expressions, your parsing becomes very straightforward. If I read the grammar correctly, it can even be parsed by a regular expression. Here's an example of parsing a SET MODULE TO WHEN rule:
const regexp = /^SET (?<set>.+?) MODULE (?<module>.+?) TO (?<to>.+?) WHEN (?<when>.+?)$/s; // no "g" flag, so exec() does not carry lastIndex over to the next rule
const userRule = "SET backgroundColor MODULE someModule TO red WHEN someVariable == 1";
regexp.exec(userRule).groups
// Object { set: "backgroundColor", module: "someModule", to: "red", when: "someVariable == 1" }
This is more restrictive than your intended language (e.g. SET a MODULE MODULE.name TO "WHEN" will parse incorrectly), but your uppercase keywords help against simple mistakes, and you can catch more mistakes by ensuring at most one of each keyword exists.
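Here is a rough sketch of that "at most one of each keyword" check combined with a slightly more forgiving regular expression. The keyword list, the idea that MODULE and WHEN are optional, and the names findDuplicateKeywords and parseSetRule are my own assumptions based on the examples in the question, not something your system defines:

```javascript
// Keyword list guessed from the three examples in the question.
const KEYWORDS = ["SET", "REMOVE", "MODULE", "ID", "TO", "WHEN"];

// Return every keyword that appears more than once as a whole word.
function findDuplicateKeywords(rule) {
  return KEYWORDS.filter(
    (kw) => (rule.match(new RegExp(`\\b${kw}\\b`, "g")) || []).length > 1
  );
}

// \s+ accepts spaces or newlines between clauses.
// MODULE and WHEN are treated as optional here (an assumption).
const setRule =
  /^\s*SET\s+(?<set>.+?)(?:\s+MODULE\s+(?<module>.+?))?\s+TO\s+(?<to>.+?)(?:\s+WHEN\s+(?<when>.+?))?\s*$/s;

// Return the captured clauses, or null if the rule is not a valid SET rule.
function parseSetRule(rule) {
  if (findDuplicateKeywords(rule).length > 0) return null;
  const match = setRule.exec(rule);
  return match ? { ...match.groups } : null;
}
```

A clause that is left out comes back as undefined, so you can decide per rule type which ones are mandatory before calling something like addRuleToTheSystem. A REMOVE rule would need its own pattern along the same lines.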
What syntax can you ignore? Operators, declarations, control flow, etc. What syntax can't you ignore? Anything that might accidentally contain your keyword (names, literal strings, comments, regexps).
This is harder than it sounds once you take into account all the different ways to declare strings, escape sequences, optional whitespace, etc. But it's doable, and doesn't require parsing the full JS syntax.
To reduce false positives, and to give users an escape hatch, consider ignoring keywords that appear inside balanced parentheses/braces/brackets, as in:
SET backgroundColor MODULE (MODULE.name) TO (MODULE.color)
This allows the user to have a JS variable "MODULE" without conflicting with your keyword.
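A minimal sketch of such a scanner, assuming the same keyword list as in the question's examples (splitClauses is a hypothetical name). It only skips quoted strings and bracketed groups; comments, regexp literals and ${} inside template strings are deliberately not handled, so treat it as a starting point rather than a complete solution:

```javascript
const KEYWORDS = new Set(["SET", "REMOVE", "MODULE", "ID", "TO", "WHEN"]);
const OPEN = "([{";
const CLOSE = ")]}";

// Split a rule into [keyword, expression] pairs, ignoring keywords that
// appear inside quoted strings or inside (), [] or {}.
function splitClauses(rule) {
  const clauses = [];
  let depth = 0;     // current bracket nesting level
  let exprStart = 0; // where the current clause's expression begins
  let i = 0;

  // Store the expression text of the clause that is currently open.
  const closeClause = (end) => {
    if (clauses.length > 0) {
      clauses[clauses.length - 1][1] = rule.slice(exprStart, end).trim();
    }
  };

  while (i < rule.length) {
    const ch = rule[i];
    if (ch === '"' || ch === "'" || ch === "`") {
      // Skip to the matching quote, stepping over backslash escapes.
      i++;
      while (i < rule.length && rule[i] !== ch) {
        i += rule[i] === "\\" ? 2 : 1;
      }
      i++;
    } else if (OPEN.includes(ch)) {
      depth++;
      i++;
    } else if (CLOSE.includes(ch)) {
      depth--;
      i++;
    } else if (/[A-Za-z_$]/.test(ch)) {
      // Read the whole word so that "TOTAL" is not mistaken for "TO".
      let j = i;
      while (j < rule.length && /[\w$]/.test(rule[j])) j++;
      const word = rule.slice(i, j);
      if (depth === 0 && KEYWORDS.has(word)) {
        closeClause(i);
        clauses.push([word, ""]);
        exprStart = j;
      }
      i = j;
    } else {
      i++;
    }
  }
  closeClause(rule.length);
  return clauses;
}
```

With this, SET backgroundColor MODULE (MODULE.name) TO "WHEN" WHEN x == 1 splits into four clauses, because the MODULE inside the parentheses and the WHEN inside the string are skipped. You would still need to check that the sequence of keywords is one of the combinations you allow.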
Acorn looks like a good option, and has a handy function:
parseExpressionAt(input, offset, options) will parse a single expression in a string, and return its AST. It will not complain if there is more of the string left after the expression.
By enabling the locations option, you can then alternate between looking for your keywords and handing over the rest of the string to Acorn to find an expression. Then look for a keyword after the last location, and repeat.
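The alternating loop could look roughly like this. It is a sketch, not tested against every Acorn version: parseRule is a hypothetical name, the keyword list is guessed from the question, and the parser function is passed in as an argument so the loop itself has no dependency. The sketch relies only on the start and end character offsets of the returned node:

```javascript
// Sticky ("y") regex: it may only match exactly at lastIndex.
const KEYWORD = /\s*(SET|REMOVE|MODULE|ID|TO|WHEN)\b/y;

// parseExpressionAt is expected to behave like acorn.parseExpressionAt:
// (input, offset, options) => AST node with numeric `start` and `end`.
function parseRule(input, parseExpressionAt) {
  const clauses = {};
  let pos = 0;
  while (input.slice(pos).trim() !== "") {
    // 1. Expect a keyword at the current position.
    KEYWORD.lastIndex = pos;
    const kw = KEYWORD.exec(input);
    if (!kw) throw new SyntaxError(`Expected a keyword at offset ${pos}`);
    if (kw[1] in clauses) throw new SyntaxError(`Duplicate keyword ${kw[1]}`);

    // 2. Hand the rest of the string to the JS parser, which stops
    //    after one complete expression.
    const node = parseExpressionAt(input, KEYWORD.lastIndex, { ecmaVersion: 2020 });
    clauses[kw[1]] = input.slice(node.start, node.end);

    // 3. Continue looking for the next keyword after that expression.
    pos = node.end;
  }
  return clauses;
}
```

With Acorn installed you would call it as parseRule(rule, require("acorn").parseExpressionAt). Because a real JS parser decides where each expression ends, something like SET a MODULE MODULE.name TO "WHEN" should no longer trip over the keywords inside the expressions. Syntax errors thrown by the parser propagate to the caller, which gives you the validation step for free.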