Preamble
I am creating my own JSON lexer and eventually a full parser, purely as a learning experience because that is what I enjoy doing. As I understand it, the lexer's job is to tokenize the data (break the stream down into individual tokens) as well as determine what type of data each token is (true, false, string, number, etc.).
I have written a basic lexer that can handle numbers correctly per the specs at json.org. I decided that some rules can and should be validated by the lexer, for example:
- numbers may not have leading zeros ('0' and '0.1' are valid, '01' is not)
- exponents must be integers ('0.1e1.5' is not valid)
- etc.
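To make this concrete, here is a minimal sketch of how a lexer might enforce these rules while producing number tokens. This is Python of my own choosing, and the name 'lex_number' is hypothetical rather than taken from any particular implementation:

```python
import re

# A JSON number per json.org: optional minus, an integer part with no leading
# zeros, an optional fraction, and an optional integer-only exponent.
NUMBER_RE = re.compile(r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?')

def lex_number(text, pos):
    """Return (lexeme, new_pos), or raise to signal an undefined token."""
    match = NUMBER_RE.match(text, pos)
    if not match:
        raise ValueError(f"undefined token at position {pos}")
    end = match.end()
    # Reject inputs like '01' or '0.1e1.5': a digit or '.' immediately after
    # the longest grammatical match means the full run of characters cannot
    # form a valid number token.
    if end < len(text) and (text[end].isdigit() or text[end] == '.'):
        raise ValueError(f"undefined token at position {pos}")
    return match.group(), end
```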
This makes sense to me because the lexer's job is to identify tokens. Despite looking almost correct, invalid JSON data like '01' is not a token, much like ';%*', which is clearly garbage, is not a token.
Also, this way invalid JSON like '{ "age": 0001 }' is found to be invalid at the earliest stage, and processing can abort immediately with an undefined token.
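Continuing the hypothetical 'lex_number' sketch above, the leading-zero number is rejected the moment it is scanned:

```python
>>> lex_number('{ "age": 0001 }', 9)   # position of the first '0'
Traceback (most recent call last):
    ...
ValueError: undefined token at position 9
```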
The question
My question concerns reading string values in the lexer.
My string reading logic handles escaped quotation marks fine (i.e. '"some \"text\""'). This was obvious, because an escaped quotation mark should not end the string token. I am aiming to correctly support all the other escapable characters, including 4-hex-digit Unicode escapes ('\u01ab').
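As an illustration, a string lexer that validates escapes as it scans might look roughly like this. It is a Python sketch of my own; 'lex_string' and the error wording are hypothetical, shown only to frame the question:

```python
HEX_DIGITS = set('0123456789abcdefABCDEF')
SIMPLE_ESCAPES = set('"\\/bfnrt')   # the single-character escapes JSON allows

def lex_string(text, pos):
    """Lex a STRING token starting at the opening quote; return (lexeme, new_pos)."""
    assert text[pos] == '"'
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == '"':                       # unescaped quote ends the token
            return text[pos:i + 1], i + 1
        if ch == '\\':
            esc = text[i + 1] if i + 1 < len(text) else ''
            if esc in SIMPLE_ESCAPES:
                i += 2
            elif esc == 'u':
                hex4 = text[i + 2:i + 6]    # exactly four hex digits required
                if len(hex4) != 4 or any(c not in HEX_DIGITS for c in hex4):
                    raise ValueError(f"undefined token: bad \\u escape at {i}")
                i += 6
            else:
                raise ValueError(f"undefined token: bad escape at {i}")
        else:
            i += 1
    raise ValueError("undefined token: unterminated string")
```

Given '"\u012"', the four-character window after the '\u' is '012"', which fails the hex check, so no STRING token is ever produced.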
However, I am unsure whether identifying and validating escape sequences like these should be the responsibility of the lexer, or whether that should come later. Should the lexer guarantee that a STRING token is completely valid?
Take the invalid JSON '"\u012"' for example. It has only 3 hex digits following the '\u' characters and so is not valid. Validating this string on jsonlint.com gives the following error:
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got 'undefined'
I understand this to be a failure during the lexing process: the parser was expecting one of the listed tokens, but the lexer identified the invalid string and returned an undefined token. By contrast, if I attempt to unescape the same string value on freeformatter.com/json-escape, the error message gives the impression that the lexer yielded a string token and the failure occurred later, while parsing/interpreting the data:
Unable to process your input string. Unable to parse unicode value: 012"
I am inclined to believe jsonlint.com is correct here, and it does make sense that a string containing an invalid escape would not be considered a valid token. Any input would be greatly appreciated.
I thought I had done adequate research before making this post, but it wasn't until I submitted it that I found 'related questions' which already cover the topic. Is it a Lexer's Job to Parse Numbers and Strings? and Where should I draw the line between lexer and parser? sufficiently answer my question.
Apologies for the duplicate.