I know that, in the GF shell, I can use the command put_string along with the lexer lextext to convert a string into a list of tokens (which can then be parsed). This lexer “understands” punctuation and, crucially, it also converts what it thinks is sentence-initial capitalization into lower case:
> put_string -lextext "This is my house."
this is my house .
However, if my grammar contains tokens with initial capitals (which is not at all unusual), and if such a token happens to appear at the beginning of a sentence, the lexer lower-cases it, making the token list unparsable by my grammar:
> put_string -lextext "France is my home."
france is my home .
My question: Is there any way to make this or any other lexer aware in more detail of which tokens it should and shouldn’t lower-case? Perhaps by making the lexer “see” the grammar for which it is lexing?
Or are lexers not in vogue any more, and am I expected to handle these things (sentence-initial capitalization, punctuation, binding tokens) on my own, completely outside GF? How do people usually handle these things in their GF applications?
There is no such functionality in GF, and afaik no plans have been discussed to include it. Usually people use one or more of the following strategies:
Get the list of words in your grammar with pg -words and use that to make your own capitalisation rules (basically implementing the feature that you asked for, but separately for each grammar!).
You can also make your grammar use the CAPIT and ALL_CAPIT tokens described in Angelov (2015); that could solve some problems (and maybe create new ones).
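The pg -words strategy could be sketched roughly as follows in Python. This is only an illustration, not something GF provides: the function names parse_word_list and lex_sentence are made up here, and it assumes the pg -words output is a whitespace-separated list of words that you have captured as a string. The idea is to lower-case a sentence-initial token only when the grammar does not contain it in its capitalised form.

```python
import re

SENTENCE_END = {".", "!", "?"}


def parse_word_list(pg_words_output):
    """Turn the (assumed whitespace-separated) output of `pg -words` into a set."""
    return set(pg_words_output.split())


def lex_sentence(text, grammar_words):
    """Split off punctuation and lower-case sentence-initial tokens,
    except those the grammar knows in capitalised form."""
    # Words become one token each; every punctuation character is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    result = []
    sentence_start = True
    for tok in tokens:
        if sentence_start and tok[0].isupper() and tok not in grammar_words:
            # Not a known capitalised token, so treat the capital as
            # sentence-initial capitalisation and undo it.
            tok = tok[0].lower() + tok[1:]
        result.append(tok)
        sentence_start = tok in SENTENCE_END
    return " ".join(result)


# Hypothetical word list, standing in for real `pg -words` output.
words = parse_word_list("France this is my home house .")
print(lex_sentence("This is my house.", words))
print(lex_sentence("France is my home.", words))
```

With this word list, "France is my home." keeps its capital while "This is my house." is still lower-cased. A token that the grammar contains in both forms stays capitalised in this sketch, so a real grammar will probably need extra rules for that case.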