gf

Sentence-initial capitalization, lexing and GF


I know that, in the GF shell, I can use the command put_string along with the lexer lextext to convert a string into a list of tokens (which can then be parsed). This lexer “understands” punctuation and, crucially, it also converts what it thinks is sentence-initial capitalization into lower case:

> put_string -lextext "This is my house."
this is my house .

However, if my grammar contains tokens with initial capitals (which is not at all unusual), and if such a token happens to appear at the beginning of a sentence, the lexer lower-cases it, making the token list unparsable by my grammar:

> put_string -lextext "France is my home."
france is my home .

My question: Is there any way to make this or any other lexer aware in more detail of which tokens it should and shouldn’t lower-case? Perhaps by making the lexer “see” the grammar for which it is lexing?

Or are lexers not in vogue any more, and am I expected to handle these things (sentence-initial capitalization, punctuation, binding tokens) on my own, completely outside GF? How do people handle these things in their GF applications usually?


Solution

  • There is no such functionality in GF, and afaik no plans have been discussed to include it. Usually people use one or more of the following strategies:

    You can also make your grammar use the CAPIT and ALL_CAPIT tokens described in Angelov (2015), which could solve some problems (and maybe create new ones).
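
    If you end up handling this outside GF, one option is a small pre-lexer that consults the grammar's vocabulary before lower-casing anything. Below is a minimal Python sketch, not a GF feature: GRAMMAR_TOKENS is a hypothetical, hand-written stand-in for however you export your grammar's token list, and lex is just an illustrative name. It lower-cases a sentence-initial word only when the capitalized form is not a token the grammar knows.

```python
import re

# ASSUMPTION: a hand-written stand-in for the set of tokens in your grammar.
# In a real application you would export this from the grammar yourself.
GRAMMAR_TOKENS = {"France", "is", "my", "home", "this", "house", "."}

# Words, or single punctuation characters, as separate tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def lex(text, vocabulary):
    """Tokenize text, lower-casing a sentence-initial word only if the
    word as written is not itself a token in the vocabulary."""
    tokens = []
    sentence_initial = True
    for tok in TOKEN_RE.findall(text):
        if sentence_initial and tok not in vocabulary:
            # Unknown as written: assume the capital is only there
            # because the word starts the sentence.
            tok = tok[0].lower() + tok[1:]
        tokens.append(tok)
        # The next token starts a sentence if this one ended one.
        sentence_initial = tok in SENTENCE_END
    return tokens


print(" ".join(lex("France is my home.", GRAMMAR_TOKENS)))
print(" ".join(lex("This is my house.", GRAMMAR_TOKENS)))
```

    The resulting string can then be passed to the parser without any lexer flag. The sketch deliberately ignores binding tokens, quotes, and abbreviations ending in a full stop, and it cannot help when the grammar contains both the capitalized and the lower-case form of a word; in that case you would have to try parsing both candidates.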