regex bnf

Regular expression for a language tag (as defined by BCP47)


I need a regular expression for a language tag as defined by BCP 47.

I know that the full BNF syntax is available at http://www.rfc-editor.org/rfc/bcp/bcp47.txt and that I could use it to write my own, but hopefully there is one already out there.


Solution

  • It looks like this (wrapped here for display; the actual pattern is a single line with no whitespace in it):

    ^((?<grandfathered>(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|
    i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)|(art-lojban|
    cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang))|((?<language>
    ([A-Za-z]{2,3}(-(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})
    (-(?<script>[A-Za-z]{4}))?(-(?<region>[A-Za-z]{2}|[0-9]{3}))?(-(?<variant>[A-Za-z0-9]{5,8}
    |[0-9][A-Za-z0-9]{3}))*(-(?<extension>[0-9A-WY-Za-wy-z](-[A-Za-z0-9]{2,8})+))*
    (-(?<privateUse>x(-[A-Za-z0-9]{1,8})+))?)|(?<privateUse>x(-[A-Za-z0-9]{1,8})+))$
    

    Here is the code to generate it (in C#):

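    using System;

    // Each variable mirrors one production of the ABNF in RFC 5646, section 2.1.
    var regular = "(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang)";
    var irregular = "(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)";
    var grandfathered = "(?<grandfathered>" + irregular + "|" + regular + ")";
    // privateuse = "x" 1*("-" (1*8alphanum))
    var privateUse = "(?<privateUse>x(-[A-Za-z0-9]{1,8})+)";
    // singleton = any single alphanumeric except x/X, which is reserved for private use
    var singleton = "[0-9A-WY-Za-wy-z]";
    var extension = "(?<extension>" + singleton + "(-[A-Za-z0-9]{2,8})+)";
    // variant = 5*8alphanum / (DIGIT 3alphanum)
    var variant = "(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})";
    // region = 2ALPHA / 3DIGIT
    var region = "(?<region>[A-Za-z]{2}|[0-9]{3})";
    var script = "(?<script>[A-Za-z]{4})";
    // extlang = 3ALPHA *2("-" 3ALPHA)
    var extlang = "(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2})";
    // language = 2*3ALPHA ["-" extlang] / 4ALPHA / 5*8ALPHA
    var language = "(?<language>([A-Za-z]{2,3}(-" + extlang + ")?)|[A-Za-z]{4}|[A-Za-z]{5,8})";
    // langtag = language ["-" script] ["-" region] *("-" variant) *("-" extension) ["-" privateuse]
    var langtag = "(" + language + "(-" + script + ")?" + "(-" + region + ")?" + "(-" + variant + ")*" + "(-" + extension + ")*" + "(-" + privateUse + ")?" + ")";
    // A complete tag is either grandfathered, a normal langtag, or entirely private use.
    var languageTag = @"^(" + grandfathered + "|" + langtag + "|" + privateUse + ")$";

    Console.WriteLine(languageTag);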
    

    I cannot guarantee its correctness (I may have made typos), but it works fine on the examples in Appendix A of RFC 5646 (the RFC in BCP 47 that defines the tag syntax).
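
    For anyone who wants to repeat that check, here is a minimal sketch of how it could be done. The class and method names (LanguageTagCheck, IsWellFormed, Part) are made up for illustration, and the sample tags are ones I believe appear in Appendix A. RegexOptions.IgnoreCase is my own addition, on the assumption that you want case-insensitive matching because BCP 47 tags are case-insensitive; the pattern as printed only accepts the grandfathered tags and the private-use "x" in the exact case shown.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class LanguageTagCheck
    {
        // The pattern printed above, split across literals only to keep lines short.
        private const string Pattern =
            "^((?<grandfathered>(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|" +
            "i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)|(art-lojban|" +
            "cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang))|((?<language>" +
            "([A-Za-z]{2,3}(-(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})" +
            "(-(?<script>[A-Za-z]{4}))?(-(?<region>[A-Za-z]{2}|[0-9]{3}))?(-(?<variant>[A-Za-z0-9]{5,8}" +
            "|[0-9][A-Za-z0-9]{3}))*(-(?<extension>[0-9A-WY-Za-wy-z](-[A-Za-z0-9]{2,8})+))*" +
            "(-(?<privateUse>x(-[A-Za-z0-9]{1,8})+))?)|(?<privateUse>x(-[A-Za-z0-9]{1,8})+))$";

        // Assumption: IgnoreCase is not part of the original answer; it is added so that
        // "EN-us" and "X-WHATEVER" are accepted as well.
        private static readonly Regex TagRegex =
            new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        // True when the whole string matches the pattern.
        public static bool IsWellFormed(string tag) => TagRegex.IsMatch(tag);

        // Text captured by a named group, or "" when the group did not take part in the match.
        public static string Part(string tag, string groupName) =>
            TagRegex.Match(tag).Groups[groupName].Value;

        public static void Main()
        {
            // Tags that should be accepted.
            string[] valid =
            {
                "de", "zh-cmn-Hans-CN", "sl-rozaj-biske", "de-CH-1901", "es-419",
                "de-CH-x-phonebk", "x-whatever", "en-US-u-islamcal", "i-enochian"
            };
            // Tags that should be rejected: two regions, and a one-letter primary subtag.
            string[] invalid = { "de-419-DE", "a-DE" };

            foreach (var tag in valid)
                Console.WriteLine($"{tag}: {IsWellFormed(tag)} (expected True)");
            foreach (var tag in invalid)
                Console.WriteLine($"{tag}: {IsWellFormed(tag)} (expected False)");

            // The named groups expose the individual subtags.
            Console.WriteLine(Part("zh-cmn-Hans-CN", "script")); // prints Hans
            Console.WriteLine(Part("zh-cmn-Hans-CN", "region")); // prints CN
        }
    }
    ```

    One limitation to be aware of: RFC 5646 also forbids repeating an extension singleton, so "ar-a-aaa-b-bbb-a-ccc" is not a valid tag, yet this pattern accepts it. That rule would need a separate check after matching.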

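    Depending on your regex engine, you might need to remove or rewrite the named capturing groups "?<...>". Some engines do not support that syntax at all, and others (Python and Java, for example) reject the pattern because the name "privateUse" is used twice.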
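
    If you do need a version without names, one way is to strip them mechanically instead of editing the pattern by hand. This is only a sketch; GroupNameStripper and StripGroupNames are hypothetical names, and the approach assumes group names consist of ASCII letters only, which is true for the pattern above.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class GroupNameStripper
    {
        // Turns "(?<name>" into "(" so the groups remain but lose their names.
        // Lookbehinds such as "(?<=" or "(?<!" are left alone, because a letter
        // must follow the "<" for the replacement to apply.
        public static string StripGroupNames(string pattern) =>
            Regex.Replace(pattern, @"\(\?<[A-Za-z]+>", "(");

        public static void Main()
        {
            // A fragment of the pattern above, used here only as sample input.
            var fragment = "(-(?<script>[A-Za-z]{4}))?(-(?<region>[A-Za-z]{2}|[0-9]{3}))?";
            Console.WriteLine(StripGroupNames(fragment));
            // prints (-([A-Za-z]{4}))?(-([A-Za-z]{2}|[0-9]{3}))?
        }
    }
    ```

    In the C# generator above you would pass languageTag through the same function before printing it. The groups are still capturing after this, just numbered instead of named.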