Tags: python, object, unicode, character-set

Initialize object for unicode fonts


I wrote a class to access the mathematical alphanumeric symbols in the Unicode block described at https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols

# Sans-serif
LATIN_SANSERIF_NORMAL_UPPER = (120224, 120250)
LATIN_SANSERIF_NORMAL_LOWER = (120250, 120276)
LATIN_SANSERIF_BOLD_UPPER = (120276, 120302)
LATIN_SANSERIF_BOLD_LOWER = (120302, 120328)
LATIN_SANSERIF_ITALIC_UPPER = (120328, 120354)
LATIN_SANSERIF_ITALIC_LOWER = (120354, 120380)
LATIN_SANSERIF_BOLDITALIC_UPPER = (120380, 120406)
LATIN_SANSERIF_BOLDITALIC_LOWER = (120406, 120432)

class MathAlphanumeric:
    def __init__(self, script, font, style, case):
        self.script = script
        self.font = font
        self.style = style
        self.case = case
        
    def charset(self):
        start, end = eval('_'.join([self.script, self.font, self.style, self.case]).upper())
        for c in range(start, end):
            yield chr(c)
    
    @staticmethod
    def supported_scripts():
        return {'latin', 'greek', 'digits'}
    
    @staticmethod
    def supported_fonts():
        return {'serif', 'sanserif', 'calligraphy', 'fraktor', 'monospace', 'doublestruck'}
    
    @staticmethod
    def supported_style():
        return {'normal', 'bold', 'italic', 'bold-italic'}
    
    @staticmethod
    def supported_case():
        return {'upper', 'lower'}
         

And to use it, I'll do:

ma = MathAlphanumeric('latin', 'sanserif', 'bold', 'lower')
print(list(ma.charset()))

[out]:

['𝗮', '𝗯', '𝗰', '𝗱', '𝗲', '𝗳', '𝗴', '𝗵', '𝗶', '𝗷', '𝗸', '𝗹', '𝗺', '𝗻', '𝗼', '𝗽', '𝗾', '𝗿', '𝘀', '𝘁', '𝘂', '𝘃', '𝘄', '𝘅', '𝘆', '𝘇']

The code works as expected, but to cover all the mathematical alphanumeric symbols I'll have to enumerate the start and end code points for every combination, i.e. script * fonts * style * case constants.

My questions are:


Solution

  • You may be interested in the unicodedata module from the standard library, specifically :

    A quick example :

    >>> import unicodedata
    >>> unicodedata.name(chr(0x1d5a0))
    'MATHEMATICAL SANS-SERIF CAPITAL A'
    >>> unicodedata.lookup("MATHEMATICAL SANS-SERIF CAPITAL A")
    '𝖠'
    >>> unicodedata.name(chr(0x1d504))
    'MATHEMATICAL FRAKTUR CAPITAL A'
    >>> unicodedata.lookup("MATHEMATICAL FRAKTUR CAPITAL A")
    '𝔄'
    

    Now you have to find all the names that unicodedata expects for your use cases, construct the corresponding string from them, and call lookup.
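
    One way to find those names is to scan the block itself. The following is only a sketch: the helper name_prefixes is made up for illustration, and the range U+1D400 to U+1D7FF is the Mathematical Alphanumeric Symbols block from the Wikipedia page linked in the question. It groups every assigned character by its name minus the last word, so you can see which font/style/case combinations exist and how many characters each one has.

```python
import unicodedata
from collections import Counter

# Mathematical Alphanumeric Symbols block: U+1D400..U+1D7FF (end is exclusive here)
BLOCK_START, BLOCK_END = 0x1D400, 0x1D800


def name_prefixes():
    """Count characters per name prefix (the name with its last word removed)."""
    prefixes = Counter()
    for cp in range(BLOCK_START, BLOCK_END):
        # The second argument is returned instead of raising for unassigned code points
        name = unicodedata.name(chr(cp), None)
        if name is None:
            continue
        prefix, _, _last_word = name.rpartition(" ")
        prefixes[prefix] += 1
    return prefixes


if __name__ == "__main__":
    for prefix, count in sorted(name_prefixes().items()):
        print(f"{count:3d}  {prefix}")
```

    A prefix with fewer than 26 entries for a Latin alphabet is a hint that some letters of that alphabet live elsewhere in Unicode.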

    Here is a mini proof-of-concept :

    import unicodedata
    import string
    
    
    def charset(script: str, font: str, style: str, case: str):
        features = ["MATHEMATICAL"]
        # TODO: use script
        assert font in MathAlphanumeric.supported_fonts(), f"invalid font {font!r}"
        features.append(font.upper())
        assert style in MathAlphanumeric.supported_style(), f"invalid style {style!r}"
        if style != "normal":
            if font == "fraktur":
                features.insert(-1, style.upper())  # "bold" must be before "fraktur"
            elif font in ("monospace", "double-struck"):
                pass  # it has only one style, and it is implicit
            else:
                # Unicode names spell it "BOLD ITALIC", without the hyphen
                features.append(style.upper().replace("-", " "))
        assert case in MathAlphanumeric.supported_case(), f"invalid case {case!r}"
        features.append("CAPITAL" if case == "upper" else "SMALL")
        # Unicode names always spell the letter in upper case (e.g. "... SMALL A")
        return tuple(unicodedata.lookup(" ".join(features + [letter])) for letter in string.ascii_uppercase)
    
    
    if __name__ == '__main__':
        print("".join(charset("latin", "sans-serif", "bold", "lower")))
        # 𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇
        print("".join(charset("latin", "fraktur", "bold", "upper")))
        # 𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅
        print("".join(charset("latin", "monospace", "bold", "upper")))
        # 𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉
        print("".join(charset("latin", "double-struck", "bold", "upper")))
        # KeyError: "undefined character name 'MATHEMATICAL DOUBLE-STRUCK CAPITAL C'"
    

    (I also changed your supported_fonts method slightly: return {'serif', 'sans-serif', 'calligraphy', 'fraktur', 'monospace', 'double-struck'})

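    But there are a lot of caveats in Unicode: it holds all the glyphs you could possibly want, but they are not organized in a coherent way (for historical reasons). The double-struck C that fails in my example was encoded earlier, outside this block, under a name without the MATHEMATICAL prefix. The same thing happens with Fraktur: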

    >>> unicodedata.name("𝔅")  # the letter copied from the Wikipedia page
    'MATHEMATICAL FRAKTUR CAPITAL B'
    >>> unicodedata.name("ℭ")  # same, but for C
    'BLACK-LETTER CAPITAL C'
    

    So you will need a lot of special cases.
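
    As a rough sketch of what such special-casing could look like, the hypothetical helper below (lookup_with_fallback is my own name, not part of unicodedata) tries the MATHEMATICAL name first and then falls back to the older spellings without the prefix, including BLACK-LETTER in place of FRAKTUR. It assumes the missing letters follow one of those two naming patterns, which is not true for every gap.

```python
import string
import unicodedata


def lookup_with_fallback(features, letter):
    """Try the MATHEMATICAL name first, then the older un-prefixed spellings."""
    plain = " ".join([*features, letter])
    candidates = [
        "MATHEMATICAL " + plain,
        plain,  # e.g. "DOUBLE-STRUCK CAPITAL C"
        plain.replace("FRAKTUR", "BLACK-LETTER"),  # e.g. "BLACK-LETTER CAPITAL C"
    ]
    for name in candidates:
        try:
            return unicodedata.lookup(name)
        except KeyError:
            continue  # this spelling does not exist, try the next one
    raise KeyError(f"no character found for {candidates[0]!r}")


if __name__ == "__main__":
    print("".join(lookup_with_fallback(["DOUBLE-STRUCK", "CAPITAL"], c)
                  for c in string.ascii_uppercase))
    print("".join(lookup_with_fallback(["FRAKTUR", "CAPITAL"], c)
                  for c in string.ascii_uppercase))
```

    This still raises KeyError for gaps whose replacement has an unrelated name (the italic small h, for instance, is encoded as PLANCK CONSTANT), so an explicit table of exceptions is probably unavoidable.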

    Also :