python macos unicode unicode-normalization hfs+

How to convert a path to the Mac OS X path form, the almost-NFD normal form?


Macs normally use the HFS+ file system, which normalizes paths. That is, if you save a file with an accented é in its name (u'\xe9'), for example, and then call os.listdir, you will see that the filename was converted to u'e\u0301'. This is ordinary Unicode NFD normalization, which the Python unicodedata module can handle. Unfortunately, HFS+ is not fully consistent with NFD, so some paths are not normalized the way NFD would normalize them. For example, 福 (u'\ufa1b') is left unchanged, although its NFD form is u'\u798f'.
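To make the mismatch concrete, here is a small sketch of what plain NFD does to the two example characters, using only the standard unicodedata module. The helper name nfd is just for illustration. HFS+ would agree with the first result but, as described above, would keep the second character as it is.

```python
import unicodedata

def nfd(text):
    """Return the standard Unicode NFD form of a unicode string."""
    return unicodedata.normalize('NFD', text)

e_acute = u'\xe9'      # precomposed e with acute accent
fu_compat = u'\ufa1b'  # CJK compatibility ideograph

# Plain NFD decomposes both characters.
print(repr(nfd(e_acute)))
print(repr(nfd(fu_compat)))
```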

So, how to do the normalization in Python? I would be fine using native APIs as long as I can call them from Python.


Solution

  • I decided to write out the Python solution, since the related question I pointed to was more about Objective-C.

    First you need to install https://pypi.python.org/pypi/pyobjc-core and https://pypi.python.org/pypi/pyobjc-framework-Cocoa. Then the following should work:

    import sys
    
    from Foundation import NSString, NSAutoreleasePool
    
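    def fs_normalize(path):
        # Keep an autorelease pool alive while calling into Foundation.
        _pool = NSAutoreleasePool.alloc().init()
        # Returns the path as a byte string in the file system's normalized form.
        normalized_path = NSString.fileSystemRepresentation(path)
        # Decode the byte string back to unicode.
        upath = unicode(normalized_path, sys.getfilesystemencoding() or 'utf8')
        return upath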
    
    if __name__ == '__main__':
        e = u'\xe9'
        j = u'\ufa1b'
        e_expected = u'e\u0301'
    
        assert fs_normalize(e) == e_expected
        assert fs_normalize(j) == j
    

    Note that NSString.fileSystemRepresentation() also seems to accept str input, but in some cases it returned garbage for me when given a str, so it is safer to pass unicode. It always returns a str, so you need to convert the result back to unicode.
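If PyObjC is not available, a rough pure-Python approximation may be enough for some uses. The sketch below is not the method above and is not guaranteed to match HFS+ exactly. It assumes, based on my reading of Apple's documentation, that HFS+ leaves the code point ranges U+2000 to U+2FFF, U+F900 to U+FAFF and U+2F800 to U+2FAFF undecomposed and applies NFD everywhere else. The names approx_hfs_normalize and _HFS_SKIP_RANGES are made up for this example, and the sketch ignores any differences between the Unicode version behind unicodedata and the tables HFS+ uses.

```python
import unicodedata

# Assumed ranges that HFS+ does not decompose (taken from Apple's
# documentation as I understand it; treat as an assumption).
_HFS_SKIP_RANGES = (
    (0x2000, 0x2FFF),
    (0xF900, 0xFAFF),
    (0x2F800, 0x2FAFF),
)

def _is_excluded(ch):
    """True if ch falls in a range assumed to be left as-is by HFS+."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _HFS_SKIP_RANGES)

def approx_hfs_normalize(path):
    """Approximate HFS+ normalization: NFD, except for excluded ranges."""
    out = []
    run = []  # consecutive characters that should go through NFD together
    for ch in path:
        if _is_excluded(ch):
            # Normalize the pending run, then copy the excluded character.
            out.append(unicodedata.normalize('NFD', u''.join(run)))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.append(unicodedata.normalize('NFD', u''.join(run)))
    return u''.join(out)
```

With the two characters from the question, approx_hfs_normalize(u'\xe9') gives u'e\u0301' and approx_hfs_normalize(u'\ufa1b') returns its input unchanged, which matches the asserts in the PyObjC version. On real data it is still worth comparing the result against what os.listdir reports.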