python, human-readable

Bytes to human-readable, and back, without data loss


I need to convert strings that contain a memory usage in bytes, like 1048576 (which is 1M), into a human-readable version, and vice versa.

Note: I looked here already: Get a human-readable version of a file size

And here (even though it isn't Python): How to convert human readable memory size into bytes?

Nothing so far helped me, so I looked elsewhere.

I have found something that does this for me here: http://code.google.com/p/pyftpdlib/source/browse/trunk/test/bench.py?spec=svn984&r=984#137

The Code:

def bytes2human(n, format="%(value)i%(symbol)s"):
    """
    >>> bytes2human(10000)
    '9K'
    >>> bytes2human(100001221)
    '95M'
    """
    symbols = ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y')
    prefix = {}
    for i, s in enumerate(symbols[1:]):
        prefix[s] = 1 << (i+1)*10
    for symbol in reversed(symbols[1:]):
        if n >= prefix[symbol]:
            value = float(n) / prefix[symbol]
            return format % locals()
    return format % dict(symbol=symbols[0], value=n)

And also a function for converting the other way (same site):

def human2bytes(s):
    """
    >>> human2bytes('1M')
    1048576
    >>> human2bytes('1G')
    1073741824
    """
    symbols = ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y')
    letter = s[-1:].strip().upper()
    num = s[:-1]
    assert num.isdigit() and letter in symbols
    num = float(num)
    prefix = {symbols[0]:1}
    for i, s in enumerate(symbols[1:]):
        prefix[s] = 1 << (i+1)*10
    return int(num * prefix[letter])

This works, but it loses information. For example:

>>> bytes2human(10000)
'9K'
>>> human2bytes('9K')
9216
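
The loss comes from the `%(value)i` conversion, which drops the fractional part before the symbol is appended. A small sketch of the arithmetic (the variable names are only illustrative):

```python
n = 10000
factor = 1 << 10                  # 1024, the factor used for 'K'

value = float(n) / factor         # 9.765625
shown = int("%i" % value)         # '%i' truncates the float to 9
restored = shown * factor         # 9216, what human2bytes('9K') returns

print(value, shown, restored, n - restored)   # 9.765625 9 9216 784
```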

To try to solve this, I changed the default format of bytes2human to:

format="%(value).3f%(symbol)s"

This is much nicer and gives me:

>>> bytes2human(10000)
'9.766K'

But when I try to convert the result back with the human2bytes function:

>>> human2bytes('9.766K')

Traceback (most recent call last):
  File "<pyshell#366>", line 1, in <module>
    human2bytes('9.766K')
  File "<pyshell#359>", line 12, in human2bytes
    assert num.isdigit() and letter in symbols
AssertionError

This is because of the '.' in the number: num.isdigit() returns False for '9.766', so the assertion fails.
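
The failing check can be reproduced on its own. The sketch below also shows one possible replacement test based on float(); `is_number` is a hypothetical helper, not part of the pyftpdlib code:

```python
num = '9.766'
print(num.isdigit())   # False: str.isdigit() rejects the decimal point
print(float(num))      # 9.766


def is_number(text):
    """Hypothetical check: True if float() can parse text.

    Note that float() also accepts forms such as '1e3', 'inf' and 'nan'.
    """
    try:
        float(text)
    except ValueError:
        return False
    return True


print(is_number('9.766'))   # True
print(is_number('9.7.6'))   # False
```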

So my question is: how can I convert a human-readable version back into bytes without data loss?

Note: I know that 3 decimal places also loses a little data. For the purposes of this question, let's ignore that for now; I can always increase the precision.
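
On the precision point: every factor used here is a power of two, so n divided by the factor always has a finite decimal expansion (at most 10 decimal places for K, 20 for M, and so on). A minimal sketch of an exact round trip, assuming the single-letter symbols from the question; `exact_bytes2human` and `exact_human2bytes` are hypothetical names, not part of pyftpdlib:

```python
from decimal import Decimal, localcontext
from fractions import Fraction

SYMBOLS = ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y')


def exact_bytes2human(n):
    """Format n bytes with as many decimals as the exact value needs."""
    n = int(n)
    if n < 0:
        raise ValueError("n < 0")
    for i in reversed(range(1, len(SYMBOLS))):
        factor = 1 << (i * 10)
        if n >= factor:
            with localcontext() as ctx:
                # The quotient has at most len(str(n)) integer digits and
                # 10*i decimal places, so this precision keeps it exact.
                ctx.prec = len(str(n)) + 10 * i
                value = Decimal(n) / Decimal(factor)
            return "%s%s" % (format(value, 'f'), SYMBOLS[i])
    return "%d%s" % (n, SYMBOLS[0])


def exact_human2bytes(s):
    """Parse the output of exact_bytes2human back into an integer."""
    s = s.strip()
    number, symbol = s[:-1].strip(), s[-1].upper()
    if symbol not in SYMBOLS:
        raise ValueError("can't interpret %r" % s)
    factor = 1 << (SYMBOLS.index(symbol) * 10)
    value = Fraction(number) * factor   # exact rational arithmetic
    if value.denominator != 1:
        raise ValueError("%r is not a whole number of bytes" % s)
    return int(value)


print(exact_bytes2human(10000))         # 9.765625K
print(exact_human2bytes('9.765625K'))   # 10000
```

The trade-off is that the output can get long for large units, so this only suits cases where the string has to be reversible rather than pretty.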


Solution

  • It turns out the answer was much simpler than I thought: one of the links I provided led to a more complete version of these functions (ActiveState recipe 578019).

    It accepts fractional values and several symbol sets, and has handled every input I have given it.

    Thank you for your help.

    The code is copied here for posterity:

    ## {{{ http://code.activestate.com/recipes/578019/ (r15)
    #!/usr/bin/env python
    
    """
    Bytes-to-human / human-to-bytes converter.
    Based on: http://goo.gl/kTQMs
    Working with Python 2.x and 3.x.
    
    Author: Giampaolo Rodola' <g.rodola [AT] gmail [DOT] com>
    License: MIT
    """
    
    # see: http://goo.gl/kTQMs
    SYMBOLS = {
        'customary'     : ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y'),
        'customary_ext' : ('byte', 'kilo', 'mega', 'giga', 'tera', 'peta', 'exa',
                           'zetta', 'yotta'),  # 'yotta' is the SI prefix spelling
        'iec'           : ('Bi', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi', 'Yi'),
        'iec_ext'       : ('byte', 'kibi', 'mebi', 'gibi', 'tebi', 'pebi', 'exbi',
                           'zebi', 'yobi'),
    }
    
    def bytes2human(n, format='%(value).1f %(symbol)s', symbols='customary'):
        """
        Convert n bytes into a human readable string based on format.
        symbols can be either "customary", "customary_ext", "iec" or "iec_ext",
        see: http://goo.gl/kTQMs
    
          >>> bytes2human(0)
          '0.0 B'
          >>> bytes2human(0.9)
          '0.0 B'
          >>> bytes2human(1)
          '1.0 B'
          >>> bytes2human(1.9)
          '1.0 B'
          >>> bytes2human(1024)
          '1.0 K'
          >>> bytes2human(1048576)
          '1.0 M'
          >>> bytes2human(1099511627776127398123789121)
          '909.5 Y'
    
          >>> bytes2human(9856, symbols="customary")
          '9.6 K'
          >>> bytes2human(9856, symbols="customary_ext")
          '9.6 kilo'
          >>> bytes2human(9856, symbols="iec")
          '9.6 Ki'
          >>> bytes2human(9856, symbols="iec_ext")
          '9.6 kibi'
    
          >>> bytes2human(10000, "%(value).1f %(symbol)s/sec")
          '9.8 K/sec'
    
          >>> # precision can be adjusted by playing with %f operator
          >>> bytes2human(10000, format="%(value).5f %(symbol)s")
          '9.76562 K'
        """
        n = int(n)
        if n < 0:
            raise ValueError("n < 0")
        symbols = SYMBOLS[symbols]
        prefix = {}
        for i, s in enumerate(symbols[1:]):
            prefix[s] = 1 << (i+1)*10
        for symbol in reversed(symbols[1:]):
            if n >= prefix[symbol]:
                value = float(n) / prefix[symbol]
                return format % locals()
        return format % dict(symbol=symbols[0], value=n)
    
    def human2bytes(s):
        """
        Attempts to guess the string format based on default symbols
        set and return the corresponding bytes as an integer.
        When unable to recognize the format ValueError is raised.
    
          >>> human2bytes('0 B')
          0
          >>> human2bytes('1 K')
          1024
          >>> human2bytes('1 M')
          1048576
          >>> human2bytes('1 Gi')
          1073741824
          >>> human2bytes('1 tera')
          1099511627776
    
          >>> human2bytes('0.5kilo')
          512
          >>> human2bytes('0.1  byte')
          0
          >>> human2bytes('1 k')  # k is an alias for K
          1024
          >>> human2bytes('12 foo')
          Traceback (most recent call last):
              ...
          ValueError: can't interpret '12 foo'
        """
        init = s
        num = ""
        while s and (s[0:1].isdigit() or s[0:1] == '.'):
            num += s[0]
            s = s[1:]
        num = float(num)
        letter = s.strip()
        for name, sset in SYMBOLS.items():
            if letter in sset:
                break
        else:
            if letter == 'k':
                # treat 'k' as an alias for 'K' as per: http://goo.gl/kTQMs
                sset = SYMBOLS['customary']
                letter = letter.upper()
            else:
                raise ValueError("can't interpret %r" % init)
        prefix = {sset[0]:1}
        for i, s in enumerate(sset[1:]):
            prefix[s] = 1 << (i+1)*10
        return int(num * prefix[letter])
    
    
    if __name__ == "__main__":
        import doctest
        doctest.testmod()
    ## end of http://code.activestate.com/recipes/578019/ }}}
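
One caveat when round-tripping with the recipe above: human2bytes finishes with int(...), which truncates, so a value that was rounded down while formatting can come back one byte short. The sketch below reproduces just that arithmetic for 10000 bytes with the 5-decimal format. It assumes rounding to the nearest byte is acceptable, which should recover the original as long as the number of decimals d satisfies 10**d > factor:

```python
factor = 1 << 10                                # 1024, the factor for 'K'

text = "%.5f" % (10000 / float(factor))         # '9.76562', as in the doctest
truncated = int(float(text) * factor)           # 9999: what int() gives back
rounded = int(round(float(text) * factor))      # 10000: nearest byte

print(text, truncated, rounded)
```

Replacing the final int(...) in human2bytes with int(round(...)) would be one way to apply this, at the cost of departing from the published recipe.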