My Ceph cluster shows the following weird behavior in the ceph df output:
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    817 TiB  399 TiB  418 TiB  418 TiB   51.21
ssd    1.4 TiB  1.2 TiB  22 GiB   174 GiB   12.17
TOTAL  818 TiB  400 TiB  418 TiB  419 TiB   51.15

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
pool1                  45  300   21 TiB   6.95M    65 TiB   20.23  85 TiB
pool2                  50  50    72 GiB   289.15k  357 GiB  0.14   85 TiB
pool3                  53  64    2.9 TiB  754.06k  8.6 TiB  3.24   85 TiB
erasurepool_data       57  1024  138 TiB  50.81M   241 TiB  48.49  154 TiB
erasurepool_metadata   58  8     9.1 GiB  1.68M    27 GiB   2.46   362 GiB
device_health_metrics  59  1     22 MiB   163      66 MiB   0      85 TiB
.rgw.root              60  8     5.6 KiB  17       3.5 MiB  0      85 TiB
.rgw.log               61  8     70 MiB   2.56k    254 MiB  0      85 TiB
.rgw.control           62  8     0 B      8        0 B      0      85 TiB
.rgw.meta              63  8     7.6 MiB  52       32 MiB   0      85 TiB
.rgw.buckets.index     64  8     11 GiB   1.69k    34 GiB   3.01   362 GiB
.rgw.buckets.data      65  512   23 TiB   33.87M   72 TiB   21.94  85 TiB
As seen above, available raw storage is 399 TiB, while MAX AVAIL in the pool list shows 85 TiB. I use 3 replicas for each replicated pool and 3+2 erasure coding for erasurepool_data.
As far as I know, the MAX AVAIL column shows the maximum available capacity adjusted for the replica size, so that works out to 85 * 3 = 255 TiB of raw capacity. Meanwhile, the cluster reports almost 400 TiB of raw space available.
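A quick check of that arithmetic against both pool types, using only the numbers from the output above (plain arithmetic, rounding aside):

# Replicated pools (size = 3): 1 logical byte costs 3 raw bytes.
print(85 * 3)              # 255 TiB of raw headroom implied by MAX AVAIL
# 3+2 erasure-coded pool: 1 logical byte costs (3 + 2) / 3 raw bytes.
print(154 * (3 + 2) / 3)   # ~257 TiB of raw headroom, essentially the same figure
# Both are well below the 399 TiB reported as AVAIL under RAW STORAGE.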
Which figure should I trust? Is this just a bug?
The MAX AVAIL column represents the amount of data that can be stored before the first OSD becomes full. It takes into account the projected distribution of data across OSDs from the CRUSH map and uses the first OSD to fill up as the target, so this does not appear to be a bug. If MAX AVAIL is not what you expect it to be, look at the data distribution (for example with ceph osd df, or the CRUSH hierarchy with ceph osd tree) and make sure it is uniform.
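As a rough illustration of that projection (this is only a sketch of the idea, not Ceph's actual implementation, and the OSD weights and free-space numbers below are made up):

# Rough sketch of a per-pool MAX AVAIL projection: how much more data can
# the pool take before the *first* OSD runs out of space?
# NOT Ceph's exact code; all OSD numbers here are illustrative.

def projected_max_avail(osds, data_ratio):
    """Estimate usable (logical) pool capacity before the first OSD fills up.

    data_ratio: raw bytes written per logical byte stored, e.g. 3.0 for a
    size=3 replicated pool or (3 + 2) / 3 for a 3+2 erasure-coded pool.
    """
    total_weight = sum(o["crush_weight"] for o in osds)
    # Each OSD is expected to receive weight / total_weight of the pool's raw
    # data; the OSD with the least headroom relative to that share caps how
    # much raw data the whole pool can still absorb.
    raw_limit = min(
        o["free_tib"] / (o["crush_weight"] / total_weight) for o in osds
    )
    return raw_limit / data_ratio

# Made-up OSDs: one is noticeably fuller than the others, so it limits
# MAX AVAIL even though plenty of raw space remains elsewhere.
osds = [
    {"crush_weight": 10.9, "free_tib": 5.4},  # fullest OSD -> the bottleneck
    {"crush_weight": 10.9, "free_tib": 7.1},
    {"crush_weight": 10.9, "free_tib": 7.3},
]

print(projected_max_avail(osds, data_ratio=3.0))          # replicated, size=3
print(projected_max_avail(osds, data_ratio=(3 + 2) / 3))  # EC 3+2

With a perfectly uniform distribution the projection approaches the raw AVAIL divided by the data ratio; the more skewed the fullest OSD is, the further MAX AVAIL drops below that, which is exactly the gap you are seeing.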
You can also check some helpful posts here that explain some of the miscalculations:
Since you have erasure coding involved, please also check this SO post: