python, statistics, bayesian, pymc3, pymc

Fit a distribution from quantiles


I'm trying to replicate an example from SAS in Python where I fit a distribution from summary statistics. The summary statistics available to me are the total count, min, max, p50, p75, p85, p95, p98, p99, and p99.9. The measurements come from a distributed network of machines and are either latency or size distributions. The goal is to reconstruct the mixture from each machine's summaries, combine those distributions to estimate the distribution of the entire network, and do this at a regular interval in a streaming fashion.

I've looked through the documentation for PyMC, Pyro, and Pomegranate and get the general gist of mixture models, but what I don't understand is how to set up the initial parameters for each distribution, which distribution to use given the data available to me, or how to shift each distribution to its corresponding quantiles to construct the overall distribution.

Is this possible given any of these frameworks?


Solution

  • Answering my own question with some help from the Pyro forums. The code below solves the first half of the problem, finding a distribution whose CDF matches the collected quantiles; a sketch for the second half, combining the per-machine fits, follows after the output.

    import torch
    import torch.distributions as dist
    from torch.optim import Adam
    
    from typing import List, Tuple
    
    
    def find_cauchy_params(quantiles: List[Tuple[float, float]]):
        # Location (alpha) and scale (beta) of the Cauchy, both optimized
        # by gradient descent from an initial guess of 1.0.
        alpha = torch.tensor(1.0, requires_grad=True)
        beta = torch.tensor(1.0, requires_grad=True)
    
        quantile_tensors = [
            (quantile, torch.tensor(quantile_value))
            for quantile, quantile_value in quantiles
        ]
    
        def loss_fn():
            # Sum of squared errors between each target quantile level and
            # the candidate Cauchy's CDF at the observed quantile value.
            loss = 0.0
            d = dist.Cauchy(alpha, beta)
            for quantile, quantile_value in quantile_tensors:
                loss += (quantile - d.cdf(quantile_value)) ** 2
            return loss
    
        # Note: nothing constrains beta to stay positive; it does with these
        # inputs, but optimizing log(beta) would be safer in general.
        optim = Adam([alpha, beta], lr=0.01)
        for step in range(1000):
            optim.zero_grad()
            loss = loss_fn()
            print("loss", loss)
            loss.backward()
            optim.step()
    
        print("alpha = {}".format(alpha.item()))
        print("beta = {}".format(beta.item()))
    
    
    find_cauchy_params(
        [(0.5, 0.0), (0.75, 0.0), (0.95, 1.0), (0.98, 1.0), (0.99, 8.0), (0.999, 11.0)]
    )
    

    Truncated Output:

    ...
    loss tensor(0.0317, grad_fn=<AddBackward0>)
    loss tensor(0.0317, grad_fn=<AddBackward0>)
    loss tensor(0.0317, grad_fn=<AddBackward0>)
    loss tensor(0.0317, grad_fn=<AddBackward0>)
    alpha = -0.04828706011176109
    beta = 0.11657208949327469
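
    For the second half of the problem, combining the per-machine fits into a network-wide estimate, one approach (a sketch, not an API from any of the frameworks above) is Monte Carlo pooling: sample from each machine's fitted distribution in proportion to that machine's total count, concatenate the samples, and read off empirical quantiles. The combine_machines function below is illustrative; it assumes each machine is summarized by its fitted Cauchy loc/scale plus its observation count, and the second machine's parameters are made up for the example.

    import torch
    import torch.distributions as dist
    
    from typing import List, Tuple
    
    
    def combine_machines(
        machines: List[Tuple[float, float, int]],  # (loc, scale, count)
        quantiles: List[float],
        samples_per_obs: int = 10,
    ) -> List[float]:
        # Sample each fitted Cauchy in proportion to the number of
        # observations the machine contributed.
        pooled = torch.cat(
            [
                dist.Cauchy(loc, scale).sample((count * samples_per_obs,))
                for loc, scale, count in machines
            ]
        )
        # Empirical quantiles of the pooled samples estimate the
        # network-wide distribution. The Cauchy's heavy tails make extreme
        # quantiles (p99.9) noisy unless samples_per_obs is large.
        return torch.quantile(pooled, torch.tensor(quantiles)).tolist()
    
    
    print(
        combine_machines(
            [(-0.0483, 0.1166, 1000), (0.2, 0.3, 500)],
            [0.5, 0.75, 0.95, 0.99],
        )
    )

    Since latency and size measurements are positive and right-skewed, a dist.LogNormal (which also implements cdf) may be a better fit than the symmetric Cauchy; swapping it into both the quantile-matching loss and the pooling step is a small change, though that choice should be validated against the data.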