python bayesian-networks synthetic pomegranate

Saving pomegranate Bayesian Network models


I am building some rather large Bayesian networks for generating synthetic data, and I find pomegranate a good fit because it generates data quickly and makes it easy to supply evidence. I have one problem with it: saving the trained models. Pomegranate's built-in methods store the model as JSON so large that I run out of memory once I have 30 or so variables, even when using "lighter" algorithms. The models also cannot be pickled; the attempt fails with the error

TypeError: self.distributions_ptr,self.parent_count,self.parent_idxs cannot be converted to a Python object for pickling

I am wondering if anyone has a good alternative for storing pomegranate models, or else knows of a Bayesian Network library that generates data quickly after training. I would be grateful for any tips.


Solution

  • If your model can be learned and held in memory, it can be saved to a file, though maybe not by pickling. There are many different file formats for Bayesian networks (bif, xmlbif, dsl, uai, etc.). I don't know pomegranate, but there is certainly a way to read and save models in one of these formats. With pyAgrum (of which I am one of the authors), you just write gum.saveBN(model, "model.xxx") to save the model and bn=gum.loadBN("model.xxx") to read it back. You can choose xxx among the supported formats, currently bif|dsl|net|bifxml|o3prm|uai (https://pyagrum.readthedocs.io/en/1.3.1/functions.html#pyAgrum.loadBN).

    As far as I understand, evidence for sampling is just a way to filter the samples, keeping only those that satisfy the constraints (rejection sampling). There is no direct method for this in pyAgrum, but it can be done as a post-processing step:

    import pyAgrum as gum
    
    #create a BN with random CPTs
    bn=gum.fastBN("A->B{yes|maybe|no}<-C->D->E<-F<-B") 
    
    # generate a sample of size 100
    g=gum.BNDatabaseGenerator(bn)
    g.setRandomVarOrder()
    g.drawSamples(100)
    df=g.to_pandas()
    
    #filtering the dataframe
    rslt_df = df[(df['B'] == "yes") & 
                 (df['E'] == "1")] 
    

    And in a notebook:

    [Image: Jupyter notebook]
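
To make the rejection-sampling idea concrete without depending on any library, here is a rough pure-Python sketch. The three-variable network (Rain -> WetGrass <- Sprinkler), its probabilities, and the helper names sample_network and rejection_sample are all hypothetical and chosen only for illustration; they do not come from pomegranate or pyAgrum.

```python
import random


def sample_network(rng):
    """Draw one joint sample from a toy network Rain -> WetGrass <- Sprinkler."""
    rain = rng.random() < 0.2
    sprinkler = rng.random() < 0.3
    # P(WetGrass = True | Rain, Sprinkler), made-up values for the sketch
    if rain and sprinkler:
        p_wet = 0.95
    elif rain:
        p_wet = 0.8
    elif sprinkler:
        p_wet = 0.7
    else:
        p_wet = 0.05
    wet = rng.random() < p_wet
    return {"Rain": rain, "Sprinkler": sprinkler, "WetGrass": wet}


def rejection_sample(n, evidence, seed=0, max_draws=1_000_000):
    """Keep drawing joint samples; retain only those that match the evidence."""
    rng = random.Random(seed)
    kept, draws = [], 0
    while len(kept) < n and draws < max_draws:
        draws += 1
        sample = sample_network(rng)
        if all(sample[var] == val for var, val in evidence.items()):
            kept.append(sample)
    return kept, draws


samples, draws = rejection_sample(1000, {"WetGrass": True})
p_rain = sum(s["Rain"] for s in samples) / len(samples)

# For these made-up numbers the exact posterior is 0.169 / 0.365, about 0.463.
print(f"kept {len(samples)} of {draws} draws; estimated P(Rain | WetGrass) = {p_rain:.3f}")
```

The number of draws needed grows as the evidence becomes less likely, because every sample that contradicts the evidence is thrown away. That is the main cost of handling evidence by filtering after the fact.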