Tags: python, dataframe, data-generation

Measuring the quality of VAE-generated tabular data


I have the following pandas DataFrame in Python:

index    time  flag     date  number
    0  2.7584     0  2.91844       3
    1  1.1234     1  3.58941       4
    2  5.8583     1  1.81801       5
  ...     ...   ...      ...     ...
  305  1.0493     0  1.98321       1

I am using a Variational Autoencoder to generate new data from the previous DataFrame. Here is an example with 500 generated rows:

index     time    flag     date   number
    0   1.9483  0.9483  1.49302   2.9489
    1   2.9849  1.0849  2.28347   3.8472
    2   0.8329  1.8218  3.23432   5.0192
  ...      ...     ...      ...      ...
  499  -0.2181  0.0918   1.2382  0.98493

I would like to know whether there is a metric or function already implemented in Python that can measure the goodness of the data generated in the second DataFrame given the first one.

It doesn't need to be a very complex metric; given the original and the generated DataFrames, it should simply tell me how good the generated data is, or how similar it is to the original.


Solution

  • Assuming the two data frames have the same shape, and you want to compare them element-wise (i.e., compare the rows in order), then the simplest approach is to use numpy.linalg.norm to compute the norm of each column after subtracting the two dataframes. The type of norm is specified with the ord parameter; the most common choices are:

      • ord=2 (the default): the Euclidean norm, the square root of the sum of squared differences.
      • ord=1: the sum of the absolute differences.
      • ord=np.inf: the maximum absolute difference.

    In all cases, a small norm means the columns are "close", where the precise definition of "close" depends on the choice of norm, and a value of 0 means they are identical.

    After computing the norm of each column, you can combine them into a single score by scaling each norm and adding them together. A column with much larger numeric values, or whose "closeness" matters less, should get a smaller weight so it doesn't dominate the total; conversely, a column with smaller numeric values, or whose "closeness" matters more, should get a larger weight.

    Here's how it looks using the first 3 rows of your data:

    import numpy as np
    import pandas as pd

    col_names = ["time", "flag", "date", "number"]
    # First 3 rows of the original data
    data_set_0 = [
        [2.7584, 0, 2.91844, 3],
        [1.1234, 1, 3.58941, 4],
        [5.8583, 1, 1.81801, 5],
    ]
    # First 3 rows of the generated data
    data_set_1 = [
        [1.9483, 0.9483, 1.49302, 2.9489],
        [2.9849, 1.0849, 2.28347, 3.8472],
        [0.8329, 1.8218, 3.23432, 5.0192],
    ]
    pd0 = pd.DataFrame(data_set_0, columns=col_names)
    pd1 = pd.DataFrame(data_set_1, columns=col_names)

    # Column-wise norms of the element-wise differences (axis=0 gives one value per column)
    print("2-norm of cols:  ", np.linalg.norm(pd0-pd1, axis=0))
    print("1-norm of cols:  ", np.linalg.norm(pd0-pd1, ord=1, axis=0))
    print("inf-norm of cols:", np.linalg.norm(pd0-pd1, ord=np.inf, axis=0))
    

    This prints:

    2-norm of cols:   [5.41997135 1.25771067 2.39650485 0.1622581 ]
    1-norm of cols:   [7.697   1.855   4.14767 0.2231 ]
    inf-norm of cols: [5.0254  0.9483  1.42542 0.1528 ]
    
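    To see what each norm is actually computing, here is a quick hand check of the first column (time), using the three row differences from the snippet above (diffs is just an illustrative local name):

    diffs = [2.7584 - 1.9483, 1.1234 - 2.9849, 5.8583 - 0.8329]  # [0.8101, -1.8615, 5.0254]
    print(sum(d ** 2 for d in diffs) ** 0.5)  # 2-norm:   ~5.41997
    print(sum(abs(d) for d in diffs))         # 1-norm:   7.697
    print(max(abs(d) for d in diffs))         # inf-norm: 5.0254

    These match the first entry of each array printed above.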

    A weighting can be applied:

    # Per-column weights, in the order [time, flag, date, number]
    weights = [1, 5, 2, 1]
    print("2-norm:   ", np.inner(weights, np.linalg.norm(pd0-pd1, axis=0)))
    print("1-norm:   ", np.inner(weights, np.linalg.norm(pd0-pd1, ord=1, axis=0)))
    print("inf-norm: ", np.inner(weights, np.linalg.norm(pd0-pd1, ord=np.inf, axis=0)))
    

    This weighting makes the flag differences 5 times more important than the time and number differences, and the date differences twice as important. This prints:

    2-norm:    16.66379250817543
    1-norm:    25.49044
    inf-norm:  12.77054
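
    As a quick sanity check of the weighted result, the weighted 2-norm is just the inner product of the weights with the per-column 2-norms printed earlier; this small sketch reuses pd0, pd1 and weights from the snippets above:

    norms_2 = np.linalg.norm(pd0 - pd1, axis=0)            # [5.41997, 1.25771, 2.39650, 0.16226]
    manual = sum(w * n for w, n in zip(weights, norms_2))  # 1*5.41997 + 5*1.25771 + 2*2.39650 + 1*0.16226
    print(manual)                                          # ~16.66379, matching np.inner above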