Tags: python, dataframe, data-generation

Measuring the quality of VAE-generated tabular data


I have the following pandas DataFrame in Python:

index    time  flag     date  number
    0  2.7584     0  2.91844       3
    1  1.1234     1  3.58941       4
    2  5.8583     1  1.81801       5
  ...     ...   ...      ...     ...
  305  1.0493     0  1.98321       1

I am using a Variational Autoencoder to generate new data from the previous DataFrame. Here is an example with 500 generated rows:

index     time    flag     date   number
    0   1.9483  0.9483  1.49302   2.9489
    1   2.9849  1.0849  2.28347   3.8472
    2   0.8329  1.8218  3.23432   5.0192
  ...      ...     ...      ...      ...
  499  -0.2181  0.0918   1.2382  0.98493

I would like to know whether there is a metric or function already implemented in Python that can measure the goodness of the data generated in the second DataFrame given the first one.

It doesn't need to be a very complex metric; given the original and the generated DataFrames, it should simply tell me how good the generated data is, or how similar it is to the original.


Solution

  • Assuming the two data frames have the same shape, and you want to compare them element-wise (i.e., compare the rows in order), then the simplest approach is to use numpy.linalg.norm to compute the norm of each column after subtracting the two dataframes. The type of norm is specified with the ord parameter; the most common choices are:

      • ord=2 (the default): the Euclidean norm, the square root of the sum of squared differences.
      • ord=1: the sum of the absolute differences.
      • ord=np.inf: the maximum absolute difference.

    In all cases, a small norm means the columns are "close", where the precise definition of "close" depends on the choice of norm, and a value of 0 means they are identical.

    After computing the norm of each column, you can combine them into a single score by scaling each norm and adding them together. A column with much larger numeric values, or whose "closeness" matters less, should get a smaller weight so it doesn't dominate the total; conversely, a column with smaller numeric values, or whose "closeness" matters more, should get a larger weight.

    Here's how it looks using the first 3 rows of your data:

    import numpy as np
    import pandas as pd

    col_names = ["time", "flag", "date", "number"]
    # First 3 rows of the original data
    data_set_0 = [
        [2.7584, 0, 2.91844, 3],
        [1.1234, 1, 3.58941, 4],
        [5.8583, 1, 1.81801, 5],
    ]
    # First 3 rows of the generated data
    data_set_1 = [
        [1.9483, 0.9483, 1.49302, 2.9489],
        [2.9849, 1.0849, 2.28347, 3.8472],
        [0.8329, 1.8218, 3.23432, 5.0192],
    ]
    pd0 = pd.DataFrame(data_set_0, columns=col_names)
    pd1 = pd.DataFrame(data_set_1, columns=col_names)

    # Column-wise norms of the element-wise differences (axis=0 gives one value per column)
    print("2-norm of cols:  ", np.linalg.norm(pd0-pd1, axis=0))
    print("1-norm of cols:  ", np.linalg.norm(pd0-pd1, ord=1, axis=0))
    print("inf-norm of cols:", np.linalg.norm(pd0-pd1, ord=np.inf, axis=0))
    

    This prints:

    2-norm of cols:   [5.41997135 1.25771067 2.39650485 0.1622581 ]
    1-norm of cols:   [7.697   1.855   4.14767 0.2231 ]
    inf-norm of cols: [5.0254  0.9483  1.42542 0.1528 ]
    
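    To see what each norm is actually computing, here is a quick hand check of the first column (time), using the three row differences from the snippet above (diffs is just an illustrative local name):

    diffs = [2.7584 - 1.9483, 1.1234 - 2.9849, 5.8583 - 0.8329]  # [0.8101, -1.8615, 5.0254]
    print(sum(d ** 2 for d in diffs) ** 0.5)  # 2-norm:   ~5.41997
    print(sum(abs(d) for d in diffs))         # 1-norm:   7.697
    print(max(abs(d) for d in diffs))         # inf-norm: 5.0254

    These match the first entry of each array printed above.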

    A weighting can be applied:

    # Per-column weights, in the order [time, flag, date, number]
    weights = [1, 5, 2, 1]
    print("2-norm:   ", np.inner(weights, np.linalg.norm(pd0-pd1, axis=0)))
    print("1-norm:   ", np.inner(weights, np.linalg.norm(pd0-pd1, ord=1, axis=0)))
    print("inf-norm: ", np.inner(weights, np.linalg.norm(pd0-pd1, ord=np.inf, axis=0)))
    

    This weighting makes the flag differences 5 times more important than the time and number differences, and the date differences twice as important. This prints:

    2-norm:    16.66379250817543
    1-norm:    25.49044
    inf-norm:  12.77054
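
    As a quick sanity check of the weighted result, the weighted 2-norm is just the inner product of the weights with the per-column 2-norms printed earlier; this small sketch reuses pd0, pd1 and weights from the snippets above:

    norms_2 = np.linalg.norm(pd0 - pd1, axis=0)            # [5.41997, 1.25771, 2.39650, 0.16226]
    manual = sum(w * n for w, n in zip(weights, norms_2))  # 1*5.41997 + 5*1.25771 + 2*2.39650 + 1*0.16226
    print(manual)                                          # ~16.66379, matching np.inner above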