I have several datasets, and for each one I'd like to create a fake dataset that is representative of it. I need to do this dynamically, based only on the column types (numeric, object).
Here's an example:
import pandas as pd
import random

# Create a dictionary with columns as lists
data = {
    'ObjectColumn1': [f'Object1_{i}' for i in range(1, 11)],
    'ObjectColumn2': [f'Object2_{i}' for i in range(1, 11)],
    'ObjectColumn3': [f'Object3_{i}' for i in range(1, 11)],
    'NumericColumn1': [random.randint(1, 100) for _ in range(10)],
    'NumericColumn2': [random.uniform(1.0, 10.0) for _ in range(10)],
    'NumericColumn3': [random.randint(1000, 2000) for _ in range(10)],
    'NumericColumn4': [random.uniform(10.0, 20.0) for _ in range(10)]
}

# Create the DataFrame
df = pd.DataFrame(data)
Let's say the above dataset has m (=3) object columns, n (=4) numeric columns, and x (=10) rows. I'd like to create a fake dataset of N (=10,000) rows, so that:
Here's how fake_data should look if N = 4:
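IIUC, something like this should do what you want. It splits the input dataframe into object and numeric columns, then builds N rows as described in the question: each object column is sampled at random (with replacement) from its original values, and each numeric column is drawn uniformly between that column's minimum and median. Finally, it adds an extra object column sampled at random from the supplied list: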
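import random
import numpy as np
import pandas as pd

def make_fake_data(df, N, extra):
    # Object columns: sample N values (with replacement) from each original column
    df_obj = df.select_dtypes('object')
    obj_out = pd.DataFrame({col: np.random.choice(df_obj[col], N) for col in df_obj.columns})
    # Numeric columns: draw N values uniformly between each column's min and median (NaNs ignored)
    df_num = df.select_dtypes('number')
    num_out = pd.DataFrame({col: np.random.uniform(np.nanmin(df_num[col]), np.nanmedian(df_num[col]), N) for col in df_num.columns})
    # Extra column: sample N values (with replacement) from the supplied list
    ext_out = pd.DataFrame({'ExtraObjectColumn': random.choices(extra, k=N)})
    return pd.concat([obj_out, num_out, ext_out], axis=1)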
Sample usage:
make_fake_data(df, 20, ['a', 'b', 'c', 'd'])
Sample output:
    ObjectColumn1  ObjectColumn2  ObjectColumn3  ...  NumericColumn3  NumericColumn4  ExtraObjectColumn
0       Object1_4      Object2_1      Object3_4  ...     1322.269370       14.502498                  d
1       Object1_6      Object2_5      Object3_5  ...     1314.941227       12.478253                  c
2       Object1_6      Object2_7      Object3_7  ...     1418.271732       11.214247                  a
3       Object1_4      Object2_9      Object3_9  ...     1269.408303       11.404303                  c
4       Object1_3      Object2_6      Object3_4  ...     1426.038132       14.251836                  a
5       Object1_1      Object2_2      Object3_1  ...     1212.806903       14.750310                  c
6      Object1_10      Object2_7      Object3_1  ...     1294.254746       10.692256                  d
7       Object1_1      Object2_7      Object3_3  ...     1232.854020       10.438323                  c
8       Object1_5      Object2_5      Object3_7  ...     1205.779688       14.763409                  c
9       Object1_7      Object2_6      Object3_2  ...     1287.248660       10.384493                  b
10      Object1_4      Object2_2      Object3_1  ...     1237.738855       14.054841                  b
11      Object1_7      Object2_3      Object3_5  ...     1176.494651       12.869827                  c
12      Object1_5      Object2_1     Object3_10  ...     1101.036149       10.978762                  b
13      Object1_5      Object2_6      Object3_7  ...     1430.060873       13.473017                  c
14      Object1_1      Object2_1      Object3_7  ...     1416.556459       12.281628                  c
15      Object1_3      Object2_8      Object3_3  ...     1190.239080       15.257389                  b
16      Object1_6      Object2_9      Object3_5  ...     1101.712808       10.551654                  b
17      Object1_1     Object2_10      Object3_4  ...     1453.687960       15.070104                  b
18      Object1_6      Object2_2      Object3_2  ...     1139.413534       11.744450                  b
19      Object1_7      Object2_7      Object3_2  ...     1080.682206       13.962322                  b
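Since the output is random, it will differ on every call. If you need repeatable fake data (e.g. for tests), one option is to draw everything from a single seeded NumPy generator. The following is only a sketch: the name make_fake_data_seeded and the seed parameter are my own additions, and it treats every non-numeric column as an "object" column, which is an assumption that may not suit frames with datetime or boolean columns.

```python
import numpy as np
import pandas as pd


def make_fake_data_seeded(df, N, extra, seed=None):
    # Hypothetical variant: the function name and `seed` are not from the answer above.
    rng = np.random.default_rng(seed)  # one generator drives every draw

    # Non-numeric columns: sample N values with replacement from each column.
    df_obj = df.select_dtypes(exclude="number")
    obj_out = {col: rng.choice(df_obj[col].to_numpy(), size=N) for col in df_obj.columns}

    # Numeric columns: uniform draws between each column's min and median (NaNs ignored).
    num_out = {}
    for col in df.select_dtypes("number").columns:
        values = df[col].to_numpy(dtype=float)
        num_out[col] = rng.uniform(np.nanmin(values), np.nanmedian(values), size=N)

    # Extra column: sample N values with replacement from the supplied list.
    ext_out = {"ExtraObjectColumn": rng.choice(np.asarray(extra, dtype=object), size=N)}

    # Same column order as before: object columns, numeric columns, extra column.
    return pd.DataFrame({**obj_out, **num_out, **ext_out})


# Small stand-in for the DataFrame from the question.
demo = pd.DataFrame({
    "ObjectColumn1": [f"Object1_{i}" for i in range(1, 11)],
    "NumericColumn1": np.arange(1, 11),
    "NumericColumn2": np.linspace(10.0, 20.0, 10),
})

# Two calls with the same seed should produce identical frames.
fake_a = make_fake_data_seeded(demo, 1000, ["a", "b", "c", "d"], seed=0)
fake_b = make_fake_data_seeded(demo, 1000, ["a", "b", "c", "d"], seed=0)
```

Calling it with seed=None falls back to fresh randomness on each call, so it behaves like the unseeded version in that case.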