Suppose we have three datasets containing data from a company.
Is there a feature in H2O Driverless AI that lets us upload these datasets (without first merging them in Python), merge them on their overlapping columns inside the Driverless AI platform, and use the result for training?
Yes, you can use a data recipe to process datasets, including joining them. See the docs for more about data recipes. For example, the body of a recipe that joins two lookup datasets onto a main dataset could look like this:
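# Let's join `employee.csv` (X) to `dept.csv` (Y1) and `country.csv` (Y2).
# X is assumed to be the frame that Driverless AI passes to the recipe.
import datatable as dt

# Define the locations of the Y1/Y2 datasets and read them
Y_file_name1 = "./tmp/user/location_of_dept.csv.bin"
Y_file_name2 = "./tmp/user/location_of_country.csv.bin"
Y1 = dt.fread(Y_file_name1)
Y2 = dt.fread(Y_file_name2)

# Set the key on Y1 and join it to X.
# datatable performs a left outer join: every row of X is kept.
# The key values must be unique in Y1, and X needs a column named dept_id.
key1 = ["dept_id"]
Y1.key = key1
X = X[:, :, dt.join(Y1)]

# Set the key on Y2 and join it to X (same rules, on country_code)
key2 = ["country_code"]
Y2.key = key2
X = X[:, :, dt.join(Y2)]

# Return the joined frame to Driverless AI
return X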
See this recipe as an example for joining one dataset to another.
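To make the join behaviour concrete, here is a minimal plain-Python sketch of what a keyed left join does to the rows. It does not use datatable or Driverless AI; `left_join` is a hypothetical helper and the sample rows are made up for illustration. It only mirrors the two rules that matter: key values in the lookup table must be unique, and rows of the main table without a match are kept with missing values.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def left_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Hypothetical helper: left-join `right` onto `left` by a unique `key` column."""
    # Index the lookup table by key and reject duplicate key values,
    # mirroring the uniqueness requirement of a keyed frame.
    index: Dict[Any, Row] = {}
    for row in right:
        if row[key] in index:
            raise ValueError(f"duplicate key value: {row[key]!r}")
        index[row[key]] = row

    # Columns contributed by the lookup table (everything except the key).
    extra_cols = [c for c in (right[0] if right else {}) if c != key]

    joined: List[Row] = []
    for row in left:
        match = index.get(row[key])
        out = dict(row)
        for col in extra_cols:
            # Unmatched rows are kept and filled with None.
            out[col] = match[col] if match is not None else None
        joined.append(out)
    return joined


# Made-up sample data standing in for employee.csv, dept.csv and country.csv.
employees = [
    {"name": "Ann", "dept_id": 1, "country_code": "US"},
    {"name": "Bob", "dept_id": 2, "country_code": "DE"},
    {"name": "Cy", "dept_id": 9, "country_code": "US"},  # no matching dept
]
depts = [
    {"dept_id": 1, "dept_name": "Sales"},
    {"dept_id": 2, "dept_name": "R&D"},
]
countries = [
    {"country_code": "US", "country": "United States"},
    {"country_code": "DE", "country": "Germany"},
]

# Chain the two joins, as the recipe does with Y1 and then Y2.
result = left_join(left_join(employees, depts, "dept_id"), countries, "country_code")
for row in result:
    print(row)
```

If the lookup dataset has repeated key values, deduplicate or aggregate it before setting the key; otherwise the join cannot pick a single matching row.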