
Does H2O Driverless AI have built-in support for merging multiple datasets and using the merged dataset for training?


Suppose we have three datasets containing data from a company.

  1. employee.csv : This dataset contains the details of the employees working in the company: employee ID, employee name, the dept id of the department they work in, the country code of their home country, and their annual salary.
  2. dept.csv : This dataset has information about the company's departments: dept id, dept name, and dept specialization.
  3. country.csv : This dataset contains country names, each with its country code and capital city.

Is there a feature in H2O Driverless AI that lets us upload these datasets (without merging them in Python first), merge them on their overlapping columns within the platform, and use the result for training?


Solution

  • Yes. Driverless AI supports data recipes, which can process datasets, including joining them. See the docs for more about data recipes. For example, a recipe along these lines joins the three datasets:

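    # Let's join `employee.csv` (X) to `dept.csv` (Y1) and `country.csv` (Y2).
    # X is the dataset the recipe is applied to (here, employee.csv).
    import datatable as dt  # harmless if `dt` is already available in the recipe

    # Define and read locations of datasets for Y1/Y2
    # (replace these with the actual locations of your uploaded datasets)
    Y_file_name1 = "./tmp/user/location_of_dept.csv.bin"
    Y_file_name2 = "./tmp/user/location_of_country.csv.bin"
    Y1 = dt.fread(Y_file_name1)
    Y2 = dt.fread(Y_file_name2)
    
    # Set key and join Y1.
    # The key column must hold unique values in Y1 and exist under the same name in X.
    key1 = ["dept_id"]
    Y1.key = key1
    X = X[:, :, dt.join(Y1)]  # left join: every row of X is kept
    
    # Set key and join Y2 (same requirements as above)
    key2 = ["country_code"]
    Y2.key = key2
    X = X[:, :, dt.join(Y2)]
    
    # Return the merged dataset
    return X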
    

    See this recipe as an example for joining one dataset to another.
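
    The keyed join used in the recipe behaves like a left lookup: every employee row is kept, and the columns of the keyed table are attached where the key matches (unmatched rows get missing values). Below is a minimal plain-Python sketch of that behaviour, useful for reasoning about what the merged dataset will look like. It does not use datatable or any Driverless AI API; the `left_join` helper, the column names, and the sample rows are all made up for illustration.

```python
def left_join(rows, lookup_rows, key):
    """Left-join lookup_rows onto rows using a single key column (illustrative helper)."""
    # Index the lookup table by key; like a keyed frame, keys must be unique.
    index = {}
    for r in lookup_rows:
        if r[key] in index:
            raise ValueError(f"duplicate key {r[key]!r} in lookup table")
        index[r[key]] = r

    # Columns contributed by the lookup table (everything except the key).
    extra_cols = [c for c in lookup_rows[0] if c != key] if lookup_rows else []

    joined = []
    for r in rows:
        match = index.get(r[key])
        out = dict(r)  # keep every column of the left-hand row
        for c in extra_cols:
            out[c] = match[c] if match is not None else None
        joined.append(out)
    return joined


# Hypothetical sample data standing in for employee.csv, dept.csv, country.csv
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10, "country_code": "PT"},
    {"emp_id": 2, "name": "Raj", "dept_id": 20, "country_code": "IN"},
    {"emp_id": 3, "name": "Kim", "dept_id": 30, "country_code": "KR"},
]
depts = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "R&D"},
]
countries = [
    {"country_code": "PT", "capital": "Lisbon"},
    {"country_code": "IN", "capital": "New Delhi"},
    {"country_code": "KR", "capital": "Seoul"},
]

# Join departments first, then countries, mirroring the two joins in the recipe.
merged = left_join(left_join(employees, depts, "dept_id"), countries, "country_code")
for row in merged:
    print(row)
```

    In this sample, the employee with `dept_id` 30 has no matching department, so that row is kept with a missing `dept_name`. If the training data should not contain such rows, filter them out before returning the dataset.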