python primitive featuretools

How to implement custom naming for multi-output primitives in FeatureTools


As of v0.12.0, FeatureTools allows you to assign custom names to the outputs of multi-output primitives: https://github.com/alteryx/featuretools/pull/794. By default, when you define a custom multi-output primitive, the column names of the generated features are suffixed with [0], [1], [2], etc. Say I have the following code to define a multi-output primitive:

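import numpy as np
from featuretools import variable_types as vtypes
from featuretools.primitives import make_trans_primitive

def sine_and_cosine_datestamp(column):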
    """
    Returns the Sin and Cos of the hour of datestamp
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    
    ret = [sine_hour, cosine_hour]
    return ret

Sine_Cosine_Datestamp = make_trans_primitive(function = sine_and_cosine_datestamp,
                                             input_types = [vtypes.Datetime],
                                             return_type = vtypes.Numeric,
                                             number_output_features = 2)

In the dataframe generated from DFS, the names of the two generated columns will be SINE_AND_COSINE_DATESTAMP(datestamp)[0] and SINE_AND_COSINE_DATESTAMP(datestamp)[1]. In actuality, I would have liked the names of the columns to reflect the operations being taken on the column. So I would have liked the column names to be something like SINE_AND_COSINE_DATESTAMP(datestamp)[sine] and SINE_AND_COSINE_DATESTAMP(datestamp)[cosine]. Apparently you have to use the generate_names method in order to do so. I could not find anything online to help me use this method and I kept running into errors. For example, when I tried the following code:

def sine_and_cosine_datestamp(column, string = ['sine, cosine']):
    """
    Returns the Sin and Cos of the hour of the datestamp
    """
    sine_hour = np.sin(column.dt.hour)
    cosine_hour = np.cos(column.dt.hour)
    
    ret = [sine_hour, cosine_hour]
    return ret

def sine_and_cosine_generate_names(self, base_feature_names):
    return u'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])

Sine_Cosine_Datestamp = make_trans_primitive(function = sine_and_cosine_datestamp,
                                             input_types = [vtypes.Datetime],
                                             return_type = vtypes.Numeric,
                                             number_output_features = 2,
                                             description = "For each value in the base feature"
                                             "outputs the sine and cosine of the hour, day, and month.",
                                             cls_attributes = {'generate_names': sine_and_cosine_generate_names})

I got an assertion error. Even more perplexing, when I looked at the transform_primitive_base.py file in the featuretools/primitives/base folder, I saw that the generate_names function looks like this:

    def generate_names(self, base_feature_names):
        n = self.number_output_features
        base_name = self.generate_name(base_feature_names)
        return [base_name + "[%s]" % i for i in range(n)]

In the function above, it looks like there is no way that you can generate custom primitive names since it uses the base_feature_names and the number of output features by default. Any help would be appreciated.


Solution

  • Thanks for the question! This feature hasn't been documented well.

    The main issue with your code was that sine_and_cosine_generate_names should return a list of strings, one for each output column.
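
To see the mismatch concretely, the naming function from the question can be called on its own with a stand-in for self. This is only a sketch: fake_primitive is a hypothetical placeholder built with types.SimpleNamespace, not a real FeatureTools object, and its kwargs value simply mirrors the default argument from the question.

```python
from types import SimpleNamespace


def original_generate_names(self, base_feature_names):
    # Naming function from the question, unchanged apart from its name
    return u'STRING_COUNT(%s, "%s")' % (base_feature_names[0], self.kwargs['string'])


# Hypothetical stand-in for the primitive instance (assumption: only .kwargs is read)
fake_primitive = SimpleNamespace(kwargs={'string': ['sine, cosine']})

result = original_generate_names(fake_primitive, ['datestamp'])

# The function hands back one str, while a primitive with
# number_output_features = 2 needs a list holding two names.
print(type(result).__name__)
```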

    It looks like you were adapting the StringCount example from the docs -- I think for this primitive it would be less error-prone to always use "sine" and "cosine" for the custom names, and remove the optional string argument from sine_and_cosine_datestamp. I also updated the feature name text to match your desired text.

    After these changes:

    def sine_and_cosine_datestamp(column):
        """
        Returns the Sin and Cos of the hour of the datestamp
        """
        sine_hour = np.sin(column.dt.hour)
        cosine_hour = np.cos(column.dt.hour)
        
        ret = [sine_hour, cosine_hour]
        return ret
    
    def sine_and_cosine_generate_names(self, base_feature_names):
        template = 'SINE_AND_COSINE_DATESTAMP(%s)[%s]'
        return [template % (base_feature_names[0], string) for string in ['sine', 'cosine']]
    

    This created feature column names like SINE_AND_COSINE_DATESTAMP(order_date)[sine]. No changes were necessary to the actual make_trans_primitive call.
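
A quick way to check the naming function without running DFS is to call it directly. The sketch below relies on the fact that this particular function never touches self, so None is passed in its place; datestamp is just an example column name.

```python
def sine_and_cosine_generate_names(self, base_feature_names):
    template = 'SINE_AND_COSINE_DATESTAMP(%s)[%s]'
    return [template % (base_feature_names[0], string) for string in ['sine', 'cosine']]


# self is unused by this function, so None is enough for a standalone check
names = sine_and_cosine_generate_names(None, ['datestamp'])
for name in names:
    print(name)
# SINE_AND_COSINE_DATESTAMP(datestamp)[sine]
# SINE_AND_COSINE_DATESTAMP(datestamp)[cosine]
```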

    "In the function above, it looks like there is no way that you can generate custom primitive names since it uses the base_feature_names and the number of output features by default."

    That is the default generate_names function for transform primitives. Since we are assigning the custom generate_names function to Sine_Cosine_Datestamp through cls_attributes, the default will not be used.
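
The override can be illustrated with plain Python, independent of FeatureTools. In this sketch BasePrimitive and make_primitive are hypothetical stand-ins (the real make_trans_primitive does considerably more); the only assumption is that entries in cls_attributes end up as attributes on the generated class, which is how they shadow the inherited default.

```python
class BasePrimitive:
    # Hypothetical stand-in for a transform primitive base class
    name = 'sine_and_cosine_datestamp'
    number_output_features = 2

    def generate_name(self, base_feature_names):
        return '%s(%s)' % (self.name.upper(), ', '.join(base_feature_names))

    def generate_names(self, base_feature_names):
        # Same logic as the default quoted in the question
        n = self.number_output_features
        base_name = self.generate_name(base_feature_names)
        return [base_name + "[%s]" % i for i in range(n)]


def make_primitive(cls_attributes=None):
    # Hypothetical factory: builds a subclass whose attributes come from cls_attributes
    return type('CustomPrimitive', (BasePrimitive,), dict(cls_attributes or {}))


def custom_generate_names(self, base_feature_names):
    template = 'SINE_AND_COSINE_DATESTAMP(%s)[%s]'
    return [template % (base_feature_names[0], s) for s in ['sine', 'cosine']]


# Without an override the inherited default produces [0], [1] suffixes
default_names = make_primitive()().generate_names(['datestamp'])

# With the override the class attribute shadows the inherited method
custom_names = make_primitive(
    {'generate_names': custom_generate_names}
)().generate_names(['datestamp'])

print(default_names)
print(custom_names)
```

Because attribute lookup finds the subclass attribute first, the base class method is never reached once a replacement is supplied.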

    Hope that helps, let me know if you still have questions!