Tags: python, scipy, feature-selection, multivalue-database

Dealing with datasets with repeated multivalued features


We have a dataset in sparse representation with 25 features and 1 binary label. For example, one line of the dataset is:

Label: 0
exid: 24924687
Features:
11:0 12:1 13:0 14:6 15:0 17:2 17:2 17:2 17:2 17:2 17:2
21:11 21:42 21:42 21:42 21:42 21:42 
22:35 22:76 22:27 22:28 22:25 22:15 24:1888
25:9 33:322 33:452 33:452 33:452 33:452 33:452 35:14

So, sometimes features have multiple values, which can be the same or different, and the website says:

Some categorical features are multi-valued (order does not matter)

We don't know the semantics of the features or of the values assigned to them (because of privacy concerns they are hidden from the public).

We only know:

Any comments on the following problems are appreciated:

  1. What's the best way to import this kind of dataset into a Python data structure?
  2. How should we deal with multi-valued features, especially when the same value is repeated k times?

Solution

  • That is a very general question, but as far as I can tell, if you want to use some ML methods it is sensible to transform the data into a tidy data format first.

    As far as I can tell from the documentation that @RootTwo nicely references in his comment, you are actually dealing with two datasets: one example flat table and one product flat table. (You can later join the two to get one table if so desired.)

    Let us first create some parsers that decode the different lines into a somewhat informative data structure:

    For lines with examples we may use:

    def process_example(example_line):
        # example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1}
        #    0        1         2           3               4          5            6               7 ...
        feature_names = ['ex_id', 'hash', 'clicked', 'propensity', 'slots', 'candidates'] + \
                        ['display_feature_' + str(i) for i in range(1, 11)]
        are_numbers = [1, 3, 4, 5, 6]
        parts = example_line.split(' ')
        parts[1] = parts[1].replace(':', '')
        for i in are_numbers:
            parts[i] = float(parts[i])
            if parts[i].is_integer():
                parts[i] = int(parts[i])
        features = [int(ft.split(':')[1]) for ft in parts[7:]]
        return dict(zip(feature_names, parts[1:7] + features))
    

    This method is hacky but gets the job done: parse the fields and cast them to numbers where possible. The output looks like this:

    {'ex_id': 20184824,
     'hash': '57548fae76b0aa2f2e0d96c40ac6ae3057548faee00912d106fc65fc1fa92d68',
     'clicked': 0,
     'propensity': 1.416489e-07,
     'slots': 6,
     'candidates': 30,
     'display_feature_1': 728,
     'display_feature_2': 90,
     'display_feature_3': 1,
     'display_feature_4': 10,
     'display_feature_5': 16,
     'display_feature_6': 1,
     'display_feature_7': 26,
     'display_feature_8': 11,
     'display_feature_9': 597,
     'display_feature_10': 7}
    
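    For reference, a raw input line matching the format in the comment that would produce this output looks roughly like the following (reconstructed by hand from the parser, so treat it as illustrative only):

    raw = ('example 20184824: '
           '57548fae76b0aa2f2e0d96c40ac6ae3057548faee00912d106fc65fc1fa92d68 '
           '0 1.416489e-07 6 30 '
           '1:728 2:90 3:1 4:10 5:16 6:1 7:26 8:11 9:597 10:7')
    process_example(raw)  # returns the dictionary shown above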

    Next are the product lines. As you mentioned, the problem is the multiple occurrences of values. I think it is sensible to aggregate unique feature-value pairs by their frequency. No information is lost, and it helps us encode a tidy sample. That should address your second question.

    import toolz  # pip install toolz
    
    def process_product(product_line):
        # ${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
        parts = product_line.split(' ')
        meta = {'label': int(parts[0]),
                'ex_id': int(parts[1].split(':')[1])}
        # extract features of the form ${productFeat1_1}:${v1_1} into (name, value) pairs
        features = [('product_feature_' + str(i), int(v))
                    for i, v in map(lambda x: x.split(':'), parts[2:])]
        # count each unique value and transform them into
        # feature_name X feature_value X feature_frequency
        products = [dict(zip(['feature', 'value', 'frequency'], (*k, v)))
                    for k, v in toolz.countby(toolz.identity, features).items()]
        # now merge the meta information into each product
        return [dict(p, **meta) for p in products]
    

    That basically extracts the label and the features for each product (the output shown is for line 40 of the sample file):

    [{'feature': 'product_feature_11',
      'value': 0,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_12',
      'value': 1,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_13',
      'value': 0,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_14',
      'value': 2,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_15',
      'value': 0,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_17',
      'value': 2,
      'frequency': 2,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_21',
      'value': 55,
      'frequency': 2,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_22',
      'value': 14,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_22',
      'value': 54,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_24',
      'value': 3039,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_25',
      'value': 721,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_33',
      'value': 386,
      'frequency': 2,
      'label': 0,
      'ex_id': 19168103},
     {'feature': 'product_feature_35',
      'value': 963,
      'frequency': 1,
      'label': 0,
      'ex_id': 19168103}]
    

    So when you process your stream line by line, you can decide whether to map an example or a product:

    def process_stream(stream):
        for content in stream:
            if 'example' in content:
                yield process_example(content)
            else:
                yield process_product(content)
    

    I've decided to use a generator here because it benefits processing the data in a functional way if you decide not to use pandas. Otherwise a list comprehension will be your friend, as sketched below.
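
    A minimal sketch of that eager alternative, assuming stream is an iterable of the raw lines from above:

    # materialize all processed lines at once instead of yielding them lazily
    processed = [process_example(line) if 'example' in line else process_product(line)
                 for line in stream]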

    Now for the fun part: we read the lines from a given (example) URL one by one and assign them to their corresponding dataset (example or product). I will use reduce here, because it is fun :-). I won't go into detail about what map/reduce actually does (that's up to you); you can always use a simple for loop instead, as shown after the snippet.

    import urllib.request
    import toolz  # pip install toolz
    
    lines_stream = (line.decode("utf-8").strip() 
                    for line in urllib.request.urlopen('http://www.cs.cornell.edu/~adith/Criteo/sample.txt'))
    
    # if you prefer a concise but hacky approach you could do:
    # blubb = list(toolz.partitionby(lambda x: 'hash' in x, process_stream(lines_stream)))
    # examples_only = blubb[slice(0, len(blubb), 2)]
    # products_only = blubb[slice(1, len(blubb), 2)]
    
    # but to introduce a more functional approach, let's implement a reducer
    def dataset_reducer(datasets, content):
        which_one = 0 if 'hash' in content else 1
        datasets[which_one].append(content)
        return datasets
    
    # and process the stream using the reducer, which results in two datasets:
    examples_dataset, product_dataset = toolz.reduce(dataset_reducer, process_stream(lines_stream), [[], []])
    
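    The same split with a plain for loop instead of the reducer (equivalent, just more explicit):

    examples_dataset, product_dataset = [], []
    for content in process_stream(lines_stream):
        # processed example lines are dicts that carry a 'hash' key
        if 'hash' in content:
            examples_dataset.append(content)
        else:
            product_dataset.append(content)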

    From here you can cast your datasets into tidy dataframes that you can use for machine learning. Beware of NaN/missing values, distributions, etc. You can join the two datasets with merge to get one big flat table of samples × features; then you will more or less be able to use the different methods from e.g. scikit-learn.

    import pandas
    
    examples_dataset = pandas.DataFrame(examples_dataset)
    product_dataset = pandas.concat(pandas.DataFrame(p) for p in product_dataset)
    
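    The merge mentioned above could then look like this (a sketch; a left join on the shared ex_id column is one reasonable choice, not the only one):

    # attach the example meta information to every product row
    merged = product_dataset.merge(examples_dataset, on='ex_id', how='left')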

    Examples dataset

       candidates  clicked  ...    propensity  slots
    0          30        0  ...  1.416489e-07      6
    1          23        0  ...  5.344958e-01      3
    2          23        1  ...  1.774762e-04      3
    3          28        0  ...  1.158855e-04      6
    

    Product dataset (product_dataset.sample(10))

           ex_id             feature  frequency  label  value
    6   10244535  product_feature_21          1      0     10
    9   37375474  product_feature_25          1      0      4
    6   44432959  product_feature_25          1      0    263
    15  62131356  product_feature_35          1      0     14
    8   50383824  product_feature_24          1      0    228
    8   63624159  product_feature_20          1      0     30
    3   99375433  product_feature_14          1      0      0
    9    3389658  product_feature_25          1      0     43
    20  59461725  product_feature_31          8      0      4
    11  17247719  product_feature_21          3      0      5
    

    Be mindful of the product_dataset: the features currently live in rows, and you can 'pivot' them into columns (see the pandas reshaping docs).
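
    A sketch of such a pivot; note that aggfunc='first' is only a placeholder, because features like product_feature_22 above can carry several distinct values per example and need a deliberate aggregation strategy:

    # one row per example, one column per product feature
    wide = product_dataset.pivot_table(index='ex_id',
                                       columns='feature',
                                       values='value',
                                       aggfunc='first')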