pandas tensorflow vocabulary

TensorFlow: How do I feed data into a vocabulary feature column?


I'm currently working on a text classification problem, and my main question is the following:

Am I correct in assuming that I can pass a complete sentence as a single string to the vocabulary column, or do I need to split the sentence into its words, i.e. a list of strings?

My data looks something like this:

    A    B    text
1   ..   ..   My first example text
2   ..   ..   My second example text

(Besides the text input feature there are some other categorical features, but they are not relevant in this context.)

And my code looks basically like this:

import math
import tensorflow as tf

# data import and data preparation

categorical_voc = tf.feature_column.categorical_column_with_vocabulary_list(key="text", vocabulary_list=vocabulary_list)

embedding_initializer = tf.random_uniform_initializer(-1.0, 1.0)

embed_column_dim = math.ceil(len(vocabulary_list) ** 0.25)
embed_column = tf.feature_column.embedding_column(
    categorical_column=categorical_voc,
    dimension=embed_column_dim,
    initializer=embedding_initializer,
    trainable=True)

estimator = tf.estimator.DNNClassifier(
    optimizer=optimizer,
    feature_columns=feature_columns,
    hidden_units=hidden_units,
    activation_fn=activation_fn,
    dropout=dropout,
    n_classes=target_size,
    label_vocabulary=target_list,
    config=config)

train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=train_data,
    y=train_target,
    batch_size=batch_size,
    num_epochs=1,
    shuffle=True)

estimator.train(input_fn=train_input_fn)

Thanks for your help :)

Edit 1: For those who need it, here is the custom input function.

def input_fn(features, labels, batch_size):
    if labels is None:
        dataset = tf.data.Dataset.from_tensor_slices(features)
    else:
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(100).repeat().batch(batch_size)
    return dataset

def train_input_fn():
    return input_fn(features=_train_data,
                    labels=_train_target,
                    batch_size=train_batch_size)

estimator.train(input_fn=train_input_fn, steps=total_training_steps, hooks=train_hooks)

Solution

  • For those who have the same problem figuring out how to handle a sentence within a vocabulary column:

    My conclusion so far is that I have to feed the vocabulary column an array of strings (one string per word), not the whole sentence as a single string. The only issue is that pandas_input_fn() does not support a Series of lists. That's why I went back to my custom input function!
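
    A minimal sketch of the splitting step, in plain Python. The helper name split_and_pad, the max_len value, and the example sentences are illustrative assumptions, not part of the original code. Padding with the empty string is also an assumption: it relies on string categorical columns treating '' as a missing value for dense inputs, so please verify this against your TensorFlow version.

```python
def split_and_pad(sentences, max_len, pad_token=""):
    """Split each sentence on whitespace and return rows of exactly max_len tokens.

    Hypothetical helper: long sentences are truncated, short ones are padded
    with pad_token so that all rows have the same length.
    """
    rows = []
    for sentence in sentences:
        tokens = sentence.split()[:max_len]               # truncate long sentences
        tokens += [pad_token] * (max_len - len(tokens))   # pad short sentences
        rows.append(tokens)
    return rows


# Example sentences (made up for illustration).
texts = ["My first example text", "My second example"]
token_rows = split_and_pad(texts, max_len=4)
print(token_rows)
```

    The rows are padded to equal length because tf.data.Dataset.from_tensor_slices needs rectangular input. The resulting list could then replace the raw text column in the features dict passed to the custom input_fn, for example {"text": token_rows} alongside the other features.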