I have two questions about the TensorFlow PTB RNN tutorial code ptb_word_lm.py. The code blocks below are taken from that file.
(1) Is it okay to reset the state for every batch?
self._initial_state = cell.zero_state(batch_size, data_type())
with tf.device("/cpu:0"):
    embedding = tf.get_variable(
        "embedding", [vocab_size, size], dtype=data_type())
    inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

if is_training and config.keep_prob < 1:
    inputs = tf.nn.dropout(inputs, config.keep_prob)
outputs = []
state = self._initial_state
with tf.variable_scope("RNN"):
    for time_step in range(num_steps):
        if time_step > 0: tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)
In line 133, we set the initial state to zero. Then, in line 153, we use that zero state as the starting state of the RNN steps. This means the starting state of every batch is reset to zero. I believe that if we want to apply BPTT (backpropagation through time) across batch boundaries, we should feed in an external (non-zero) state, i.e., the state from where the previous batch's data finished, like a stateful RNN in Keras.
I found that resetting the starting state to zero works in practice. But is there any theoretical background (or paper) explaining why this works?
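For reference, here is a minimal sketch of the stateful pattern I have in mind: feeding one batch's final state in as the next batch's initial state. The names model.initial_state, model.final_state, model.input_data, model.targets, and model.cost are my assumptions for illustration, not necessarily the tutorial's exact attributes, and this assumes consecutive batches are consecutive slices of the same long text.

# Sketch only: carry the LSTM state across batches instead of zeroing it.
# All `model.*` names are assumed for illustration.
state = session.run(model.initial_state)  # zeros, computed once
for x, y in batch_iterator:
    feed_dict = {model.input_data: x, model.targets: y}
    # Feed the previous batch's final state as this batch's initial state.
    for i, (c, h) in enumerate(model.initial_state):
        feed_dict[c] = state[i].c
        feed_dict[h] = state[i].h
    cost, state = session.run([model.cost, model.final_state], feed_dict)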
(2) Is it okay to measure test perplexity like this?
eval_config = get_config()
eval_config.batch_size = 1
eval_config.num_steps = 1
Related to the previous question: the model fixes the initial state to zero for every batch. However, in lines 337-338, we set the batch size to 1 and the number of steps to 1 for the test configuration. Then, for the test data, we would feed in a single word at a time and predict the next one without context(!), because the state would be zero for every batch (which contains only one timestep).
Is this a correct measure for the test data? Do other language-model papers measure test perplexity by predicting the next word without context?
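For concreteness, test perplexity is the exponential of the average per-word negative log-probability, so the question is really about what context each prediction is conditioned on. A minimal sketch, assuming log_probs is an array holding the model's log p(next word | context) for each test position:

import numpy as np

def perplexity(log_probs):
    # Perplexity = exp(mean negative log-probability per word).
    # `log_probs` is an assumed array of log p(w_t | context) values.
    return float(np.exp(-np.mean(log_probs)))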
I ran this code and got results similar to what the code's comments and the original paper report. If this code is wrong, which I hope not, do you have any idea how to replicate the paper's results? Maybe I can make a pull request once I fix the problems.
Re (1), the code does (cell_output, state) = cell(inputs[:, time_step, :], state). This assigns the state for the next time step to be the output state of this time step.
When you start a new batch, you should do so independently of the computation you have done so far (note the distinction between batches, which are completely different examples, and time steps within the same sequence).
Re (2), most of the time context is used.
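Concretely, even with batch_size = 1 and num_steps = 1, an evaluation loop can keep context by threading the final state of each session.run call into the next one instead of re-zeroing it. A sketch, using the same assumed model.* names as above, and taking model.cost to be the negative log-likelihood of the step's target:

import numpy as np

# Sketch: one word in per step, but context is preserved because each
# step starts from the previous step's final state.
state = session.run(model.initial_state)
total_cost, total_words = 0.0, 0
for x, y in test_iterator:
    feed_dict = {model.input_data: x, model.targets: y}
    for i, (c, h) in enumerate(model.initial_state):
        feed_dict[c] = state[i].c
        feed_dict[h] = state[i].h
    cost, state = session.run([model.cost, model.final_state], feed_dict)
    total_cost += cost
    total_words += 1
test_perplexity = np.exp(total_cost / total_words)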