I am attempting to replicate the character-level language modeling demonstrated in the excellent article http://karpathy.github.io/2015/05/21/rnn-effectiveness/ using TensorFlow.
So far my attempts have failed. After training on 800 or so characters, my network typically collapses to outputting a single character over and over. I believe I have fundamentally misunderstood the way TensorFlow implements LSTMs, and perhaps RNNs in general, and I am finding the documentation difficult to follow.
Here is the essence of my code:
Graph definition
idata = tf.placeholder(tf.int32,[None,1]) # input byte; value 256 marks start/end of file
odata = tf.placeholder(tf.int32,[None,1]) # target output byte, i.e. the next byte in the sequence
source = tf.to_float(tf.one_hot(idata,257)) # input byte as one-hot float
target = tf.to_float(tf.one_hot(odata,257)) # target output as one-hot float
with tf.variable_scope("lstm01"):
    cell1 = tf.nn.rnn_cell.BasicLSTMCell(257)
    val1, state1 = tf.nn.dynamic_rnn(cell1, source, dtype=tf.float32)
output = val1
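For reference, here is my understanding of the shapes involved, which may be exactly where I have gone wrong: with idata declared as [None,1], the one-hot tensors have shape [batch, 1, 257], so as far as I can tell dynamic_rnn is only ever unrolled over a time dimension of length 1.

print(source.get_shape())  # gives (?, 1, 257) as far as I can tell: batch, time, depth
print(target.get_shape())  # likewise (?, 1, 257)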
Loss calculation
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(output, target))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
output_am = tf.argmax(output,2)
target_am = tf.argmax(target,2)
correct_prediction = tf.equal(output_am, target_am)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Training
for i in range(0, source_data.size-1, batch_size):
    start = i
    stop = i + batch_size
    i_data = source_data[start:stop].reshape([-1,1])
    o_data = source_data[start+1:stop+1].reshape([-1,1])
    train_step.run(feed_dict={idata: i_data, odata: o_data})
    if i%(report_interval*batch_size) == 0:
        batch_out, fa = sess.run([output_am, accuracy], feed_dict={idata: i_data, odata: o_data, keep_prob: 1.0})
        print("step %d, training accuracy %s"%(i, str(fa)))
        print("i_data sample: %s"%str(squeeze(i_data)))
        print("o_data sample: %s"%str(squeeze(o_data)))
        print("batch sample: %s"%str(squeeze(batch_out)))
Output, using the 1MB Shakespeare file to train
step 0, training accuracy 0.0
i_data sample: [ 256. 70. 105. 114. 115. 116. 32. 67. 105. 116.]
o_data sample: [ 70. 105. 114. 115. 116. 32. 67. 105. 116. 105.]
batch sample: [254 18 151 64 51 199 83 174 151 199]
step 400, training accuracy 0.2
i_data sample: [ 32. 98. 101. 32. 100. 111. 110. 101. 58. 32.]
o_data sample: [ 98. 101. 32. 100. 111. 110. 101. 58. 32. 97.]
batch sample: [ 32 101 32 32 32 32 10 32 101 32]
step 800, training accuracy 0.0
i_data sample: [ 112. 97. 114. 116. 105. 99. 117. 108. 97. 114.]
o_data sample: [ 97. 114. 116. 105. 99. 117. 108. 97. 114. 105.]
batch sample: [101 101 101 32 101 101 32 101 101 101]
step 1200, training accuracy 0.1
i_data sample: [ 63. 10. 10. 70. 105. 114. 115. 116. 32. 67.]
o_data sample: [ 10. 10. 70. 105. 114. 115. 116. 32. 67. 105.]
batch sample: [ 32 32 32 101 32 32 32 32 32 32]
step 1600, training accuracy 0.2
i_data sample: [ 32. 116. 105. 108. 108. 32. 116. 104. 101. 32.]
o_data sample: [ 116. 105. 108. 108. 32. 116. 104. 101. 32. 97.]
batch sample: [32 32 32 32 32 32 32 32 32 32]
This is clearly incorrect.
I think I am getting confused by the difference between 'batches' and 'sequences', and by whether or not the state of the LSTM is preserved between what I call 'batches' (i.e. sub-sequences).
I'm getting the impression that I have effectively trained it on 'batches' of sequences of length 1, and that the LSTM state is discarded between batches. Consequently, the network is simply learning to output the most commonly occurring symbol.
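If that is what is happening, I assume the kind of change I need is roughly the following (an untested sketch; seq_len and the state handling are pure guesses on my part): feed dynamic_rnn real sequences rather than single characters, and carry the final state of one chunk forward into the next.

seq_len = 50                                    # my guess at a sensible unroll length
idata = tf.placeholder(tf.int32, [1, seq_len])  # one sub-sequence of seq_len characters
odata = tf.placeholder(tf.int32, [1, seq_len])  # the same sub-sequence shifted by one
source = tf.to_float(tf.one_hot(idata, 257))    # now [1, seq_len, 257], so the time dimension is > 1
target = tf.to_float(tf.one_hot(odata, 257))
with tf.variable_scope("lstm01"):
    cell1 = tf.nn.rnn_cell.BasicLSTMCell(257)
    init_state = cell1.zero_state(1, tf.float32)
    val1, state1 = tf.nn.dynamic_rnn(cell1, source, initial_state=init_state)
# ...and then in the training loop, evaluate state1 and feed it back in place of
# init_state for the next sub-sequence - though I am not sure this is the idiomatic
# way to preserve LSTM state between 'batches' in TensorFlow.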
Can anyone confirm this, or otherwise correct my mistake, and give some indication of how I should go about character-by-character prediction using very long training sequences?
Many thanks.