kacey@ieee.org

Language Modeling with RNNs

  • Objective: Understand Recurrent Neural Networks by implementing a vanilla RNN and an LSTM to train a model that can generate text
  • See lib/layer_utils.py for the definitions of the different layer type classes (RNN and LSTM)
  • See lib/rnn.py for the implementation of the text generation model.
In [1]:
from lib.rnn import *
from lib.layer_utils import *
from lib.grad_check import *
from lib.optim import *
from lib.train import *
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

Vanilla RNN

Vanilla RNN: step forward

Testing the forward pass for a single timestep of a basic RNN cell defined in lib/layer_utils.py
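
The update being tested is the standard tanh recurrence $h_t = \tanh(x_t W_x + h_{t-1} W_h + b)$. For reference, a minimal NumPy sketch of this single step (an illustration of the math, not necessarily the exact code in lib/layer_utils.py):

import numpy as np

def rnn_step_forward_sketch(x, prev_h, Wx, Wh, b):
    # One vanilla RNN timestep: next_h = tanh(x @ Wx + prev_h @ Wh + b)
    # x: (N, D), prev_h: (N, H), Wx: (D, H), Wh: (H, H), b: (H,)
    a = x.dot(Wx) + prev_h.dot(Wh) + b   # pre-activation, shape (N, H)
    next_h = np.tanh(a)
    cache = (x, prev_h, Wx, Wh, next_h)  # saved for the backward pass
    return next_h, cache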

In [2]:
%reload_ext autoreload

N, D, H = 3, 10, 4

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")
x = np.linspace(-0.4, 0.7, num=N*D).reshape(N, D)
prev_h = np.linspace(-0.2, 0.5, num=N*H).reshape(N, H)

rnn.params[rnn.wx_name] = np.linspace(-0.1, 0.9, num=D*H).reshape(D, H)
rnn.params[rnn.wh_name] = np.linspace(-0.3, 0.7, num=H*H).reshape(H, H)
rnn.params[rnn.b_name] = np.linspace(-0.2, 0.4, num=H)

next_h, _ = rnn.step_forward(x, prev_h)
expected_next_h = np.asarray([
  [-0.58172089, -0.50182032, -0.41232771, -0.31410098],
  [ 0.66854692,  0.79562378,  0.87755553,  0.92795967],
  [ 0.97934501,  0.99144213,  0.99646691,  0.99854353]])
print(next_h)

print('next_h error: ', rel_error(expected_next_h, next_h))
self.params[self.wx_name]  (10, 4)
self.params[self.wh_name]:  (4, 4)
self.params[self.b_name]:  (4,)
[[-0.58172089 -0.50182032 -0.41232771 -0.31410098]
 [ 0.66854692  0.79562378  0.87755553  0.92795967]
 [ 0.97934501  0.99144213  0.99646691  0.99854353]]
next_h error:  6.292421426471037e-09

Vanilla RNN: step backward

Checking the gradient calculations of the VanillaRNN class in the file lib/layer_utils.py
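
Differentiating the tanh step gives the gradients being checked here. A minimal sketch that pairs with the forward sketch above (illustrative, assuming the same cache layout):

import numpy as np

def rnn_step_backward_sketch(dnext_h, cache):
    # Backprop through next_h = tanh(x @ Wx + prev_h @ Wh + b)
    x, prev_h, Wx, Wh, next_h = cache
    da = dnext_h * (1 - next_h ** 2)   # d tanh(a)/da = 1 - tanh(a)^2
    dx = da.dot(Wx.T)                  # (N, D)
    dprev_h = da.dot(Wh.T)             # (N, H)
    dWx = x.T.dot(da)                  # (D, H)
    dWh = prev_h.T.dot(da)             # (H, H)
    db = da.sum(axis=0)                # (H,)
    return dx, dprev_h, dWx, dWh, db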

In [3]:
%reload_ext autoreload

np.random.seed(231)
N, D, H = 4, 5, 6

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")

x = np.random.randn(N, D)
h = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

rnn.params[rnn.wx_name] = Wx
rnn.params[rnn.wh_name] = Wh
rnn.params[rnn.b_name] = b

out, meta = rnn.step_forward(x, h)

dnext_h = np.random.randn(*out.shape)

dx_num = eval_numerical_gradient_array(lambda x: rnn.step_forward(x, h)[0], x, dnext_h)
dprev_h_num = eval_numerical_gradient_array(lambda h: rnn.step_forward(x, h)[0], h, dnext_h)
dWx_num = eval_numerical_gradient_array(lambda Wx: rnn.step_forward(x, h)[0], Wx, dnext_h)
dWh_num = eval_numerical_gradient_array(lambda Wh: rnn.step_forward(x, h)[0], Wh, dnext_h)
db_num = eval_numerical_gradient_array(lambda b: rnn.step_forward(x, h)[0], b, dnext_h)

dx, dprev_h, dWx, dWh, db = rnn.step_backward(dnext_h, meta)

print('dx error: ', rel_error(dx_num, dx))
print('dprev_h error: ', rel_error(dprev_h_num, dprev_h))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))
self.params[self.wx_name]  (5, 6)
self.params[self.wh_name]:  (6, 6)
self.params[self.b_name]:  (6,)
dx error:  9.956054783068812e-10
dprev_h error:  2.349821195946612e-10
dWx error:  1.9370107669274995e-10
dWh error:  3.186343591539138e-10
db error:  6.415453358481749e-11

Vanilla RNN: forward

Test VanillaRNN by processing a sequence of data.
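
The full forward pass simply unrolls the single step over T timesteps, feeding each hidden state into the next step. A rough sketch, assuming the same tanh update as above:

import numpy as np

def rnn_forward_sketch(x, h0, Wx, Wh, b):
    # x: (N, T, D), h0: (N, H); returns h: (N, T, H), the hidden state at every step
    N, T, _ = x.shape
    H = h0.shape[1]
    h = np.zeros((N, T, H))
    prev_h = h0
    for t in range(T):
        prev_h = np.tanh(x[:, t, :].dot(Wx) + prev_h.dot(Wh) + b)
        h[:, t, :] = prev_h
    return h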

In [4]:
%reload_ext autoreload

N, T, D, H = 2, 3, 4, 5

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")

x = np.linspace(-0.1, 0.3, num=N*T*D).reshape(N, T, D)
h0 = np.linspace(-0.3, 0.1, num=N*H).reshape(N, H)
Wx = np.linspace(-0.2, 0.4, num=D*H).reshape(D, H)
Wh = np.linspace(-0.4, 0.1, num=H*H).reshape(H, H)
b = np.linspace(-0.7, 0.1, num=H)

rnn.params[rnn.wx_name] = Wx
rnn.params[rnn.wh_name] = Wh
rnn.params[rnn.b_name] = b

h = rnn.forward(x, h0)
expected_h = np.asarray([
  [
    [-0.42070749, -0.27279261, -0.11074945,  0.05740409,  0.22236251],
    [-0.39525808, -0.22554661, -0.0409454,   0.14649412,  0.32397316],
    [-0.42305111, -0.24223728, -0.04287027,  0.15997045,  0.35014525],
  ],
  [
    [-0.55857474, -0.39065825, -0.19198182,  0.02378408,  0.23735671],
    [-0.27150199, -0.07088804,  0.13562939,  0.33099728,  0.50158768],
    [-0.51014825, -0.30524429, -0.06755202,  0.17806392,  0.40333043]]])
print("expected_h: ",expected_h.shape)
print('h error: ', rel_error(expected_h, h))
self.params[self.wx_name]  (4, 5)
self.params[self.wh_name]:  (5, 5)
self.params[self.b_name]:  (5,)
expected_h:  (2, 3, 5)
h error:  7.728466180186066e-08

Vanilla RNN: backward

Test back-propagation over the entire sequence by calling the backward function, which applies step_backward at every timestep
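
Backpropagation through time walks the cached steps in reverse: at each timestep the upstream gradient dout[:, t, :] is added to the hidden-state gradient carried back from step t+1, and the parameter gradients accumulate across all steps. A sketch under the same assumptions as the forward sketch, with per-step caches of (x_t, prev_h_t, Wx, Wh, next_h_t):

import numpy as np

def rnn_backward_sketch(dh, caches):
    # dh: (N, T, H) upstream gradients; caches: one tuple per timestep
    N, T, H = dh.shape
    x0, _, Wx, Wh, _ = caches[0]
    dx = np.zeros((N, T, x0.shape[1]))
    dWx, dWh, db = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros(H)
    dprev_h = np.zeros((N, H))
    for t in reversed(range(T)):
        x_t, prev_h_t, Wx, Wh, next_h_t = caches[t]
        da = (dh[:, t, :] + dprev_h) * (1 - next_h_t ** 2)
        dx[:, t, :] = da.dot(Wx.T)
        dprev_h = da.dot(Wh.T)     # carried back to timestep t-1
        dWx += x_t.T.dot(da)
        dWh += prev_h_t.T.dot(da)
        db += da.sum(axis=0)
    dh0 = dprev_h
    return dx, dh0, dWx, dWh, db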

In [5]:
%reload_ext autoreload

np.random.seed(231)

N, D, T, H = 2, 3, 10, 5

rnn = VanillaRNN(D, H, init_scale=0.02, name="rnn_test")

x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, H)
Wh = np.random.randn(H, H)
b = np.random.randn(H)

rnn.params[rnn.wx_name] = Wx
rnn.params[rnn.wh_name] = Wh
rnn.params[rnn.b_name] = b

out = rnn.forward(x, h0)

dout = np.random.randn(*out.shape)

dx, dh0 = rnn.backward(dout)

dx_num = eval_numerical_gradient_array(lambda x: rnn.forward(x, h0), x, dout)
dh0_num = eval_numerical_gradient_array(lambda h0: rnn.forward(x, h0), h0, dout)
dWx_num = eval_numerical_gradient_array(lambda Wx: rnn.forward(x, h0), Wx, dout)
dWh_num = eval_numerical_gradient_array(lambda Wh: rnn.forward(x, h0), Wh, dout)
db_num = eval_numerical_gradient_array(lambda b: rnn.forward(x, h0), b, dout)

dWx = rnn.grads[rnn.wx_name]
dWh = rnn.grads[rnn.wh_name]
db = rnn.grads[rnn.b_name]

print('dx error: ', rel_error(dx_num, dx))
print('dh0 error: ', rel_error(dh0_num, dh0))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))
print(dh0_num)
print(dh0)
self.params[self.wx_name]  (3, 5)
self.params[self.wh_name]:  (5, 5)
self.params[self.b_name]:  (5,)
dx error:  2.736928435887175e-08
dh0 error:  8.231409890331713e-10
dWx error:  2.0789178600982087e-08
dWh error:  1.5210912998171804e-08
db error:  2.77914406963907e-10
[[-4.28153694 -2.74230889  0.71964976 -1.18508456 -0.8895025 ]
 [ 0.5942948  -0.86422636 -1.14307499 -0.07620721 -1.10608234]]
[[-4.28153694 -2.74230889  0.71964976 -1.18508456 -0.8895025 ]
 [ 0.5942948  -0.86422636 -1.14307499 -0.07620721 -1.10608234]]

Word embedding

Word embedding: forward

Checking implementation of the function forward in the word_embedding class
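
The embedding forward pass is just a table lookup: row x[n, t] of the weight matrix becomes the vector for word (n, t). A one-line sketch:

import numpy as np

def word_embedding_forward_sketch(x, W):
    # x: (N, T) integer word indices in [0, V); W: (V, D) embedding matrix
    # Fancy indexing broadcasts the lookup, giving an output of shape (N, T, D).
    return W[x]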

In [6]:
%reload_ext autoreload

N, T, V, D = 2, 4, 5, 3

we = word_embedding(V, D, name="we")

x = np.asarray([[0, 3, 1, 2], [2, 1, 0, 3]])
W = np.linspace(0, 1, num=V*D).reshape(V, D)

we.params[we.w_name] = W

out = we.forward(x)
print(out)
expected_out = np.asarray([
 [[ 0.,          0.07142857,  0.14285714],
  [ 0.64285714,  0.71428571,  0.78571429],
  [ 0.21428571,  0.28571429,  0.35714286],
  [ 0.42857143,  0.5,         0.57142857]],
 [[ 0.42857143,  0.5,         0.57142857],
  [ 0.21428571,  0.28571429,  0.35714286],
  [ 0.,          0.07142857,  0.14285714],
  [ 0.64285714,  0.71428571,  0.78571429]]])

print('out error: ', rel_error(expected_out, out))
[[[0.         0.07142857 0.14285714]
  [0.64285714 0.71428571 0.78571429]
  [0.21428571 0.28571429 0.35714286]
  [0.42857143 0.5        0.57142857]]

 [[0.42857143 0.5        0.57142857]
  [0.21428571 0.28571429 0.35714286]
  [0.         0.07142857 0.14285714]
  [0.64285714 0.71428571 0.78571429]]]
out error:  1.0000000094736443e-08

Word embedding: backward

Checking implementation of the function backward in the word_embedding class
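
The backward pass scatter-adds the upstream gradients back onto the embedding rows; rows used more than once in the batch accumulate their gradients. A sketch:

import numpy as np

def word_embedding_backward_sketch(dout, x, V):
    # dout: (N, T, D) upstream gradients; x: (N, T) indices from the forward pass
    dW = np.zeros((V, dout.shape[-1]))
    np.add.at(dW, x, dout)  # unbuffered scatter-add handles repeated indices correctly
    return dW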

In [7]:
%reload_ext autoreload

np.random.seed(231)

N, T, V, D = 50, 3, 5, 6

we = word_embedding(V, D, name="we")

x = np.random.randint(V, size=(N, T))
W = np.random.randn(V, D)

we.params[we.w_name] = W

out = we.forward(x)
dout = np.random.randn(*out.shape)
we.backward(dout)

dW = we.grads[we.w_name]

f = lambda W: we.forward(x)
dW_num = eval_numerical_gradient_array(f, W, dout)

print('dW error: ', rel_error(dW, dW_num))
dW error:  3.2759440934795915e-12

Temporal Fully Connected layer

Checking the implementation of the temporal_fc class, which defines a layer that applies an affine transform to the RNN hidden vector at each timestep to produce scores for each word in the vocabulary
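
Since the same affine transform is shared across all timesteps, the (N, T, D) input can be collapsed to (N*T, D), pushed through one matrix multiply, and reshaped back. A sketch of the forward direction:

import numpy as np

def temporal_fc_forward_sketch(x, w, b):
    # x: (N, T, D) hidden vectors; w: (D, M); b: (M,)
    N, T, D = x.shape
    out = x.reshape(N * T, D).dot(w) + b   # one matmul for all timesteps
    return out.reshape(N, T, -1)           # (N, T, M) scores, one per vocabulary word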

In [8]:
%reload_ext autoreload

np.random.seed(231)

# Gradient check for temporal affine layer
N, T, D, M = 2, 3, 4, 5

t_fc = temporal_fc(D, M, init_scale=0.02, name='test_t_fc')

x = np.random.randn(N, T, D)
w = np.random.randn(D, M)
b = np.random.randn(M)

t_fc.params[t_fc.w_name] = w
t_fc.params[t_fc.b_name] = b

out = t_fc.forward(x)

dout = np.random.randn(*out.shape)

dx_num = eval_numerical_gradient_array(lambda x: t_fc.forward(x), x, dout)
dw_num = eval_numerical_gradient_array(lambda w: t_fc.forward(x), w, dout)
db_num = eval_numerical_gradient_array(lambda b: t_fc.forward(x), b, dout)

dx = t_fc.backward(dout)
dw = t_fc.grads[t_fc.w_name]
db = t_fc.grads[t_fc.b_name]

print('dx error: ', rel_error(dx_num, dx))
print('dw error: ', rel_error(dw_num, dw))
print('db error: ', rel_error(db_num, db))
dx error:  3.2269470390098687e-10
dw error:  3.8595619942595054e-11
db error:  1.1455396263586309e-11

Temporal Softmax Cross-Entropy loss

When rolling out an RNN language model to generate a sentence, a score for each word in the vocabulary is produced at every timestep. This score is proportional to the predicted likelihood of that word appearing at that particular timestep in the sentence. Because the ground-truth word at each timestep is known, a softmax cross-entropy loss function is used to:

  • Compute a proper probability distribution over the words in the vocabulary at every time step
  • Compute loss and gradient at each timestep. We sum the losses over time and average them over the minibatch.

(Loss function: temporal_softmax_CE_loss in lib/layer_utils.py; a sketch follows.)
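
A minimal sketch of such a masked temporal softmax cross-entropy (illustrative; the class in lib/layer_utils.py may organize the computation differently):

import numpy as np

def temporal_softmax_ce_sketch(x, y, mask):
    # x: (N, T, V) scores; y: (N, T) ground-truth indices; mask: (N, T) booleans
    N, T, V = x.shape
    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    m_flat = mask.reshape(N * T)

    shifted = x_flat - x_flat.max(axis=1, keepdims=True)   # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    loss = -np.sum(m_flat * np.log(probs[np.arange(N * T), y_flat])) / N

    dx_flat = probs.copy()
    dx_flat[np.arange(N * T), y_flat] -= 1
    dx_flat *= m_flat[:, None] / N                          # masked steps get zero gradient
    return loss, dx_flat.reshape(N, T, V)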

In [9]:
%reload_ext autoreload

loss_func = temporal_softmax_CE_loss()

# Sanity check for temporal softmax loss
N, T, V = 100, 1, 10

def check_loss(N, T, V, p):
    x = 0.001 * np.random.randn(N, T, V)
    y = np.random.randint(V, size=(N, T))
    mask = np.random.rand(N, T) <= p
    print(loss_func.forward(x, y, mask))
  
check_loss(100, 1, 10, 1.0)   # Should be about 2.3
check_loss(100, 10, 10, 1.0)  # Should be about 23
check_loss(5000, 10, 10, 0.1) # Should be about 2.3

# Gradient check for temporal softmax loss
N, T, V = 7, 8, 9

x = np.random.randn(N, T, V)
y = np.random.randint(V, size=(N, T))
mask = (np.random.rand(N, T) > 0.5)

loss = loss_func.forward(x, y, mask)
dx = loss_func.backward()

dx_num = eval_numerical_gradient(lambda x: loss_func.forward(x, y, mask), x, verbose=False)

print('dx error: ', rel_error(dx, dx_num))
2.3026547279318357
23.026307039328714
2.2989009292538665
dx error:  4.0464746298031226e-08

RNN for language modeling

Check the forward and backward pass of the TestRNN class using a small test case

In [10]:
%reload_ext autoreload

N, D, H = 10, 20, 40
V = 4
T = 13

model = TestRNN(D, H, cell_type='rnn')
loss_func = temporal_softmax_CE_loss()

# Set all model parameters to fixed values
for k, v in model.params.items():
    model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)
model.assign_params()

features = np.linspace(-1.5, 0.3, num=(N * D * T)).reshape(N, T, D)
h0 = np.linspace(-1.5, 0.5, num=(N*H)).reshape(N, H)
labels = (np.arange(N * T) % V).reshape(N, T)

pred = model.forward(features, h0)

# You'll need this
mask = np.ones((N, T))

loss = loss_func.forward(pred, labels, mask)
dLoss = loss_func.backward()

expected_loss = 51.0949189134

print('loss: ', loss)
print('expected loss: ', expected_loss)
print('difference: ', abs(loss - expected_loss))
self.params[self.wx_name]  (20, 40)
self.params[self.wh_name]:  (40, 40)
self.params[self.b_name]:  (40,)
loss:  51.094918913361184
expected loss:  51.0949189134
difference:  3.881694965457427e-11

Detailed gradient checking on the backward pass of the TestRNN class

In [11]:
%reload_ext autoreload

np.random.seed(231)

batch_size = 2
timesteps = 3
input_dim = 4
hidden_dim = 6
label_size = 4

labels = np.random.randint(label_size, size=(batch_size, timesteps))
features = np.random.randn(batch_size, timesteps, input_dim)
h0 = np.random.randn(batch_size, hidden_dim)

model = TestRNN(input_dim, hidden_dim, cell_type='rnn')
loss_func = temporal_softmax_CE_loss()

pred = model.forward(features, h0)

# You'll need this
mask = np.ones((batch_size, timesteps))

loss = loss_func.forward(pred, labels, mask)
dLoss = loss_func.backward()

dout, dh0 = model.backward(dLoss)

grads = model.grads

for param_name in sorted(grads):
    f = lambda _: loss_func.forward(model.forward(features, h0), labels, mask)
    param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)
    e = rel_error(param_grad_num, grads[param_name])
    print('%s relative error: %e' % (param_name, e))
self.params[self.wx_name]  (4, 6)
self.params[self.wh_name]:  (6, 6)
self.params[self.b_name]:  (6,)
vanilla_rnn_b relative error: 9.451394e-08
vanilla_rnn_wh relative error: 3.221744e-08
vanilla_rnn_wx relative error: 9.508480e-08

LSTM: Theory

Vanilla RNNs can be tough to train on long sequences due to vanishing and exploding gradients. LSTMs address this problem by replacing the simple update rule in the forward step of the vanilla RNN with a gating mechanism, as follows.

Similar to the vanilla RNN, at each timestep we receive an input $x_t\in\mathbb{R}^D$ and the previous hidden state $h_{t-1}\in\mathbb{R}^H$. Crucially, the LSTM also maintains an $H$-dimensional cell state, so we also receive the previous cell state $c_{t-1}\in\mathbb{R}^H$. The learnable parameters of the LSTM are an input-to-hidden matrix $W_x\in\mathbb{R}^{4H\times D}$, a hidden-to-hidden matrix $W_h\in\mathbb{R}^{4H\times H}$ and a bias vector $b\in\mathbb{R}^{4H}$.

At each timestep we first compute an activation vector $a\in\mathbb{R}^{4H}$ as $a=W_xx_t + W_hh_{t-1}+b$. We then divide this into four vectors $a_i,a_f,a_o,a_g\in\mathbb{R}^H$, where $a_i$ consists of the first $H$ elements of $a$, $a_f$ is the next $H$ elements of $a$, and so on. We then compute the input gate $i\in\mathbb{R}^H$, forget gate $f\in\mathbb{R}^H$, output gate $o\in\mathbb{R}^H$, and gate gate $g\in\mathbb{R}^H$ as

$$ \begin{align*} i = \sigma(a_i) \hspace{2pc} f = \sigma(a_f) \hspace{2pc} o = \sigma(a_o) \hspace{2pc} g = \tanh(a_g) \end{align*} $$

where $\sigma$ is the sigmoid function and $\tanh$ is the hyperbolic tangent, both applied elementwise.

Finally we compute the next cell state $c_t$ and next hidden state $h_t$ as

$$ c_{t} = f\odot c_{t-1} + i\odot g \hspace{4pc} h_t = o\odot\tanh(c_t) $$

where $\odot$ is the elementwise product of vectors.

In the rest of the notebook we will implement the LSTM update rule and apply it to the text generation task.

In the code, we assume that data is stored in batches so that $X_t \in \mathbb{R}^{N\times D}$, and we work with transposed versions of the parameters: $W_x \in \mathbb{R}^{D \times 4H}$ and $W_h \in \mathbb{R}^{H\times 4H}$, so that the activations $A \in \mathbb{R}^{N\times 4H}$ can be computed efficiently as $A = X_t W_x + H_{t-1} W_h + b$.
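
Putting the equations and the transposed layout together, a single LSTM step can be sketched as follows (an illustration of the update rule; the LSTM class may differ in how it caches intermediates):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward_sketch(x, prev_h, prev_c, Wx, Wh, b):
    # x: (N, D), prev_h/prev_c: (N, H), Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
    H = prev_h.shape[1]
    a = x.dot(Wx) + prev_h.dot(Wh) + b       # (N, 4H) activation vector
    i = sigmoid(a[:, 0*H:1*H])               # input gate
    f = sigmoid(a[:, 1*H:2*H])               # forget gate
    o = sigmoid(a[:, 2*H:3*H])               # output gate
    g = np.tanh(a[:, 3*H:4*H])               # "gate gate" (candidate cell update)
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    return next_h, next_c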

LSTM: step forward

Testing lstm.step_forward for a single timestep

In [12]:
%reload_ext autoreload

N, D, H = 3, 4, 5

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.linspace(-0.4, 1.2, num=N*D).reshape(N, D)
prev_h = np.linspace(-0.3, 0.7, num=N*H).reshape(N, H)
prev_c = np.linspace(-0.4, 0.9, num=N*H).reshape(N, H)
Wx = np.linspace(-2.1, 1.3, num=4*D*H).reshape(D, 4 * H)
Wh = np.linspace(-0.7, 2.2, num=4*H*H).reshape(H, 4 * H)
b = np.linspace(0.3, 0.7, num=4*H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

next_h, next_c, cache = lstm.step_forward(x, prev_h, prev_c)

expected_next_h = np.asarray([
    [ 0.24635157,  0.28610883,  0.32240467,  0.35525807,  0.38474904],
    [ 0.49223563,  0.55611431,  0.61507696,  0.66844003,  0.7159181 ],
    [ 0.56735664,  0.66310127,  0.74419266,  0.80889665,  0.858299  ]])
expected_next_c = np.asarray([
    [ 0.32986176,  0.39145139,  0.451556,    0.51014116,  0.56717407],
    [ 0.66382255,  0.76674007,  0.87195994,  0.97902709,  1.08751345],
    [ 0.74192008,  0.90592151,  1.07717006,  1.25120233,  1.42395676]])

print('next_h error: ', rel_error(expected_next_h, next_h))
print('next_c error: ', rel_error(expected_next_c, next_c))
next_h error:  5.7054131185818695e-09
next_c error:  5.8143123088804145e-09

LSTM: step backward

Testing lstm.step_backward for a single timestep

In [13]:
%reload_ext autoreload

np.random.seed(231)

N, D, H = 4, 5, 6

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.random.randn(N, D)
prev_h = np.random.randn(N, H)
prev_c = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.random.randn(4 * H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

next_h, next_c, cache = lstm.step_forward(x, prev_h, prev_c)

dnext_h = np.random.randn(*next_h.shape)
dnext_c = np.random.randn(*next_c.shape)

fx_h = lambda x: lstm.step_forward(x, prev_h, prev_c)[0]
fh_h = lambda h: lstm.step_forward(x, prev_h, prev_c)[0]
fc_h = lambda c: lstm.step_forward(x, prev_h, prev_c)[0]
fWx_h = lambda Wx: lstm.step_forward(x, prev_h, prev_c)[0]
fWh_h = lambda Wh: lstm.step_forward(x, prev_h, prev_c)[0]
fb_h = lambda b: lstm.step_forward(x, prev_h, prev_c)[0]

fx_c = lambda x: lstm.step_forward(x, prev_h, prev_c)[1]
fh_c = lambda h: lstm.step_forward(x, prev_h, prev_c)[1]
fc_c = lambda c: lstm.step_forward(x, prev_h, prev_c)[1]
fWx_c = lambda Wx: lstm.step_forward(x, prev_h, prev_c)[1]
fWh_c = lambda Wh: lstm.step_forward(x, prev_h, prev_c)[1]
fb_c = lambda b: lstm.step_forward(x, prev_h, prev_c)[1]

num_grad = eval_numerical_gradient_array

dx_num = num_grad(fx_h, x, dnext_h) + num_grad(fx_c, x, dnext_c)
dh_num = num_grad(fh_h, prev_h, dnext_h) + num_grad(fh_c, prev_h, dnext_c)
dc_num = num_grad(fc_h, prev_c, dnext_h) + num_grad(fc_c, prev_c, dnext_c)
dWx_num = num_grad(fWx_h, Wx, dnext_h) + num_grad(fWx_c, Wx, dnext_c)
dWh_num = num_grad(fWh_h, Wh, dnext_h) + num_grad(fWh_c, Wh, dnext_c)
db_num = num_grad(fb_h, b, dnext_h) + num_grad(fb_c, b, dnext_c)

dx, dh, dc, dWx, dWh, db = lstm.step_backward(dnext_h, dnext_c, cache)

print('dx error: ', rel_error(dx_num, dx))
print('dh error: ', rel_error(dh_num, dh))
print('dc error: ', rel_error(dc_num, dc))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))
dx error:  7.10174722192564e-10
dh error:  1.02587271120523e-08
dc error:  1.0127281079074958e-08
dWx error:  7.155327183583897e-08
dWh error:  9.784434021608716e-08
db error:  1.867169717722288e-08

LSTM: forward

Testing lstm.forward for an entire timeseries of data.

In [14]:
%reload_ext autoreload

N, D, H, T = 2, 5, 4, 3

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.linspace(-0.4, 0.6, num=N*T*D).reshape(N, T, D)
h0 = np.linspace(-0.4, 0.8, num=N*H).reshape(N, H)
Wx = np.linspace(-0.2, 0.9, num=4*D*H).reshape(D, 4 * H)
Wh = np.linspace(-0.3, 0.6, num=4*H*H).reshape(H, 4 * H)
b = np.linspace(0.2, 0.7, num=4*H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

h = lstm.forward(x, h0)

expected_h = np.asarray([
 [[ 0.01764008,  0.01823233,  0.01882671,  0.0194232 ],
  [ 0.11287491,  0.12146228,  0.13018446,  0.13902939],
  [ 0.31358768,  0.33338627,  0.35304453,  0.37250975]],
 [[ 0.45767879,  0.4761092,   0.4936887,   0.51041945],
  [ 0.6704845,   0.69350089,  0.71486014,  0.7346449 ],
  [ 0.81733511,  0.83677871,  0.85403753,  0.86935314]]])

print('h error: ', rel_error(expected_h, h))
h error:  8.610537442272635e-08

LSTM: backward

Testing lstm.backward for an entire timeseries of data.

In [15]:
%reload_ext autoreload

np.random.seed(231)

N, D, T, H = 2, 3, 10, 6

lstm = LSTM(D, H, init_scale=0.02, name='test_lstm')

x = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, 4 * H)
Wh = np.random.randn(H, 4 * H)
b = np.random.randn(4 * H)

lstm.params[lstm.wx_name] = Wx
lstm.params[lstm.wh_name] = Wh
lstm.params[lstm.b_name] = b

out = lstm.forward(x, h0)

dout = np.random.randn(*out.shape)

dx, dh0 = lstm.backward(dout)
dWx = lstm.grads[lstm.wx_name] 
dWh = lstm.grads[lstm.wh_name]
db = lstm.grads[lstm.b_name]

dx_num = eval_numerical_gradient_array(lambda x: lstm.forward(x, h0), x, dout)
dh0_num = eval_numerical_gradient_array(lambda h0: lstm.forward(x, h0), h0, dout)
dWx_num = eval_numerical_gradient_array(lambda Wx: lstm.forward(x, h0), Wx, dout)
dWh_num = eval_numerical_gradient_array(lambda Wh: lstm.forward(x, h0), Wh, dout)
db_num = eval_numerical_gradient_array(lambda b: lstm.forward(x, h0), b, dout)

print('dx error: ', rel_error(dx_num, dx))
print('dh0 error: ', rel_error(dh0_num, dh0))
print('dWx error: ', rel_error(dWx_num, dWx))
print('dWh error: ', rel_error(dWh_num, dWh))
print('db error: ', rel_error(db_num, db))
dx error:  1.1836090780211367e-09
dh0 error:  4.534953544517244e-10
dWx error:  1.6965522923029384e-09
dWh error:  1.310330323958345e-07
db error:  3.715229506682041e-10

LSTM model

Testing the LSTM implementation using TestRNN with cell_type='lstm'

In [16]:
%reload_ext autoreload

N, D, H = 10, 20, 40
V = 4
T = 13

model = TestRNN(D, H, cell_type='lstm')
loss_func = temporal_softmax_CE_loss()

# Set all model parameters to fixed values
for k, v in model.params.items():
    model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)
model.assign_params()

features = np.linspace(-1.5, 0.3, num=(N * D * T)).reshape(N, T, D)
h0 = np.linspace(-1.5, 0.5, num=(N*H)).reshape(N, H)
labels = (np.arange(N * T) % V).reshape(N, T)

pred = model.forward(features, h0)

# You'll need this
mask = np.ones((N, T))

loss = loss_func.forward(pred, labels, mask)
dLoss = loss_func.backward()

expected_loss = 49.2140256354

print('loss: ', loss)
print('expected loss: ', expected_loss)
print('difference: ', abs(loss - expected_loss))
loss:  49.21402563544293
expected loss:  49.2140256354
difference:  4.293099209462525e-11

Generating Text

Train RNN on Alice's Adventures in Wonderland (Text Source: link, Project Gutenberg)

(To simplify training, only the first chapter is used here)

In [17]:
%reload_ext autoreload

input_file = open("data/alice.txt", "r")
input_text = input_file.readlines()
input_text = ''.join(input_text)

Construct the training dataset.

In [31]:
%reload_ext autoreload

import re

text = re.split(' |\n',input_text.lower())  # all words are converted into lower case
outputSize = len(text)
word_list = list(set(text))
dataSize = len(word_list)
output = np.zeros(outputSize)
for i in range(0, outputSize):
    index = np.where(np.asarray(word_list) == text[i])
    output[i] = index[0]
data = output.astype(int)  # word indices
gt_labels = data[1:]
input_data = data[:-1]

print('Input text size: %s' % outputSize)
print('Input word number: %s' % dataSize)
Input text size: 2170
Input word number: 778

Create and train an instance of the LanguageModelRNN class defined in lib/rnn.py

RNN Architecture (a rough forward-pass sketch follows the list):

  • a word_embedding layer
  • recurrent unit
  • temporal fully connected layer
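
A rough sketch of how the three layers chain in the forward pass (illustrative; the actual LanguageModelRNN wiring in lib/rnn.py may differ in detail):

def language_model_forward_sketch(word_idx, h0, embed, recurrent, classifier):
    # word_idx: (N, T) word indices -> embeddings -> recurrent unit -> vocabulary scores
    wordvecs = embed.forward(word_idx)        # (N, T)    -> (N, T, D)
    hidden = recurrent.forward(wordvecs, h0)  # (N, T, D) -> (N, T, H)
    scores = classifier.forward(hidden)       # (N, T, H) -> (N, T, V)
    return scores                             # fed to temporal_softmax_CE_loss
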
In [36]:
%reload_ext autoreload

# you can change the following parameters.
D = 10  # input dimension
H = 20  # hidden space dimension
T = 100  # timesteps
N = 2 # batch size
max_epoch = 50  # max epoch size

loss_func = temporal_softmax_CE_loss()
# you can change the cell_type between 'rnn' and 'lstm'.
model = LanguageModelRNN(dataSize, D, H, cell_type='lstm')
optimizer = Adam(model, 5e-4)

data = {'data_train': input_data, 'labels_train': gt_labels}

results = train_net(data, model, loss_func, optimizer, timesteps=T, batch_size=N, max_epochs=max_epoch, verbose=True)
(Iteration 1 / 54200) loss: 665.7069847501157
(Iteration 501 / 54200) loss: 575.1992073450804
(Iteration 1001 / 54200) loss: 604.9944860569382
best performance 3.5961272475795294%
...
(Epoch 45 / 50) Training Accuracy: 0.9907791609036423
(Iteration 49001 / 54200) loss: 15.2759378574619
(Iteration 49501 / 54200) loss: 17.16236474943849
(Epoch 46 / 50) Training Accuracy: 0.9903181189488244
(Iteration 50001 / 54200) loss: 17.259392181153558
(Iteration 50501 / 54200) loss: 23.048019511928572
best performance 99.12402028584602%
(Epoch 47 / 50) Training Accuracy: 0.9912402028584602
(Iteration 51001 / 54200) loss: 9.120802359632746
(Iteration 51501 / 54200) loss: 13.80238781450254
(Iteration 52001 / 54200) loss: 19.713389523552195
best performance 99.1701244813278%
(Epoch 48 / 50) Training Accuracy: 0.991701244813278
(Iteration 52501 / 54200) loss: 13.116221374379055
(Iteration 53001 / 54200) loss: 7.700536077057679
best performance 99.21622867680959%
(Epoch 49 / 50) Training Accuracy: 0.9921622867680959
(Iteration 53501 / 54200) loss: 11.376942193152779
(Iteration 54001 / 54200) loss: 14.99557281489629
best performance 99.26233287229138%
(Epoch 50 / 50) Training Accuracy: 0.9926233287229138

Check the loss and accuracy curve. Modify hyperparameters as needed to improve accuracy.

In [37]:
%reload_ext autoreload

opt_params, loss_hist, train_acc_hist = results

# Plot the learning curves
plt.subplot(2, 1, 1)
plt.title('Training loss')
loss_hist_ = loss_hist[1::100]  # subsample the loss curve for readability
plt.plot(loss_hist_, '-o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(train_acc_hist, '-o', label='Training')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)

plt.show()

Generate text using the trained model

Example of expected output:

she was dozing off, and book-shelves; here and she tried to curtsey as she spoke--fancy curtseying as you’re falling through the little door into a dreamy sort of way, ‘do cats eat bats? do cats eat bats?’ and sometimes,
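
model.sample rolls the trained model out one word at a time, feeding each predicted word back in as the next input. A hedged sketch of such a loop, using hypothetical per-step helpers (embed_step, recurrent_step, score_step) and greedy argmax decoding; the real model.sample may instead sample from the softmax distribution:

import numpy as np

def sample_sketch(start_idx, length, embed_step, recurrent_step, score_step, h0):
    # embed_step, recurrent_step, and score_step are hypothetical stand-ins for the
    # model's own layers applied to a single timestep.
    words = [start_idx]
    h = h0
    idx = start_idx
    for _ in range(length):
        wordvec = embed_step(idx)          # embedding of the most recent word
        h = recurrent_step(wordvec, h)     # advance the hidden state one step
        scores = score_step(h)             # (V,) vocabulary scores
        idx = int(np.argmax(scores))       # greedy choice of the next word
        words.append(idx)
    return words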

In [43]:
%reload_ext autoreload

# you can change the generated text length below.
text_length = 100
idx = 0
# you can also start from a specific word;
# the vocabulary was lower-cased, so look the word up in lower case
idx = int(np.where(np.asarray(word_list) == 'She'.lower())[0])

# sample from the trained model
words = model.sample(idx, text_length-1)

# convert indices into words
output = [word_list[i] for i in words]
print(' '.join(output))
she was now the right size for going through the little door into that lovely garden. first, however, she waited for a few minutes to see if she was going to shrink any further: she felt a little nervous about this; ‘for it might end, you know,’ said alice to herself, ‘in my going out altogether, like a candle. i wonder what i should be like then?’ and she tried to fancy what the flame of a candle is like after the candle is blown out, for she could not remember ever having seen such a thing.  after a

Observations and Reflections

Vanilla RNNs perform worse than LSTMs for the same hyperparameters. Text generated from RNN models can be more repetitive, and the repetition becomes more noticeable the greater the difference between the batch size used during training and the generated text length. Moreover, the text often does not make sense semantically. Increasing the timesteps of the RNN technically improved the objective accuracy (>99%); however, the resulting generated text made no sense (e.g., T=100 RNN: "she croquet cried. lost: among got lying nice, bats? coming sitting curtsey that; burn plainly see, because thump! ...").

On the other hand, I was actually pretty impressed with the text generated by the LSTM model. Although its training accuracy was only slightly higher than that of the RNN model, the generated text made significantly more sense across all settings. Changing the timestep and text length settings didn't really affect the quality of the results. However, modifying the batch size improved accuracy scores from ~50% to >98%, and this improvement in training accuracy corresponded to significantly better text generation.

The limitations of the vanilla RNN could be attributed to the lack of both a cell state and a forget gate. Without a cell state, the RNN has no long-term persistent memory, so consistency between sentences/phrases is not maintained. Also, the lack of a forget gate, combined with the recurrent nature of the model, means that longer training sequences (larger T) may actually hurt performance and leave the model susceptible to outliers, as the weights of uncommon patterns are propagated throughout the network, obscuring earlier learned patterns.

Sample observations: LSTM, T = 50 timesteps, N = 2 batch size. (Iteration 54001 / 54200) loss: 19.704904365099825; best performance 98.2941447671738%; (Epoch 50 / 50) Training Accuracy: 0.9829414476717381

LSTM, T = 100 timesteps, N = 2 batch size (the training run shown above). (Iteration 54001 / 54200) loss: 14.99557281489629; best performance 99.26233287229138%; (Epoch 50 / 50) Training Accuracy: 0.9926233287229138. Generated text: she was now the right size for going through the little door into that lovely garden. first, however, she waited for a few minutes to see if she was going to shrink any further: she felt a little nervous about this; ‘for it might end, you know,’ said alice to herself, ‘in my going out altogether, like a candle. i wonder what i should be like then?’ and she tried to fancy what the flame of a candle is like after the candle is blown out, for she could not remember ever having seen such a thing. after a

Sample observations: RNN, batch size = 2, 40 words. Training Accuracy: 0.9511295527893038. Generated text: she was now the right size for going through the little door about fifteen inches high: she tried the little door about fifteen inches high: she tried the little door about fifteen inches high: she tried the little door about

RNN, batch size = 10, 100 words, 10800 iterations. Best performance 72.52189949285385%; (Epoch 50 / 50) Training Accuracy: 0.7252189949285385. Generated text: she had never before seen one to be two people! why, there’s hardly enough of the well, and seemed began talking again. ‘dinah’ll miss me very much to-night, i shall have to ask them what the flame of a book,’ thought poor alice, ‘it would be shutting and was not a very good opportunity for showing off her knowledge, as she had never before seen one to be two people! why, there’s hardly enough of the well, and seemed began talking again. ‘dinah’ll miss me very much to-night, i shall have to ask them what the flame of a book,’