kacey@ieee.org

Image Classification Using Convolutional Neural Networks

Tasks:

  • Build a basic multi-layer fully-connected NN from scratch and train it on the CIFAR-10 dataset
  • Implement and compare different optimizers and activation functions
  • Observe the effects of dropout

Note: The objective of this notebook is to understand the components of an image classification model. Therefore, although I would normally use an established deep learning framework such as TensorFlow or PyTorch to build and train the model, here I define my own library of functions in the lib/ folder and test them in this notebook.

Setup

In [1]:
# Instead of using TensorFlow or PyTorch, we will define and use our own library of functions
from lib.fully_conn import *
from lib.layer_utils import *
from lib.grad_check import *
from lib.datasets import *
from lib.optim import *
from lib.train import *

import numpy as np
import matplotlib.pyplot as plt

# set default plotting parameters
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) 
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# for auto-reloading external modules
%load_ext autoreload
%autoreload 2

Loading the data (CIFAR-10)

Run the following commands to download the CIFAR-10 dataset and extract it into data/; the next cell then loads the pre-split data.

wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz -P data
tar -xzvf data/cifar-10-python.tar.gz --directory data
rm data/cifar-10-python.tar.gz

Load the dataset.

In [3]:
data = CIFAR10_data()
for k, v in data.items():
    print ("Name: {} Shape: {}".format(k, v.shape))
Name: data_train Shape: (49000, 3, 32, 32)
Name: labels_train Shape: (49000,)
Name: data_val Shape: (1000, 3, 32, 32)
Name: labels_val Shape: (1000,)
Name: data_test Shape: (1000, 3, 32, 32)
Name: labels_test Shape: (1000,)

Training a Network

Train a 2-layer FC network with ReLU activation using the TinyNet class defined in lib/fully_conn.py; a plain-NumPy sketch of this forward pass follows the architecture below.

  • (Flatten --> FC --> ReLU --> FC)
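
To make the architecture concrete, here is a plain-NumPy sketch of the computation this network performs. It is an illustration only; the actual layer classes live in lib/layer_utils.py and lib/fully_conn.py, and their internals may differ.

import numpy as np

def tiny_net_forward(x, w1, b1, w2, b2):
    """Forward pass of Flatten -> FC -> ReLU -> FC on a batch of images."""
    flat = x.reshape(x.shape[0], -1)         # (N, 3, 32, 32) -> (N, 3072)
    hidden = np.maximum(flat @ w1 + b1, 0)   # first FC layer + ReLU -> (N, 900)
    return hidden @ w2 + b2                  # second FC layer -> class scores (N, 10)

# Toy usage with the dimensions used later in this notebook (3072 -> 900 -> 10)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 32, 32))
w1, b1 = 0.01 * rng.standard_normal((3072, 900)), np.zeros(900)
w2, b2 = 0.01 * rng.standard_normal((900, 10)), np.zeros(10)
print(tiny_net_forward(x, w1, b1, w2, b2).shape)   # (4, 10)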
In [14]:
# Arrange the data
data_dict = {
    "data_train": (data["data_train"], data["labels_train"]),
    "data_val": (data["data_val"], data["labels_val"]),
    "data_test": (data["data_test"], data["labels_test"])
}
In [15]:
print(data["data_train"].shape)
print(data["labels_train"].shape)
(49000, 3, 32, 32)
(49000,)

Hyperparameters

Note: Since the model is very basic, we don't expect high accuracy; reaching at least 50% validation accuracy is enough to confirm that the functions work.
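
The lr_decay and lr_decay_every arguments used in the next cell implement a simple step schedule: every lr_decay_every epochs the learning rate is multiplied by lr_decay. A minimal sketch of that idea (the function name is illustrative; the real logic lives in lib/train.py and applies the decay to the optimizer in place):

def decayed_lr(base_lr, epoch, lr_decay=0.1, lr_decay_every=5):
    """Step schedule: multiply the base rate by lr_decay every lr_decay_every epochs."""
    return base_lr * (lr_decay ** (epoch // lr_decay_every))

print([decayed_lr(1e-4, e) for e in (0, 5, 10, 15)])  # ~1e-4, 1e-5, 1e-6, 1e-7, matching the log below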

In [16]:
%reload_ext autoreload

model = TinyNet()
loss_f = cross_entropy()
optimizer = SGD(model.net, 1e-4)

results = None

batch_size = 16
epochs = 20
lr_decay = .1
lr_decay_every = 5

# TinyNet dimensions: flattened 3x32x32 input -> 900 hidden units -> 10 classes
input_size=3072
hidden_dim=900
output_dim=10

results = train_net(data_dict, model, loss_f, optimizer, batch_size, epochs, 
                    lr_decay, lr_decay_every, show_every=10000, verbose=True)
opt_params, loss_hist, train_acc_hist, val_acc_hist = results
  • iterations per epoch: 3062
    max iterations:  61240
    (Iteration 1 / 61240) loss: 10.36330541318323
    (Epoch 1 / 20) Training Accuracy: 0.47075510204081633, Validation Accuracy: 0.426
    (Epoch 2 / 20) Training Accuracy: 0.46, Validation Accuracy: 0.43
    (Epoch 3 / 20) Training Accuracy: 0.48773469387755103, Validation Accuracy: 0.42
    (Iteration 10001 / 61240) loss: 1.146825184245162
    (Epoch 4 / 20) Training Accuracy: 0.5551020408163265, Validation Accuracy: 0.432
    (Epoch 5 / 20) Training Accuracy: 0.5669387755102041, Validation Accuracy: 0.471
    Decaying learning rate of the optimizer to 1e-05
    (Epoch 6 / 20) Training Accuracy: 0.6768163265306123, Validation Accuracy: 0.518
    (Iteration 20001 / 61240) loss: 0.641558470822271
    (Epoch 7 / 20) Training Accuracy: 0.6909387755102041, Validation Accuracy: 0.515
    (Epoch 8 / 20) Training Accuracy: 0.6970612244897959, Validation Accuracy: 0.502
    (Epoch 9 / 20) Training Accuracy: 0.7028979591836735, Validation Accuracy: 0.497
    (Iteration 30001 / 61240) loss: 1.0630496158145197
    (Epoch 10 / 20) Training Accuracy: 0.707469387755102, Validation Accuracy: 0.516
    Decaying learning rate of the optimizer to 1.0000000000000002e-06
    (Epoch 11 / 20) Training Accuracy: 0.7129183673469388, Validation Accuracy: 0.521
    (Epoch 12 / 20) Training Accuracy: 0.7136122448979592, Validation Accuracy: 0.512
    (Epoch 13 / 20) Training Accuracy: 0.7144081632653061, Validation Accuracy: 0.518
    (Iteration 40001 / 61240) loss: 0.8582956537608635
    (Epoch 14 / 20) Training Accuracy: 0.7155714285714285, Validation Accuracy: 0.515
    (Epoch 15 / 20) Training Accuracy: 0.7155510204081633, Validation Accuracy: 0.518
    Decaying learning rate of the optimizer to 1.0000000000000002e-07
    (Epoch 16 / 20) Training Accuracy: 0.7154081632653061, Validation Accuracy: 0.517
    (Iteration 50001 / 61240) loss: 0.8224054741870919
    (Epoch 17 / 20) Training Accuracy: 0.7156530612244898, Validation Accuracy: 0.517
    (Epoch 18 / 20) Training Accuracy: 0.7157142857142857, Validation Accuracy: 0.516
    (Epoch 19 / 20) Training Accuracy: 0.7158367346938775, Validation Accuracy: 0.515
    (Iteration 60001 / 61240) loss: 1.1244308507433312
    (Epoch 20 / 20) Training Accuracy: 0.7158367346938775, Validation Accuracy: 0.515
    
In [18]:
# Demo: how to load the saved parameters into a newly defined network
model = TinyNet()
model.net.load(opt_params)
val_acc = compute_acc(model, data["data_val"], data["labels_val"])
print ("Validation Accuracy: {}%".format(val_acc*100))
test_acc = compute_acc(model, data["data_test"], data["labels_test"])
print ("Testing Accuracy: {}%".format(test_acc*100))
  • Loading Params: fc1_w Shape: (3072, 900)
    Loading Params: fc1_b Shape: (900,)
    Loading Params: fc2_w Shape: (900, 10)
    Loading Params: fc2_b Shape: (10,)
    Validation Accuracy: 51.5%
    Testing Accuracy: 50.5%
    
In [19]:
# Plot the learning curves
plt.subplot(2, 1, 1)
plt.title('Training loss')
loss_hist_ = loss_hist[1::100] # thin out the curve a bit
plt.plot(loss_hist_, '-o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(train_acc_hist, '-o', label='Training')
plt.plot(val_acc_hist, '-o', label='Validation')
plt.plot([0.5] * len(val_acc_hist), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)
plt.show()
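
As an aside, the compute_acc helper used in the parameter-loading demo above is assumed to run a forward pass over the data and report the fraction of correct argmax predictions. A minimal stand-in sketch under that assumption (compute_acc_sketch and the predict_scores callable are hypothetical names, not part of the library):

import numpy as np

def compute_acc_sketch(predict_scores, data, labels, batch_size=100):
    """Fraction of examples whose argmax score matches the label."""
    # predict_scores: callable mapping a batch of images to (N, 10) class scores
    correct = 0
    for i in range(0, data.shape[0], batch_size):
        scores = predict_scores(data[i:i + batch_size])
        correct += np.sum(np.argmax(scores, axis=1) == labels[i:i + batch_size])
    return correct / data.shape[0]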

Testing Different Optimizers

SGD + Momentum

SGDM() optimizer in lib/optim.py

The update rule of SGD plus momentum is as shown below:
\begin{equation} v_t: velocity\ at\ update\ step\ t \\ \gamma: momentum \\ \eta: learning\ rate \\ v_t = \gamma v_{t-1} - \eta \nabla_{\theta}J(\theta) \\ \theta = \theta + v_t \end{equation}
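
A minimal NumPy sketch of this update for a single parameter array (illustrative only; the SGDM optimizer in lib/optim.py keeps one velocity buffer per parameter):

import numpy as np

def sgdm_step(w, dw, v, lr=1e-3, momentum=0.9):
    """One SGD-with-momentum update on a single parameter array."""
    v = momentum * v - lr * dw    # v_t = gamma * v_{t-1} - eta * grad
    w = w + v                     # theta <- theta + v_t
    return w, v

# Tiny check against the first entry of the unit test below
w, v = sgdm_step(np.array(-0.4), np.array(-0.6), np.array(0.6))
print(w, v)   # ~0.1406, ~0.5406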

In [20]:
%reload_ext autoreload

# Test the implementation of SGD with Momentum
seed = 123
np.random.seed(seed=seed)

N, D = 4, 5
test_sgd = sequential(fc(N, D, name="sgd_fc"))

w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

test_sgd.layers[0].params = {"sgd_fc_w": w}
test_sgd.layers[0].grads = {"sgd_fc_w": dw}

test_sgd_momentum = SGDM(test_sgd, 1e-3, 0.9)
test_sgd_momentum.velocity = {"sgd_fc_w": v}
test_sgd_momentum.step()

updated_w = test_sgd.layers[0].params["sgd_fc_w"]
velocity = test_sgd_momentum.velocity["sgd_fc_w"]

expected_updated_w = np.asarray([
  [ 0.1406,      0.20738947,  0.27417895,  0.34096842,  0.40775789],
  [ 0.47454737,  0.54133684,  0.60812632,  0.67491579,  0.74170526],
  [ 0.80849474,  0.87528421,  0.94207368,  1.00886316,  1.07565263],
  [ 1.14244211,  1.20923158,  1.27602105,  1.34281053,  1.4096    ]])
expected_velocity = np.asarray([
  [ 0.5406,      0.55475789,  0.56891579, 0.58307368,  0.59723158],
  [ 0.61138947,  0.62554737,  0.63970526,  0.65386316,  0.66802105],
  [ 0.68217895,  0.69633684,  0.71049474,  0.72465263,  0.73881053],
  [ 0.75296842,  0.76712632,  0.78128421,  0.79544211,  0.8096    ]])

print ('The following errors should be around or less than 1e-8')
print ('updated_w error: ', rel_error(updated_w, expected_updated_w))
print ('velocity error: ', rel_error(expected_velocity, velocity))
  • The following errors should be around or less than 1e-8
    updated_w error:  8.882347033505819e-09
    velocity error:  4.269287743278663e-09
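
For reference, rel_error above is assumed to compute the maximum elementwise relative error between two arrays, a common definition for this kind of check (the sketch below is an assumption, not necessarily the exact code in lib/grad_check.py):

import numpy as np

def rel_error_sketch(x, y):
    """Max elementwise relative error between two arrays."""
    x, y = np.asarray(x), np.asarray(y)
    return np.max(np.abs(x - y) / np.maximum(1e-8, np.abs(x) + np.abs(y)))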
    

Comparing SGD and SGD with Momentum

The network trained with the SGDM optimizer should converge faster than the one trained with vanilla SGD.

In [21]:
seed = 123
np.random.seed(seed=seed)

# Arrange a small training subset
num_train = 4000
small_data_dict = {
    "data_train": (data["data_train"][:num_train], data["labels_train"][:num_train]),
    "data_val": (data["data_val"], data["labels_val"]),
    "data_test": (data["data_test"], data["labels_test"])
}

model_sgd      = FullyConnectedNetwork()
model_sgdm     = FullyConnectedNetwork()
loss_f_sgd     = cross_entropy()
loss_f_sgdm    = cross_entropy()
optimizer_sgd  = SGD(model_sgd.net, 1e-2)
optimizer_sgdm = SGDM(model_sgdm.net, 1e-2, 0.9)

print ("Training with Vanilla SGD...")
results_sgd = train_net(small_data_dict, model_sgd, loss_f_sgd, optimizer_sgd, batch_size=100, 
                        max_epochs=5, show_every=100, verbose=True)

print ("\nTraining with SGD plus Momentum...")
results_sgdm = train_net(small_data_dict, model_sgdm, loss_f_sgdm, optimizer_sgdm, batch_size=100, 
                         max_epochs=5, show_every=100, verbose=True)

opt_params_sgd,  loss_hist_sgd,  train_acc_hist_sgd,  val_acc_hist_sgd  = results_sgd
opt_params_sgdm, loss_hist_sgdm, train_acc_hist_sgdm, val_acc_hist_sgdm = results_sgdm

plt.subplot(3, 1, 1)
plt.title('Training loss')
plt.xlabel('Iteration')

plt.subplot(3, 1, 2)
plt.title('Training accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 3)
plt.title('Validation accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgd, 'o', label="Vanilla SGD")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgd, '-o', label="Vanilla SGD")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgd, '-o', label="Vanilla SGD")
         
plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgdm, 'o', label="SGD with Momentum")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgdm, '-o', label="SGD with Momentum")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgdm, '-o', label="SGD with Momentum")
  
for i in [1, 2, 3]:
  plt.subplot(3, 1, i)
  plt.legend(loc='upper center', ncol=4)
plt.gcf().set_size_inches(15, 15)
plt.show()
  • Training with Vanilla SGD...
    iterations per epoch: 40
    max iterations:  200
    (Iteration 1 / 200) loss: 2.784078790273408
    (Epoch 1 / 5) Training Accuracy: 0.287, Validation Accuracy: 0.262
    (Epoch 2 / 5) Training Accuracy: 0.339, Validation Accuracy: 0.298
    (Iteration 101 / 200) loss: 1.888878603563259
    (Epoch 3 / 5) Training Accuracy: 0.34, Validation Accuracy: 0.296
    (Epoch 4 / 5) Training Accuracy: 0.39475, Validation Accuracy: 0.329
    (Epoch 5 / 5) Training Accuracy: 0.416, Validation Accuracy: 0.315
    
    Training with SGD plus Momentum...
    iterations per epoch: 40
    max iterations:  200
    (Iteration 1 / 200) loss: 2.4144157882454107
    (Epoch 1 / 5) Training Accuracy: 0.32475, Validation Accuracy: 0.296
    (Epoch 2 / 5) Training Accuracy: 0.39025, Validation Accuracy: 0.328
    (Iteration 101 / 200) loss: 1.6831844456130503
    (Epoch 3 / 5) Training Accuracy: 0.44625, Validation Accuracy: 0.339
    (Epoch 4 / 5) Training Accuracy: 0.48775, Validation Accuracy: 0.336
    (Epoch 5 / 5) Training Accuracy: 0.49175, Validation Accuracy: 0.344
    

RMSProp

RMSProp() optimizer in lib/optim.py

The update rule of RMSProp is as shown below:
\begin{equation} \gamma: decay\ rate \\ \epsilon: small\ number \\ g_t^2: squared\ gradients \\ \eta: learning\ rate \\ E[g^2]_t: decaying\ average\ of\ past\ squared\ gradients\ at\ update\ step\ t \\ E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2 \\ \theta_{t+1} = \theta_t - \frac{\eta \nabla_{\theta}J(\theta)}{\sqrt{E[g^2]_t+\epsilon}} \end{equation}
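
A minimal NumPy sketch of this update for a single parameter array (illustrative only; the RMSProp optimizer in lib/optim.py keeps one cache of squared gradients per parameter):

import numpy as np

def rmsprop_step(w, dw, cache, lr=1e-2, decay=0.99, eps=1e-8):
    """One RMSProp update on a single parameter array."""
    cache = decay * cache + (1 - decay) * dw ** 2     # E[g^2]_t
    w = w - lr * dw / np.sqrt(cache + eps)            # scaled gradient step
    return w, cache

# Tiny check against the first entry of the unit test below
w, cache = rmsprop_step(np.array(-0.4), np.array(-0.6), np.array(0.6))
print(w, cache)   # ~-0.3922, ~0.5976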

In [22]:
%reload_ext autoreload

seed = 123
np.random.seed(seed=seed)

# Test RMSProp implementation; you should see errors less than 1e-7
N, D = 4, 5
test_rms = sequential(fc(N, D, name="rms_fc"))

w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)

test_rms.layers[0].params = {"rms_fc_w": w}
test_rms.layers[0].grads = {"rms_fc_w": dw}

opt_rms = RMSProp(test_rms, 1e-2, 0.99)
opt_rms.cache = {"rms_fc_w": cache}
opt_rms.step()

updated_w = test_rms.layers[0].params["rms_fc_w"]
cache = opt_rms.cache["rms_fc_w"]

expected_updated_w = np.asarray([
  [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],
  [-0.132737,   -0.08078555, -0.02881884,  0.02316247,  0.07515774],
  [ 0.12716641,  0.17918792,  0.23122175,  0.28326742,  0.33532447],
  [ 0.38739248,  0.43947102,  0.49155973,  0.54365823,  0.59576619]])
expected_cache = np.asarray([
  [ 0.5976,      0.6126277,   0.6277108,   0.64284931,  0.65804321],
  [ 0.67329252,  0.68859723,  0.70395734,  0.71937285,  0.73484377],
  [ 0.75037008,  0.7659518,   0.78158892,  0.79728144,  0.81302936],
  [ 0.82883269,  0.84469141,  0.86060554,  0.87657507,  0.8926    ]])

print ('The following errors should be around or less than 1e-7')
print ('updated_w error: ', rel_error(expected_updated_w, updated_w))
print ('cache error: ', rel_error(expected_cache, opt_rms.cache["rms_fc_w"]))
  • The following errors should be around or less than 1e-7
    updated_w error:  9.524687511038133e-08
    cache error:  2.6477955807156126e-09
    

Adam

Adam() optimizer in lib/optim.py

The update rule of Adam is as shown below:
\begin{equation} t = t + 1 \\ g_t: gradients\ at\ update\ step\ t \\ m_t = \beta_1m_{t-1} + (1-\beta_1)g_t \\ v_t = \beta_2v_{t-1} + (1-\beta_2)g_t^2 \\ \hat{m_t} = m_t / (1 - \beta_1^t) \\ \hat{v_t} = v_t / (1 - \beta_2^t) \\ \theta_{t+1} = \theta_t - \frac{\eta\ \hat{m_t}}{\sqrt{\hat{v_t}}+\epsilon} \\ \end{equation}
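
A minimal NumPy sketch of this update for a single parameter array (illustrative only; note that t is incremented before the moment estimates are bias-corrected, as in the rule above):

import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a single parameter array; t is the already-incremented step count."""
    m = beta1 * m + (1 - beta1) * dw          # first-moment estimate
    v = beta2 * v + (1 - beta2) * dw ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Tiny check against the first entry of the unit test below (t=5 is incremented to 6)
w, m, v = adam_step(np.array(-0.4), np.array(-0.6), np.array(0.6), np.array(0.7), t=6)
print(w, m, v)   # ~-0.40095, ~0.48, ~0.69966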

In [23]:
%reload_ext autoreload

seed = 123
np.random.seed(seed=seed)

# Test Adam implementation; you should see errors around 1e-7 or less
N, D = 4, 5
test_adam = sequential(fc(N, D, name="adam_fc"))

w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)
dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)
m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)
v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)

test_adam.layers[0].params = {"adam_fc_w": w}
test_adam.layers[0].grads = {"adam_fc_w": dw}

opt_adam = Adam(test_adam, 1e-2, 0.9, 0.999, t=5)
opt_adam.mt = {"adam_fc_w": m}
opt_adam.vt = {"adam_fc_w": v}
opt_adam.step()

updated_w = test_adam.layers[0].params["adam_fc_w"]
mt = opt_adam.mt["adam_fc_w"]
vt = opt_adam.vt["adam_fc_w"]

expected_updated_w = np.asarray([
  [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],
  [-0.1380274,  -0.08544591, -0.03286534,  0.01971428,  0.0722929],
  [ 0.1248705,   0.17744702,  0.23002243,  0.28259667,  0.33516969],
  [ 0.38774145,  0.44031188,  0.49288093,  0.54544852,  0.59801459]])
expected_v = np.asarray([
  [ 0.69966,     0.68908382,  0.67851319,  0.66794809,  0.65738853,],
  [ 0.64683452,  0.63628604,  0.6257431,   0.61520571,  0.60467385,],
  [ 0.59414753,  0.58362676,  0.57311152,  0.56260183,  0.55209767,],
  [ 0.54159906,  0.53110598,  0.52061845,  0.51013645,  0.49966,   ]])
expected_m = np.asarray([
  [ 0.48,        0.49947368,  0.51894737,  0.53842105,  0.55789474],
  [ 0.57736842,  0.59684211,  0.61631579,  0.63578947,  0.65526316],
  [ 0.67473684,  0.69421053,  0.71368421,  0.73315789,  0.75263158],
  [ 0.77210526,  0.79157895,  0.81105263,  0.83052632,  0.85      ]])

print ('The following errors should be around or less than 1e-7')
print ('updated_w error: ', rel_error(expected_updated_w, updated_w))
print ('mt error: ', rel_error(expected_m, mt))
print ('vt error: ', rel_error(expected_v, vt))
  • The following errors should be around or less than 1e-7
    updated_w error:  1.1395691798535431e-07
    mt error:  4.214963193114416e-09
    vt error:  4.208314038113071e-09
    

Comparing the optimizers

The SGD with momentum, RMSProp, and Adam optimizers should all outperform the vanilla SGD optimizer.

In [24]:
seed = 123
np.random.seed(seed=seed)

model_rms      = FullyConnectedNetwork()
model_adam     = FullyConnectedNetwork()
loss_f_rms     = cross_entropy()
loss_f_adam    = cross_entropy()
optimizer_rms  = RMSProp(model_rms.net, 5e-4)
optimizer_adam = Adam(model_adam.net, 5e-4)

print ("Training with RMSProp...")
results_rms = train_net(small_data_dict, model_rms, loss_f_rms, optimizer_rms, batch_size=100, 
                        max_epochs=5, show_every=100, verbose=True)

print ("\nTraining with Adam...")
results_adam = train_net(small_data_dict, model_adam, loss_f_adam, optimizer_adam, batch_size=100, 
                         max_epochs=5, show_every=100, verbose=True)

opt_params_rms,  loss_hist_rms,  train_acc_hist_rms,  val_acc_hist_rms  = results_rms
opt_params_adam, loss_hist_adam, train_acc_hist_adam, val_acc_hist_adam = results_adam

plt.subplot(3, 1, 1)
plt.title('Training loss')
plt.xlabel('Iteration')

plt.subplot(3, 1, 2)
plt.title('Training accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 3)
plt.title('Validation accuracy')
plt.xlabel('Epoch')

plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgd, 'o', label="Vanilla SGD")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgd, '-o', label="Vanilla SGD")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgd, '-o', label="Vanilla SGD")
         
plt.subplot(3, 1, 1)
plt.plot(loss_hist_sgdm, 'o', label="SGD with Momentum")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_sgdm, '-o', label="SGD with Momentum")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_sgdm, '-o', label="SGD with Momentum")

plt.subplot(3, 1, 1)
plt.plot(loss_hist_rms, 'o', label="RMSProp")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_rms, '-o', label="RMSProp")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_rms, '-o', label="RMSProp")
         
plt.subplot(3, 1, 1)
plt.plot(loss_hist_adam, 'o', label="Adam")
plt.subplot(3, 1, 2)
plt.plot(train_acc_hist_adam, '-o', label="Adam")
plt.subplot(3, 1, 3)
plt.plot(val_acc_hist_adam, '-o', label="Adam")
  
for i in [1, 2, 3]:
  plt.subplot(3, 1, i)
  plt.legend(loc='upper center', ncol=4)
plt.gcf().set_size_inches(15, 15)
plt.show()
  • Training with RMSProp...
    iterations per epoch: 40
    max iterations:  200
    (Iteration 1 / 200) loss: 2.784078790273408
    (Epoch 1 / 5) Training Accuracy: 0.378, Validation Accuracy: 0.33
    (Epoch 2 / 5) Training Accuracy: 0.4335, Validation Accuracy: 0.358
    (Iteration 101 / 200) loss: 1.7085285722816903
    (Epoch 3 / 5) Training Accuracy: 0.497, Validation Accuracy: 0.361
    (Epoch 4 / 5) Training Accuracy: 0.5475, Validation Accuracy: 0.387
    (Epoch 5 / 5) Training Accuracy: 0.5765, Validation Accuracy: 0.352
    
    Training with Adam...
    iterations per epoch: 40
    max iterations:  200
    (Iteration 1 / 200) loss: 2.4144157882454107
    (Epoch 1 / 5) Training Accuracy: 0.34225, Validation Accuracy: 0.301
    (Epoch 2 / 5) Training Accuracy: 0.41575, Validation Accuracy: 0.372
    (Iteration 101 / 200) loss: 1.7155403213221103
    (Epoch 3 / 5) Training Accuracy: 0.44925, Validation Accuracy: 0.371
    (Epoch 4 / 5) Training Accuracy: 0.4855, Validation Accuracy: 0.371
    (Epoch 5 / 5) Training Accuracy: 0.57625, Validation Accuracy: 0.379
    

Training a Network with Dropout

Train identical networks with different dropout keep probabilities and compare the results.

In [25]:
# Train identical nets with different dropout keep probabilities
num_train = 100
data_dict_100 = {
    "data_train": (data["data_train"][:num_train], data["labels_train"][:num_train]),
    "data_val": (data["data_val"], data["labels_val"]),
    "data_test": (data["data_test"], data["labels_test"])
}

solvers = {}
keep_ps = [0, 0.25, 0.50, 0.75]

results_dict = {}
for keep_prob in keep_ps:
    results_dict[keep_prob] = {}

for keep_prob in keep_ps:
    seed = 123
    np.random.seed(seed=seed)

    print ("Dropout Keep Prob =", keep_prob)
    model = DropoutNetTest(keep_prob=keep_prob)
    loss_f = cross_entropy()
    optimizer = SGD(model.net, 1e-4)
    results = train_net(data_dict_100, model, loss_f, optimizer, batch_size=20, 
                        max_epochs=50, show_every=1000, verbose=True)
    opt_params, loss_hist, train_acc_hist, val_acc_hist = results
    results_dict[keep_prob] = {
        "opt_params": opt_params, 
        "loss_hist": loss_hist, 
        "train_acc_hist": train_acc_hist, 
        "val_acc_hist": val_acc_hist
    }
  • Dropout Keep Prob = 0
    iterations per epoch: 5
    max iterations:  250
    (Iteration 1 / 250) loss: 2.8714007645507165
    (Epoch 1 / 50) Training Accuracy: 0.1, Validation Accuracy: 0.089
    (Epoch 2 / 50) Training Accuracy: 0.16, Validation Accuracy: 0.105
    (Epoch 3 / 50) Training Accuracy: 0.17, Validation Accuracy: 0.117
    (Epoch 4 / 50) Training Accuracy: 0.21, Validation Accuracy: 0.126
    (Epoch 5 / 50) Training Accuracy: 0.24, Validation Accuracy: 0.127
    (Epoch 6 / 50) Training Accuracy: 0.28, Validation Accuracy: 0.135
    (Epoch 7 / 50) Training Accuracy: 0.34, Validation Accuracy: 0.134
    (Epoch 8 / 50) Training Accuracy: 0.36, Validation Accuracy: 0.138
    (Epoch 9 / 50) Training Accuracy: 0.4, Validation Accuracy: 0.143
    (Epoch 10 / 50) Training Accuracy: 0.42, Validation Accuracy: 0.146
    (Epoch 11 / 50) Training Accuracy: 0.49, Validation Accuracy: 0.147
    (Epoch 12 / 50) Training Accuracy: 0.52, Validation Accuracy: 0.153
    (Epoch 13 / 50) Training Accuracy: 0.52, Validation Accuracy: 0.152
    (Epoch 14 / 50) Training Accuracy: 0.6, Validation Accuracy: 0.159
    (Epoch 15 / 50) Training Accuracy: 0.61, Validation Accuracy: 0.165
    (Epoch 16 / 50) Training Accuracy: 0.66, Validation Accuracy: 0.163
    (Epoch 17 / 50) Training Accuracy: 0.66, Validation Accuracy: 0.169
    (Epoch 18 / 50) Training Accuracy: 0.68, Validation Accuracy: 0.173
    (Epoch 19 / 50) Training Accuracy: 0.68, Validation Accuracy: 0.172
    (Epoch 20 / 50) Training Accuracy: 0.71, Validation Accuracy: 0.171
    (Epoch 21 / 50) Training Accuracy: 0.74, Validation Accuracy: 0.17
    (Epoch 22 / 50) Training Accuracy: 0.76, Validation Accuracy: 0.17
    (Epoch 23 / 50) Training Accuracy: 0.78, Validation Accuracy: 0.17
    (Epoch 24 / 50) Training Accuracy: 0.78, Validation Accuracy: 0.169
    (Epoch 25 / 50) Training Accuracy: 0.79, Validation Accuracy: 0.173
    (Epoch 26 / 50) Training Accuracy: 0.81, Validation Accuracy: 0.175
    (Epoch 27 / 50) Training Accuracy: 0.85, Validation Accuracy: 0.174
    (Epoch 28 / 50) Training Accuracy: 0.87, Validation Accuracy: 0.177
    (Epoch 29 / 50) Training Accuracy: 0.88, Validation Accuracy: 0.178
    (Epoch 30 / 50) Training Accuracy: 0.88, Validation Accuracy: 0.178
    (Epoch 31 / 50) Training Accuracy: 0.88, Validation Accuracy: 0.18
    (Epoch 32 / 50) Training Accuracy: 0.89, Validation Accuracy: 0.182
    (Epoch 33 / 50) Training Accuracy: 0.9, Validation Accuracy: 0.184
    (Epoch 34 / 50) Training Accuracy: 0.92, Validation Accuracy: 0.185
    (Epoch 35 / 50) Training Accuracy: 0.93, Validation Accuracy: 0.184
    (Epoch 36 / 50) Training Accuracy: 0.93, Validation Accuracy: 0.187
    (Epoch 37 / 50) Training Accuracy: 0.94, Validation Accuracy: 0.187
    (Epoch 38 / 50) Training Accuracy: 0.94, Validation Accuracy: 0.186
    (Epoch 39 / 50) Training Accuracy: 0.94, Validation Accuracy: 0.185
    (Epoch 40 / 50) Training Accuracy: 0.94, Validation Accuracy: 0.186
    (Epoch 41 / 50) Training Accuracy: 0.95, Validation Accuracy: 0.19
    (Epoch 42 / 50) Training Accuracy: 0.95, Validation Accuracy: 0.189
    (Epoch 43 / 50) Training Accuracy: 0.95, Validation Accuracy: 0.191
    (Epoch 44 / 50) Training Accuracy: 0.95, Validation Accuracy: 0.19
    (Epoch 45 / 50) Training Accuracy: 0.95, Validation Accuracy: 0.19
    (Epoch 46 / 50) Training Accuracy: 0.96, Validation Accuracy: 0.193
    (Epoch 47 / 50) Training Accuracy: 0.96, Validation Accuracy: 0.193
    (Epoch 48 / 50) Training Accuracy: 0.96, Validation Accuracy: 0.192
    (Epoch 49 / 50) Training Accuracy: 0.96, Validation Accuracy: 0.192
    (Epoch 50 / 50) Training Accuracy: 0.97, Validation Accuracy: 0.193
    Dropout Keep Prob = 0.25
    iterations per epoch: 5
    max iterations:  250
    (Iteration 1 / 250) loss: 3.173109720249681
    (Epoch 1 / 50) Training Accuracy: 0.13, Validation Accuracy: 0.095
    (Epoch 2 / 50) Training Accuracy: 0.14, Validation Accuracy: 0.106
    (Epoch 3 / 50) Training Accuracy: 0.16, Validation Accuracy: 0.106
    (Epoch 4 / 50) Training Accuracy: 0.15, Validation Accuracy: 0.121
    (Epoch 5 / 50) Training Accuracy: 0.17, Validation Accuracy: 0.127
    (Epoch 6 / 50) Training Accuracy: 0.2, Validation Accuracy: 0.131
    (Epoch 7 / 50) Training Accuracy: 0.22, Validation Accuracy: 0.142
    (Epoch 8 / 50) Training Accuracy: 0.26, Validation Accuracy: 0.14
    (Epoch 9 / 50) Training Accuracy: 0.28, Validation Accuracy: 0.143
    (Epoch 10 / 50) Training Accuracy: 0.32, Validation Accuracy: 0.15
    (Epoch 11 / 50) Training Accuracy: 0.36, Validation Accuracy: 0.154
    (Epoch 12 / 50) Training Accuracy: 0.39, Validation Accuracy: 0.157
    (Epoch 13 / 50) Training Accuracy: 0.4, Validation Accuracy: 0.163
    (Epoch 14 / 50) Training Accuracy: 0.4, Validation Accuracy: 0.165
    (Epoch 15 / 50) Training Accuracy: 0.41, Validation Accuracy: 0.166
    (Epoch 16 / 50) Training Accuracy: 0.45, Validation Accuracy: 0.168
    (Epoch 17 / 50) Training Accuracy: 0.48, Validation Accuracy: 0.173
    (Epoch 18 / 50) Training Accuracy: 0.5, Validation Accuracy: 0.173
    (Epoch 19 / 50) Training Accuracy: 0.49, Validation Accuracy: 0.176
    (Epoch 20 / 50) Training Accuracy: 0.49, Validation Accuracy: 0.18
    (Epoch 21 / 50) Training Accuracy: 0.54, Validation Accuracy: 0.188
    (Epoch 22 / 50) Training Accuracy: 0.57, Validation Accuracy: 0.184
    (Epoch 23 / 50) Training Accuracy: 0.56, Validation Accuracy: 0.189
    (Epoch 24 / 50) Training Accuracy: 0.59, Validation Accuracy: 0.191
    (Epoch 25 / 50) Training Accuracy: 0.59, Validation Accuracy: 0.187
    (Epoch 26 / 50) Training Accuracy: 0.63, Validation Accuracy: 0.192
    (Epoch 27 / 50) Training Accuracy: 0.64, Validation Accuracy: 0.19
    (Epoch 28 / 50) Training Accuracy: 0.64, Validation Accuracy: 0.189
    (Epoch 29 / 50) Training Accuracy: 0.65, Validation Accuracy: 0.186
    (Epoch 30 / 50) Training Accuracy: 0.66, Validation Accuracy: 0.19
    (Epoch 31 / 50) Training Accuracy: 0.68, Validation Accuracy: 0.194
    (Epoch 32 / 50) Training Accuracy: 0.71, Validation Accuracy: 0.195
    (Epoch 33 / 50) Training Accuracy: 0.71, Validation Accuracy: 0.193
    (Epoch 34 / 50) Training Accuracy: 0.7, Validation Accuracy: 0.195
    (Epoch 35 / 50) Training Accuracy: 0.71, Validation Accuracy: 0.203
    (Epoch 36 / 50) Training Accuracy: 0.72, Validation Accuracy: 0.201
    (Epoch 37 / 50) Training Accuracy: 0.71, Validation Accuracy: 0.198
    (Epoch 38 / 50) Training Accuracy: 0.73, Validation Accuracy: 0.201
    (Epoch 39 / 50) Training Accuracy: 0.76, Validation Accuracy: 0.204
    (Epoch 40 / 50) Training Accuracy: 0.75, Validation Accuracy: 0.197
    (Epoch 41 / 50) Training Accuracy: 0.74, Validation Accuracy: 0.198
    (Epoch 42 / 50) Training Accuracy: 0.76, Validation Accuracy: 0.203
    (Epoch 43 / 50) Training Accuracy: 0.77, Validation Accuracy: 0.201
    (Epoch 44 / 50) Training Accuracy: 0.77, Validation Accuracy: 0.206
    (Epoch 45 / 50) Training Accuracy: 0.78, Validation Accuracy: 0.204
    (Epoch 46 / 50) Training Accuracy: 0.78, Validation Accuracy: 0.2
    (Epoch 47 / 50) Training Accuracy: 0.79, Validation Accuracy: 0.198
    (Epoch 48 / 50) Training Accuracy: 0.8, Validation Accuracy: 0.197
    (Epoch 49 / 50) Training Accuracy: 0.8, Validation Accuracy: 0.197
    (Epoch 50 / 50) Training Accuracy: 0.83, Validation Accuracy: 0.204
    Dropout Keep Prob = 0.5
    iterations per epoch: 5
    max iterations:  250
    (Iteration 1 / 250) loss: 3.53050836884461
    (Epoch 1 / 50) Training Accuracy: 0.11, Validation Accuracy: 0.089
    (Epoch 2 / 50) Training Accuracy: 0.16, Validation Accuracy: 0.106
    (Epoch 3 / 50) Training Accuracy: 0.17, Validation Accuracy: 0.112
    (Epoch 4 / 50) Training Accuracy: 0.17, Validation Accuracy: 0.127
    (Epoch 5 / 50) Training Accuracy: 0.21, Validation Accuracy: 0.126
    (Epoch 6 / 50) Training Accuracy: 0.24, Validation Accuracy: 0.136
    (Epoch 7 / 50) Training Accuracy: 0.26, Validation Accuracy: 0.14
    (Epoch 8 / 50) Training Accuracy: 0.31, Validation Accuracy: 0.143
    (Epoch 9 / 50) Training Accuracy: 0.35, Validation Accuracy: 0.153
    (Epoch 10 / 50) Training Accuracy: 0.38, Validation Accuracy: 0.153
    (Epoch 11 / 50) Training Accuracy: 0.38, Validation Accuracy: 0.153
    (Epoch 12 / 50) Training Accuracy: 0.41, Validation Accuracy: 0.154
    (Epoch 13 / 50) Training Accuracy: 0.44, Validation Accuracy: 0.156
    (Epoch 14 / 50) Training Accuracy: 0.43, Validation Accuracy: 0.15
    (Epoch 15 / 50) Training Accuracy: 0.46, Validation Accuracy: 0.159
    (Epoch 16 / 50) Training Accuracy: 0.51, Validation Accuracy: 0.16
    (Epoch 17 / 50) Training Accuracy: 0.52, Validation Accuracy: 0.165
    (Epoch 18 / 50) Training Accuracy: 0.55, Validation Accuracy: 0.175
    (Epoch 19 / 50) Training Accuracy: 0.56, Validation Accuracy: 0.177
    (Epoch 20 / 50) Training Accuracy: 0.6, Validation Accuracy: 0.173
    (Epoch 21 / 50) Training Accuracy: 0.62, Validation Accuracy: 0.178
    (Epoch 22 / 50) Training Accuracy: 0.62, Validation Accuracy: 0.177
    (Epoch 23 / 50) Training Accuracy: 0.61, Validation Accuracy: 0.178
    (Epoch 24 / 50) Training Accuracy: 0.63, Validation Accuracy: 0.179
    (Epoch 25 / 50) Training Accuracy: 0.65, Validation Accuracy: 0.182
    (Epoch 26 / 50) Training Accuracy: 0.67, Validation Accuracy: 0.181
    (Epoch 27 / 50) Training Accuracy: 0.72, Validation Accuracy: 0.188
    (Epoch 28 / 50) Training Accuracy: 0.72, Validation Accuracy: 0.19
    (Epoch 29 / 50) Training Accuracy: 0.72, Validation Accuracy: 0.187
    (Epoch 30 / 50) Training Accuracy: 0.75, Validation Accuracy: 0.191
    (Epoch 31 / 50) Training Accuracy: 0.76, Validation Accuracy: 0.198
    (Epoch 32 / 50) Training Accuracy: 0.77, Validation Accuracy: 0.194
    (Epoch 33 / 50) Training Accuracy: 0.77, Validation Accuracy: 0.195
    (Epoch 34 / 50) Training Accuracy: 0.78, Validation Accuracy: 0.199
    (Epoch 35 / 50) Training Accuracy: 0.8, Validation Accuracy: 0.199
    (Epoch 36 / 50) Training Accuracy: 0.81, Validation Accuracy: 0.195
    (Epoch 37 / 50) Training Accuracy: 0.82, Validation Accuracy: 0.197
    (Epoch 38 / 50) Training Accuracy: 0.82, Validation Accuracy: 0.194
    (Epoch 39 / 50) Training Accuracy: 0.84, Validation Accuracy: 0.195
    (Epoch 40 / 50) Training Accuracy: 0.84, Validation Accuracy: 0.196
    (Epoch 41 / 50) Training Accuracy: 0.85, Validation Accuracy: 0.191
    (Epoch 42 / 50) Training Accuracy: 0.85, Validation Accuracy: 0.192
    (Epoch 43 / 50) Training Accuracy: 0.87, Validation Accuracy: 0.196
    (Epoch 44 / 50) Training Accuracy: 0.88, Validation Accuracy: 0.194
    (Epoch 45 / 50) Training Accuracy: 0.89, Validation Accuracy: 0.198
    (Epoch 46 / 50) Training Accuracy: 0.9, Validation Accuracy: 0.197
    (Epoch 47 / 50) Training Accuracy: 0.9, Validation Accuracy: 0.2
    (Epoch 48 / 50) Training Accuracy: 0.9, Validation Accuracy: 0.198
    (Epoch 49 / 50) Training Accuracy: 0.91, Validation Accuracy: 0.198
    (Epoch 50 / 50) Training Accuracy: 0.91, Validation Accuracy: 0.198
    Dropout Keep Prob = 0.75
    iterations per epoch: 5
    max iterations:  250
    (Iteration 1 / 250) loss: 2.839718780769512
    (Epoch 1 / 50) Training Accuracy: 0.1, Validation Accuracy: 0.089
    (Epoch 2 / 50) Training Accuracy: 0.16, Validation Accuracy: 0.108
    (Epoch 3 / 50) Training Accuracy: 0.16, Validation Accuracy: 0.115
    (Epoch 4 / 50) Training Accuracy: 0.18, Validation Accuracy: 0.129
    (Epoch 5 / 50) Training Accuracy: 0.21, Validation Accuracy: 0.128
    (Epoch 6 / 50) Training Accuracy: 0.26, Validation Accuracy: 0.138
    (Epoch 7 / 50) Training Accuracy: 0.28, Validation Accuracy: 0.131
    (Epoch 8 / 50) Training Accuracy: 0.31, Validation Accuracy: 0.143
    (Epoch 9 / 50) Training Accuracy: 0.34, Validation Accuracy: 0.143
    (Epoch 10 / 50) Training Accuracy: 0.38, Validation Accuracy: 0.15
    (Epoch 11 / 50) Training Accuracy: 0.4, Validation Accuracy: 0.155
    (Epoch 12 / 50) Training Accuracy: 0.43, Validation Accuracy: 0.157
    (Epoch 13 / 50) Training Accuracy: 0.45, Validation Accuracy: 0.16
    (Epoch 14 / 50) Training Accuracy: 0.5, Validation Accuracy: 0.165
    (Epoch 15 / 50) Training Accuracy: 0.53, Validation Accuracy: 0.163
    (Epoch 16 / 50) Training Accuracy: 0.55, Validation Accuracy: 0.168
    (Epoch 17 / 50) Training Accuracy: 0.57, Validation Accuracy: 0.169
    (Epoch 18 / 50) Training Accuracy: 0.6, Validation Accuracy: 0.173
    (Epoch 19 / 50) Training Accuracy: 0.64, Validation Accuracy: 0.176
    (Epoch 20 / 50) Training Accuracy: 0.65, Validation Accuracy: 0.178
    (Epoch 21 / 50) Training Accuracy: 0.66, Validation Accuracy: 0.176
    (Epoch 22 / 50) Training Accuracy: 0.67, Validation Accuracy: 0.176
    (Epoch 23 / 50) Training Accuracy: 0.69, Validation Accuracy: 0.174
    (Epoch 24 / 50) Training Accuracy: 0.72, Validation Accuracy: 0.175
    (Epoch 25 / 50) Training Accuracy: 0.74, Validation Accuracy: 0.174
    (Epoch 26 / 50) Training Accuracy: 0.75, Validation Accuracy: 0.176
    (Epoch 27 / 50) Training Accuracy: 0.77, Validation Accuracy: 0.178
    (Epoch 28 / 50) Training Accuracy: 0.77, Validation Accuracy: 0.178
    (Epoch 29 / 50) Training Accuracy: 0.79, Validation Accuracy: 0.181
    (Epoch 30 / 50) Training Accuracy: 0.81, Validation Accuracy: 0.187
    (Epoch 31 / 50) Training Accuracy: 0.82, Validation Accuracy: 0.193
    (Epoch 32 / 50) Training Accuracy: 0.82, Validation Accuracy: 0.196
    (Epoch 33 / 50) Training Accuracy: 0.85, Validation Accuracy: 0.198
    (Epoch 34 / 50) Training Accuracy: 0.86, Validation Accuracy: 0.197
    (Epoch 35 / 50) Training Accuracy: 0.86, Validation Accuracy: 0.197
    (Epoch 36 / 50) Training Accuracy: 0.87, Validation Accuracy: 0.195
    (Epoch 37 / 50) Training Accuracy: 0.89, Validation Accuracy: 0.198
    (Epoch 38 / 50) Training Accuracy: 0.9, Validation Accuracy: 0.195
    (Epoch 39 / 50) Training Accuracy: 0.9, Validation Accuracy: 0.196
    (Epoch 40 / 50) Training Accuracy: 0.92, Validation Accuracy: 0.197
    (Epoch 41 / 50) Training Accuracy: 0.91, Validation Accuracy: 0.195
    (Epoch 42 / 50) Training Accuracy: 0.92, Validation Accuracy: 0.194
    (Epoch 43 / 50) Training Accuracy: 0.92, Validation Accuracy: 0.197
    (Epoch 44 / 50) Training Accuracy: 0.92, Validation Accuracy: 0.193
    (Epoch 45 / 50) Training Accuracy: 0.94, Validation Accuracy: 0.197
    (Epoch 46 / 50) Training Accuracy: 0.94, Validation Accuracy: 0.201
    (Epoch 47 / 50) Training Accuracy: 0.94, Validation Accuracy: 0.195
    (Epoch 48 / 50) Training Accuracy: 0.95, Validation Accuracy: 0.192
    (Epoch 49 / 50) Training Accuracy: 0.95, Validation Accuracy: 0.193
    (Epoch 50 / 50) Training Accuracy: 0.96, Validation Accuracy: 0.193
    
In [26]:
# Plot train and validation accuracies of each model
train_accs = []
val_accs = []
for keep_prob in keep_ps:
    curr_dict = results_dict[keep_prob]
    train_accs.append(curr_dict["train_acc_hist"][-1])
    val_accs.append(curr_dict["val_acc_hist"][-1])

plt.subplot(3, 1, 1)
for keep_prob in keep_ps:
    curr_dict = results_dict[keep_prob]
    plt.plot(curr_dict["train_acc_hist"], 'o', label='%.2f dropout' % keep_prob)
plt.title('Train accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')
  
plt.subplot(3, 1, 2)
for keep_prob in keep_ps:
    curr_dict = results_dict[keep_prob]
    plt.plot(curr_dict["val_acc_hist"], 'o', label='%.2f dropout' % keep_prob)
plt.title('Val accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(ncol=2, loc='lower right')

plt.gcf().set_size_inches(15, 15)
plt.show()

Observations on Dropout

During training, after the first few epochs (the models perform similarly before epoch 3), a higher keep probability (i.e., a lower dropout rate) yields higher training accuracy, with the no-dropout model achieving the highest training accuracy throughout. This is expected: as the number of epochs grows, even the models with dropout fit more and more of the training set, so the gap between models shrinks while training accuracy rises. Given enough epochs, all of the models would be expected to converge toward near-perfect training accuracy, i.e., to overfit.

On the unseen validation data, dropout improves generalization and can therefore improve accuracy. However, the relationship between the keep probability and validation accuracy is not as monotonic or clear-cut as the relationship between overfitting and training accuracy.
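
For context, a standard inverted-dropout layer keeps each unit with probability keep_prob during training and rescales the survivors by 1/keep_prob, so no extra scaling is needed at test time. A minimal sketch of that approach (an assumption for illustration; the actual layer lives in lib/layer_utils.py, and here keep_prob = 0 is treated as "dropout disabled", matching the keep_ps sweep above):

import numpy as np

def dropout_forward_sketch(x, keep_prob, train=True):
    """Inverted dropout: randomly zero units during training, identity at test time."""
    if not train or keep_prob <= 0 or keep_prob >= 1:
        return x  # no dropout applied (keep_prob = 0 is treated as "disabled")
    mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob
    return x * mask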

Plot the Activation Functions

In [27]:
left, right = -10, 10
X  = np.linspace(left, right, 100)
XS = np.linspace(-5, 5, 10)
lw = 4
alpha = 0.1 # alpha for leaky_relu
elu_alpha = 0.5
selu_alpha = 1.6732
selu_scale = 1.0507

### Defining the activation functions ###
sigmoid = lambda x: 1 / (1 + np.exp(-x))
leaky_relu = lambda x: np.maximum(alpha*x,x)
relu = lambda x: np.maximum(x,0)
elu = lambda x: np.array([xi if xi>0 else elu_alpha*(np.exp(xi)-1) for xi in x])
selu = lambda x: selu_scale * ((x > 0)*x + (x <= 0) * (selu_alpha * np.exp(x) - selu_alpha))
tanh = lambda x: (np.exp(x)-np.exp(-x)) / (np.exp(x) + np.exp(-x))
#########################################

activations = {
    "Sigmoid": sigmoid,
    "LeakyReLU": leaky_relu,
    "ReLU": relu,
    "ELU": elu,
    "SeLU": selu,
    "Tanh": tanh
}

# Ground Truth activations
GT_Act = {
    "Sigmoid": [0.00669285092428, 0.0200575365379, 0.0585369028744, 0.158869104881, 0.364576440742, 
                0.635423559258, 0.841130895119, 0.941463097126, 0.979942463462, 0.993307149076],
    "LeakyReLU": [-0.5, -0.388888888889, -0.277777777778, -0.166666666667, -0.0555555555556, 
                  0.555555555556, 1.66666666667, 2.77777777778, 3.88888888889, 5.0],
    "ReLU": [-0.0, -0.0, -0.0, -0.0, -0.0, 0.555555555556, 1.66666666667, 2.77777777778, 3.88888888889, 5.0],
    "ELU": [-0.4966310265, -0.489765962143, -0.468911737989, -0.405562198581, -0.213123289631, 
            0.555555555556, 1.66666666667, 2.77777777778, 3.88888888889, 5.0],
    "SeLU": [-1.74618571868, -1.72204772347, -1.64872296837, -1.42598202974, -0.749354802287, 
             0.583722222222, 1.75116666667, 2.91861111111, 4.08605555556, 5.2535],
    "Tanh": [-0.999909204263, -0.999162466631, -0.992297935288, -0.931109608668, -0.504672397722, 
             0.504672397722, 0.931109608668, 0.992297935288, 0.999162466631, 0.999909204263]
} 

for label in activations:
    fig = plt.figure(figsize=(4,4))
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(X, activations[label](X), color='darkorchid', lw=lw, label=label)
    assert rel_error(activations[label](XS), GT_Act[label]) < 1e-9, \
           "Your implementation of {} might be wrong".format(label)
    ax.legend(loc="lower right")
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title('{}'.format(label), fontsize=14)
    plt.xlabel(r"X")
    plt.ylabel(r"Y")
    plt.show()