In this post we go through the code for a multilayer perceptron in TensorFlow. We will use Aymeric Damien’s implementation. I recommend you have a skim before you read this post. I have included the key portions of the code below.

#### 1. Code

Here are the relevant network parameters and graph input for context (skim this):

```python
# Network Parameters
n_hidden_1 = 256  # 1st layer number of features
n_hidden_2 = 256  # 2nd layer number of features
n_input = 784     # MNIST data input (img shape: 28*28)
n_classes = 10    # MNIST total classes (0-9 digits)

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
```

**Here is the model** (I will explain this below):

```python
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with ReLU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Hidden layer with ReLU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

# Construct model
pred = multilayer_perceptron(x, weights, biases)
```

#### 2. Translating the code: What is a multilayer perceptron?

Let’s draw the model the function multilayer_perceptron represents. I’ll assume that we are handling a batch of 100 training examples:
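To make the shapes concrete, here is a sketch of the same forward pass in plain NumPy with randomly initialised values (my own illustration for shape-tracing, not the author's TensorFlow code):

```python
import numpy as np

# Hypothetical batch of 100 flattened MNIST images (values arbitrary here)
batch = np.random.randn(100, 784)           # shape (100, 784)

# Randomly initialised parameters, mirroring the TensorFlow shapes
W1 = np.random.randn(784, 256)              # 'h1'
b1 = np.random.randn(256)                   # 'b1'
W2 = np.random.randn(256, 256)              # 'h2'
b2 = np.random.randn(256)                   # 'b2'
W_out = np.random.randn(256, 10)            # 'out'
b_out = np.random.randn(10)                 # 'out'

layer_1 = np.maximum(0, batch @ W1 + b1)    # (100, 256) after ReLU
layer_2 = np.maximum(0, layer_1 @ W2 + b2)  # (100, 256) after ReLU
logits = layer_2 @ W_out + b_out            # (100, 10): one score per class

print(layer_1.shape, layer_2.shape, logits.shape)
```

Each row of the final output holds 10 numbers, one per digit class, for one of the 100 examples in the batch.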

It’s ‘multi-layer’ because there is more than one hidden layer. Here’s the code for a hidden layer again so you can check that it corresponds exactly:

```python
# Hidden layer with ReLU activation
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.relu(layer_1)
```

In its entirety, the quoted code above does three things:

- It assigns values to the network parameters: the number of features extracted by each hidden layer (n_hidden_1 and n_hidden_2), the number of input features (n_input) and the number of output classes (n_classes).
- It defines the model in multilayer_perceptron().
- It initialises and stores the weights and biases (W and b) for each of the three layers in the model.

**What is the xW + b part?**

We are giving weights to every feature (in the first layer this is each pixel, in the second layer this is every feature we extracted in the first layer). These weights are a proxy to how important the features are in making up the next set of features. I explain this in more detail in the second section of this post on the classification process.
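As a toy illustration of the weighting (numbers invented for this post, not taken from the original code), each column of W combines all the input features into one extracted feature:

```python
import numpy as np

# One example with 3 input features (e.g. 3 pixel intensities)
x = np.array([[0.5, 0.0, 1.0]])  # shape (1, 3)

# Each column of W holds the weights that build one extracted feature
W = np.array([[ 2.0, -1.0],
              [ 0.0,  3.0],
              [-1.0,  1.0]])     # shape (3, 2)
b = np.array([0.1, -0.2])        # one bias per extracted feature

features = x @ W + b             # shape (1, 2)
print(features)                  # features equal [[0.1, 0.3]]
```

A large weight means that input feature matters a lot for that extracted feature; a weight near zero means it is mostly ignored.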

Often the features in the hidden layers are not easily interpretable by humans.

**What is ReLU activation?**

ReLU stands for Rectified Linear Unit. It takes each value and returns max(0, value): negative values become zero, and positive values pass through unchanged.
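As a quick sketch of the definition (my own illustration, not from the original script):

```python
import numpy as np

def relu(z):
    # Clip negative values to zero; keep positive values unchanged
    return np.maximum(0, z)

out = relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))
print(out)  # the negatives become 0: values are [0, 0, 0, 1.5, 3.0]
```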

The important thing to note is that **it’s non-linear** (as opposed to the xW+b part, which is linear.)

Why do we need to add non-linearities? Because without them, the composition of linear layers is itself linear, so the entire network would collapse to a single linear layer no matter how many layers we stacked.
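We can verify that collapse numerically: two stacked linear layers with no activation in between are exactly equivalent to one linear layer with combined weights (a NumPy sketch, not part of the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((3, 2)), rng.standard_normal(2)

# Two linear layers stacked, no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# ...equal a single linear layer with combined weights
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True
```

Inserting a ReLU between the two layers breaks this equivalence, which is what lets the network learn non-linear functions.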

**Activation** just means output. The linear activation in the last layer of this model means ‘return the output without doing anything more (like ReLU) to it’.

So the **output returned is just n_classes numbers**, one for each class. In the case of MNIST, that’s 10 numbers because there are 10 possible outcomes (images are digits ranging from 0 to 9).
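To turn those 10 numbers into a predicted digit, you would typically take the index of the largest one (in the full script this happens later, at evaluation time; here is a hypothetical NumPy sketch with made-up scores):

```python
import numpy as np

# Hypothetical output for one image: 10 scores, one per digit 0-9
logits = np.array([1.2, -0.3, 0.8, 5.1, 0.0, -2.2, 0.4, 1.9, 0.7, -1.0])

# The predicted class is the index of the largest score
print(np.argmax(logits))  # 3, i.e. the model would predict the digit 3
```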

**Initialisation of the weights and biases: What is tf.random_normal([a,b])?**

It generates a matrix with shape [a,b] (a rows and b columns) with values randomly drawn from a normal distribution with mean 0 and variance 1.
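For intuition, here is the NumPy analogue of that call (a sketch of the behaviour, not TensorFlow's implementation):

```python
import numpy as np

# NumPy analogue of tf.random_normal([a, b]): a-by-b matrix of
# samples from a normal distribution with mean 0 and stddev 1
a, b = 784, 256
W = np.random.normal(loc=0.0, scale=1.0, size=(a, b))

print(W.shape)  # (784, 256)
```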

**How do you choose the number of features extracted from each layer (n_hidden_1 and n_hidden_2)?**

An empirically derived rule of thumb is ‘*the optimal size of the hidden layer is usually between the size of the input and size of the output layers*’. Specifically, one *could* choose the number of features extracted (often referred to as neurons) in a layer to equal the mean of the number of features in the input and output layers. (Source: doug on Stats StackExchange)

There doesn’t seem to be a theory-based rule for choosing these values.
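As a worked example of that rule of thumb (my own arithmetic, not from the source), for MNIST with 784 input features and 10 classes:

```python
# Rule-of-thumb sketch: hidden-layer size = mean of input and output sizes
n_input, n_classes = 784, 10
n_hidden = (n_input + n_classes) // 2
print(n_hidden)  # 397
```

Note the script above uses 256 instead, which also falls between the input and output sizes; in practice these values are tuned empirically.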

#### 3. How would you adapt this model for different problems?

We have not discussed training the model yet, but if you wanted to use this model for a different problem, you would **change the values of n_input and n_classes**.

For example, with our traffic sign classifier, n_input would equal 32*32*3 = 3072: our images are 32 pixels by 32 pixels, and each pixel has three colour channels (Red, Green, Blue). Note that n_input represents the number of input features each example has, not the number of examples you feed into your model. n_classes would equal 43 because there are 43 different traffic signs an image could be.
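Concretely, the only parameter lines that would change for this hypothetical traffic-sign setup are:

```python
# Adapted network parameters for a 32x32 RGB traffic-sign dataset
n_input = 32 * 32 * 3  # 3072 features per flattened image
n_classes = 43         # 43 traffic-sign categories

print(n_input, n_classes)  # 3072 43
```

Everything else, including the hidden layers and the multilayer_perceptron function itself, can stay the same.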

Further reading:

- Aymeric Damien’s implementation of a multilayer perceptron on MNIST.
- TensorFlow’s implementation of a multilayer perceptron on MNIST. You can’t run this file directly but I recommend going through this as well because it’s incredibly clear, gives you practice reading people’s code and teaches you how to write readable code by example.
- A Classification Process Outlined in Simple Terms (Post on the process of logistic classification)
- Explaining TensorFlow Code for a Convolutional Neural Network (Post like this one, except I explain code for a CNN)

*Feature image creds: blog.dbrgn.ch*