Neural Networks & MLPs
It has been a long-standing goal to create machines that can act and reason in a similar fashion as humans do. And while there has been a lot of progress in artificial intelligence (AI) and machine learning in recent years, some of the groundwork was already laid out more than 60 years ago. These early concepts drew their inspiration from theoretical principles of how biological neural networks such as the human brain work. In 1943, McCulloch and Pitts published a paper describing the relationships of (artificial) neurons in networks based on their “all-or-none” activity characteristic. This “all-or-none” characteristic refers to the fact that a biological neuron either responds to a stimulation or remains silent; there is no in-between. A direct observation of this behavior can, for example, be seen in microelectrode recordings from the human brain. After this initial paper on artificial neural networks, Frank Rosenblatt published a paper in 1957 entitled “The Perceptron — A Perceiving and Recognizing Automaton”. The Perceptron is a supervised linear classifier that uses adjustable weights to assign an input vector to a class. Similar to the 1943 McCulloch and Pitts paper, the idea behind the Perceptron is to mimic the computations of biological neurons to create an agent that can learn. In the following, we will have a look at the idea behind the Perceptron and how it works.
As mentioned above, the Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron learning rule based on the original McCulloch-Pitts (MCP) neuron. A Perceptron is an algorithm for supervised learning of binary classifiers; it processes the elements of the training set one at a time.
1) Inputs are feature values
2) Each feature has a weight
3) The sum is the activation
The following figure shows the structure of a perceptron, where $x_i$ and $w_i$ stand for feature $i$ and the weight of feature $i$, respectively.
There are two types of Perceptrons: single-layer and multilayer.
Single layer: single-layer perceptrons can learn only linearly separable patterns.
Multilayer: multilayer perceptrons, or feedforward neural networks with two or more layers, have greater processing power. The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary.
The Perceptron receives multiple input signals, and if the sum of the input signals exceeds a certain threshold, it outputs a signal; otherwise, it does not. In the context of supervised learning and classification, this can then be used to predict the class of a sample. The subsequent figure shows how a perceptron can distinguish between two linearly separable classes, +1 and -1.
Let's talk about it more mathematically:
The Perceptron is a function that maps its input $x$, multiplied by the learned weight coefficients, to an output value $f(x)$.
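Using the 0/1 output convention described below, this mapping can be written as:

$$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}$$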
In the equation given above:
“w” = vector of real-valued weights
“b” = bias (an element that adjusts the boundary away from origin without any dependence on the input value)
“x” = vector of input x values
The output can be represented as “1” or “0.” It can also be represented as “1” or “-1” depending on which activation function is used.
In the space of feature vectors:
* Examples are points
* Any weight vector is a hyperplane
* One side corresponds to $y = +1$
* The other side corresponds to $y = -1$
The following figure shows how the perceptron is used to discriminate between two linearly separable classes. In this figure, the activation function is a sigmoid.
Start with weights = 0
For each training instance:
Classify with current weights
If correct (i.e., y=y*), no change!
If wrong: adjust the weight vector by adding or subtracting the feature vector (add if $y^*$ is +1, subtract if $y^*$ is -1). See below:
If the training data is separable, then perceptron training will converge. You can find a proof of convergence on this site.
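To make the update rule concrete, here is a minimal NumPy sketch of perceptron training. The toy dataset, the choice of 10 epochs, and the {-1, +1} label convention are assumptions made for the example, not part of the original.

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Classic perceptron rule: labels y in {-1, +1}, weights start at 0.
    A bias is folded in by appending a constant 1 feature to every example."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])      # add bias feature
    w = np.zeros(X.shape[1])                          # start with weights = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(w, x_i) > 0 else -1  # classify with current weights
            if y_pred != y_i:                         # if wrong: adjust the weights
                w += y_i * x_i                        # add or subtract the feature vector
    return w

# A linearly separable toy problem: y = +1 roughly when x1 + x2 > 1
X = np.array([[0.0, 0.0], [0.0, 0.5], [1.0, 1.0], [0.5, 1.0]])
y = np.array([-1, -1, 1, 1])
w = perceptron_train(X, y)
print([1 if np.dot(w, np.append(x, 1)) > 0 else -1 for x in X])  # [-1, -1, 1, 1]
```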
We've already talked about what problems a perceptron can solve. But let's consider the following configuration known as the XOR Problem.
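For concreteness, the XOR configuration is the following truth table:

| $x_1$ | $x_2$ | $y$ |
|-------|-------|-----|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |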
There are two independent variables $x_1$ and $x_2$, and the output variable is $y$. Let's analyze this configuration in a Cartesian plane:
Here, the green points represent the points where $y$ is equal to 1 and the red points represent the ones with zero output. Here comes the problem: can we find a single straight line that successfully separates these two classes? The answer is NO! No matter how we choose the line, there will always be at least one misclassified point. This is because these two classes are not "linearly separable". We tackle this problem using Multi-Layer Perceptrons (MLPs).
MLPs consist of several layers of perceptrons, each followed by a non-linear activation function. This non-linearity makes it possible to separate classes that are not linearly separable. To see how this is possible, consider the XOR problem and the following MLP:
The perceptrons in the middle are followed by a non-linear sign function. Suppose that these two perceptrons form the two linear boundaries shown in the following figure (a):
The sign functions then map the outputs to the points shown in (b). Is it possible to separate these points with another perceptron? Yes, and that's the beauty of MLPs.
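As a concrete illustration, here is a small NumPy sketch with one hand-picked set of hidden weights. These particular thresholds are just one choice that happens to solve XOR, not the weights from the figure.

```python
import numpy as np

def sign(z):
    """Hard threshold: +1 for z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # XOR inputs
y = np.array([0, 1, 1, 0])                       # XOR targets

# Hidden layer: two perceptrons define two linear boundaries, as in figure (a)
h1 = sign(X[:, 0] + X[:, 1] - 0.5)   # +1 when at least one input is 1
h2 = sign(X[:, 0] + X[:, 1] - 1.5)   # +1 only when both inputs are 1

# In (h1, h2) space, as in figure (b), the classes are linearly separable,
# so one more perceptron can separate them.
out = sign(h1 - h2 - 1.5)
print(out)   # [-1  1  1 -1], i.e. +1 exactly where y == 1
```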
The most important issue to address in MLPs is the way we train them, i.e., the method by which we find the $w$'s and $b$'s in each layer. This method is called backpropagation. The math behind backpropagation is extensive and cannot be covered in full here; you can find a thorough explanation in the Russell and Norvig textbook. A step-by-step implementation can be found here.
In a nutshell, it involves defining a loss function $J$, which represents the deviation of the predicted outputs ($\hat{y}$) from the actual outputs ($y$) of the training set, and using the chain rule to calculate the derivatives needed to update the weights with the stochastic gradient descent algorithm. Finally, this leads to the algorithm shown below.
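As a rough NumPy illustration of these steps (a sketch, not the exact algorithm from the textbook), the following trains a one-hidden-layer MLP on XOR with a squared-error loss $J = \frac{1}{2}\sum(\hat{y}-y)^2$; for brevity it uses full-batch rather than stochastic updates, and the hidden width, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: XOR, so a hidden layer is genuinely needed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer
lr = 0.5

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward pass: chain rule for J = 1/2 * sum((y_hat - y)^2)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # error at the output pre-activation
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * h * (1 - h)        # error propagated to the hidden layer
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)

    # Gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(y_hat.round(2))   # typically ends up close to [[0], [1], [1], [0]]
```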
In Python, MLPs are implemented in the scikit-learn library. A sample code snippet follows. Be careful when using MLPs: always normalize the input vectors, because MLPs are very sensitive to the scale of the inputs.
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a toy binary classification problem and split it into train/test sets
X, y = make_classification(n_samples=100, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=1)

# Fit an MLP classifier and evaluate it on the held-out test set
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
clf.predict(X_test[:5, :])
clf.score(X_test, y_test)
What about the XOR problem?
import numpy as np
from sklearn.neural_network import MLPClassifier

# The four XOR input combinations and their targets
train_data = np.array([[0, 0],
                       [0, 1],
                       [1, 0],
                       [1, 1]])
target_xor = np.array([0, 1, 1, 0])

# Fit the same kind of MLP classifier on the XOR data
clf = MLPClassifier(random_state=1, max_iter=300).fit(train_data, target_xor)
clf.predict(train_data)
clf.score(train_data, target_xor)
See? Perfect classification!
There are several activation functions; you can see some of the most common ones in the figure below:
The first two (sigmoid and tanh) are rarely used today, because their derivatives are nearly zero except for a small range of inputs. This causes a problem called the "vanishing gradient" in backpropagation that prevents the weights from being properly trained.
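For reference, the usual definitions make the saturation problem easy to see:

$$\sigma(z) = \frac{1}{1+e^{-z}}, \qquad \tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}, \qquad \mathrm{ReLU}(z) = \max(0, z)$$

The sigmoid derivative is $\sigma'(z) = \sigma(z)\,(1-\sigma(z)) \le 1/4$ and approaches zero for large $|z|$, so multiplying many such factors through the chain rule shrinks the gradient layer by layer; ReLU, by contrast, has derivative exactly 1 for all positive inputs.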
With high-dimensional inputs such as images or videos, the 2-dimensional (or higher) inputs need to be flattened into 1-dimensional vectors, which blows up the number of trainable parameters. For instance, for a 100$\times$100$\times$3 image, 100$\times$100$\times$3 = 30,000 weights are needed just for a single neuron in the second layer; this consumes storage and processing capacity and also increases the chance of overfitting. Because of these disadvantages of MLPs, a new neural-network-based model was devised for spatially correlated data such as images.
Convolutional Neural Networks (CNNs) have a different architecture than regular neural networks such as MLPs. Regular neural networks transform an input by putting it through a series of hidden layers. Every layer is made up of a set of neurons, and each layer is fully connected to all neurons in the layer before. Finally, there is a last fully-connected layer, the output layer, that represents the predictions.
Convolutional Neural Networks are a bit different. First of all, the layers are organized in 3 dimensions: width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension.
The most common use of CNNs is image classification, for example identifying satellite images that contain roads or classifying handwritten letters and digits. There are other quite mainstream tasks, such as image segmentation and signal processing, at which CNNs perform well. CNNs have also been used for language understanding in Natural Language Processing (NLP) and for speech recognition.
A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. Three main types of layers are used to build ConvNet architectures: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer (exactly as seen in regular neural networks).
When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume (fully connected). Instead, each neuron connects to only a local region of the input volume. This region is typically 3$\times$3 or 5$\times$5, but in general it is a hyperparameter called the filter size and must be tuned based on the problem.
The CONV layer's parameters consist of a set of learnable filters (also known as kernels). As mentioned before, every filter is small spatially (along width and height) but extends through the full depth of the input volume. Each filter convolves (slides) across the width and height of the input volume and computes dot products between the entries of the filter and the input at every position. As the filter slides over the width and height of the input volume, it produces a 2-dimensional activation map that gives the responses of that filter at every spatial position. The stride is the size of the step the convolution filter moves each time. A stride of 1 is usual, meaning the filter slides pixel by pixel. By increasing the stride, the filter slides over the input with a larger interval and thus has less overlap between the cells. The figure below shows the procedure of convolving a filter over an image.
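To make the sliding-dot-product idea concrete, here is a minimal NumPy sketch of a valid convolution (strictly speaking a cross-correlation, as used in CNNs) of a single-channel image with a single filter; the 6$\times$6 image and the vertical-edge filter are just illustrative choices.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide an (f, f) filter over an (n, n) image and take dot products."""
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel)     # dot product of filter and patch
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])            # a simple vertical-edge detector
print(conv2d(image, edge_filter).shape)            # (4, 4): 6 - 3 + 1
print(conv2d(image, edge_filter, stride=2).shape)  # (2, 2): larger stride, smaller map
```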
One important point about convolutional layers is that the weights and biases are shared within each depth slice, which means that for a volume of size [55x55x96] the first conv layer would have only 96 unique sets of weights. In this way all neurons in a single depth slice use the same weight vector, and the forward pass of the CONV layer can, in each depth slice, be computed as a convolution of the neuron's weights with the input volume (hence the name: Convolutional Layer). This is why it is common to refer to each set of weights as a filter that is convolved with the input.
For example, the two images below show a grayscale dog image and the same image after a filter has been applied to it.
Note: with an image of size (n, n) as the input and a filter of size (f, f), the output will have dimensions (n-f+1, n-f+1).
Example 1. Suppose that the input volume has size [32x32x3] and the filter size is 5x5. Each filter then produces a 28$\times$28 activation map (since 32-5+1 = 28), so with K filters the output volume has dimensions (28, 28, K).
Example 2. In the above example each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5$\times$5$\times$3 = 75 weights (plus 1 bias parameter).
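As a quick sanity check of both examples, a one-layer Keras model reports the same numbers (the choice of 6 filters here is arbitrary):

```python
from keras.models import Sequential
from keras.layers import Conv2D

# A 32x32x3 input volume and six 5x5 filters
model = Sequential([Conv2D(6, kernel_size=(5, 5), input_shape=(32, 32, 3))])
model.summary()
# Output shape: (None, 28, 28, 6)  -> 32 - 5 + 1 = 28 per spatial dimension
# Params: 456 = (5*5*3 + 1) * 6    -> 75 weights plus 1 bias per filter
```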
Because the size of the feature map is always smaller than the input, we have to do something to prevent our feature map from shrinking. This is where we use padding: a border of zero-valued pixels is added around the input so that the feature map does not shrink. In addition to keeping the spatial size constant after convolution, padding also improves performance and makes sure the kernel and stride size fit the input.
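In general, with padding $p$ and stride $s$ the output spatial size is $\lfloor (n + 2p - f)/s \rfloor + 1$; choosing $p = (f-1)/2$ with $s = 1$ ("same" padding) keeps the output the same size as the input. For example, a 28$\times$28 input with a 3$\times$3 filter, $p = 1$ and $s = 1$ gives $(28 + 2 - 3)/1 + 1 = 28$.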
After a convolution layer, it is common to add a pooling layer. The function of pooling is to progressively reduce the dimensionality, and with it the number of parameters and the amount of computation in the network. Pooling summarizes or aggregates the information inside a pooling window with a function such as max or average. This shortens the training time and controls overfitting. The most frequent type of pooling is max pooling, which takes the maximum value in each window; the window size needs to be specified beforehand. This decreases the feature-map size while keeping the significant information.
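Here is a minimal NumPy sketch of max pooling over a single 4$\times$4 feature map; a window and stride of 2 are the typical, but not the only, choice.

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Take the maximum of each (window x window) patch of a 2-D feature map."""
    n = feature_map.shape[0]
    out_size = (n - window) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = feature_map[i*stride:i*stride+window,
                                    j*stride:j*stride+window].max()
    return out

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [2., 1., 9., 7.],
                 [0., 3., 8., 4.]])
print(max_pool(fmap))   # [[6. 5.]
                        #  [3. 9.]]
```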
After extracting useful patterns from the image (or any other kind of input), the outputs of the conv and pooling layers are flattened into a 1-D vector for further computations and finally transformed into probabilities by a softmax activation function (for classification problems).
Note: the total number of output features of a conv layer is ((n-f+1) x (n-f+1) x num_of_filters), so the input dimension of a fully-connected layer that follows a conv layer is ((n-f+1) x (n-f+1) x num_of_filters).
The last point: when implementing a CNN, the four important hyperparameters we have to decide on are the filter (kernel) size, the number of filters, the stride, and the padding.
These hyperparameters are tuned based on model performance metrics, for example through grid search or other methods.
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import to_categorical

# The original snippet left these undefined; the values below are illustrative,
# and MNIST (28x28 grayscale digits) is assumed because of input_shape=(28, 28, 1)
num_classes = 10
batch_size = 128
epochs = 12

# Load MNIST, reshape to (samples, 28, 28, 1), scale to [0, 1], one-hot the labels
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255
x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

# Two conv layers, max pooling, dropout, then a fully-connected classifier head
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
          verbose=1, validation_data=(x_test, y_test))