Neural networks can be thought of as functions that map inputs to outputs. In theory, no matter how complex this function is, a neural network should be able to approximate it. However, most, if not all, supervised learning is about learning a particular function that maps X to Y, and then using this function to find the appropriate Y for a new X. If so, what is the difference between traditional machine learning algorithms and neural networks? The answer is known as inductive bias. The term may sound new, but it is nothing more than the set of assumptions we make about the relationship between X and Y before fitting a machine learning model to the data.
For example, if we think that the relationship between X and Y is linear, we can use linear regression. The inductive bias of linear regression is that the relationship between X and Y is linear; hence, it fits a line or hyperplane to the data.
But when there is a complex, non-linear relationship between X and Y, linear regression may not do a great job of predicting Y. In this case, we need a curve, or a multidimensional surface, to approximate that relationship. The main advantage of neural networks is that their inductive bias is very weak; therefore, no matter how complex the relationship or function is, the network is somehow able to approximate it.
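As a quick sketch of this point, the snippet below (my own illustrative example, not from the original post) fits a line to data that actually follows a quadratic relationship; the large error shows the inductive bias of linear regression failing to match the data.

```python
import numpy as np

# Hypothetical data with a nonlinear (quadratic) relationship
x = np.linspace(-3, 3, 50)
y = x ** 2

# Fit a line (degree-1 polynomial): the inductive bias of linear regression
slope, intercept = np.polyfit(x, y, 1)
line_pred = slope * x + intercept

# The mean squared error stays large because no line can
# capture the curvature of y = x^2
mse = np.mean((y - line_pred) ** 2)
print(mse)
```

By symmetry the fitted slope comes out near zero, so the "best" line is roughly a horizontal line through the mean of y, which is clearly a poor model of a parabola.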
But depending on the complexity of the function, we may have to manually set the number of neurons in each layer and the number of layers in the network. This is usually done by trial and error and experience; hence, these parameters are called hyperparameters.
Neural networks are nothing but complex machines to fit a curve. – Josh Starmer
The architecture and workings of neural networks
Before we see why neural networks work, it will be helpful to show what they do. And before looking at the architecture of a neural network, we need to understand what a single neuron does.
Each input of an artificial neuron has a weight attached to it. The inputs are first multiplied by their own weights and a bias is added to the result. We can call this a weighted sum. Then the weighted sum goes through the activation function, which is basically a nonlinear function.
Therefore, an artificial neuron can be thought of as a simple or multiple linear regression model with an activation function at the end. Having said that, let's move on to the architecture of the neural network.
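The description above can be sketched in a few lines of code. This is a minimal illustration of my own; the sigmoid is just one common choice of activation, and the weights and bias are arbitrary values picked for the example.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum plus bias,
    then a nonlinear activation (sigmoid here)."""
    z = np.dot(inputs, weights) + bias      # weighted sum
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid activation

# Example: 3 inputs with arbitrary illustrative weights
out = neuron(np.array([0.5, -1.0, 2.0]),
             np.array([0.1, 0.4, -0.2]),
             bias=0.3)
print(out)  # a single value between 0 and 1
```

Without the activation at the end, this is exactly a multiple linear regression prediction, which is the comparison the paragraph above is making.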
A neural network usually has multiple layers with each layer containing multiple neurons, where all neurons from one layer are connected to all neurons in the next layer and so on.
In Figure 1.2, we have 4 layers. The first layer is the input layer, which looks like it contains 6 neurons but is really just the data given as input to the neural network (there are 6 neurons because the input data presumably has 6 columns). The last layer is the output layer. The number of neurons in the first and final layers is determined by the data set and the type of problem (number of output classes, etc.). The number of hidden layers and the number of neurons in each hidden layer are chosen by trial and error.
A neuron in layer i takes the outputs of all neurons in layer i-1 as input, calculates their weighted sum, adds a bias, and finally sends the result through the activation function, just as we saw above for a single artificial neuron. The first neuron of the first hidden layer is connected to all the inputs from the previous (input) layer. Similarly, the second neuron of the first hidden layer is also connected to all the inputs from the previous layer, and so on for every neuron in the first hidden layer. For neurons in the second hidden layer, the outputs of the first hidden layer serve as inputs, and each of these neurons is likewise connected to all the neurons of the previous layer.
A layer with m neurons, preceded by a layer with n neurons, will have n * m + m parameters (including biases), with each connection carrying a weight. These weights are randomly initialized, but during training they converge toward values that minimize the loss function of our choice. We will see how these weights are learned in detail in the next blog.
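The n * m + m count can be checked directly. The short loop below (my own sketch) applies the formula to the layer sizes of the network in Figure 1.2:

```python
# Parameter count between consecutive layers: n * m weights
# plus m biases, per the n * m + m formula above.
layer_sizes = [6, 4, 3, 1]  # input, two hidden layers, output (Figure 1.2)

total = 0
for n, m in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n * m + m
    print(f"{n} -> {m}: {params} parameters")
    total += params

print("total:", total)  # 28 + 15 + 4 = 47
```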
Example of forward propagation
Let’s consider the neural network we have in Figure 1.2 and then show how forward propagation works with this network for a better understanding. We can see that there are 6 neurons in the input layer which means there are 6 inputs.
Note: For calculation purposes, I do not include biases. But if biases were included, there would simply be an extra input i0 whose value is always 1, and an extra row at the top of the weight matrix: w01, w02, ..., w04.
Let the input be i = [i1, i2, i3, i4, i5, i6]. We can see that the first hidden layer has 4 neurons, so there will be 6 * 4 connections (without biases) between the input layer and the first hidden layer. These connections are represented in green in the weight matrix below, where each value w_ij is the weight of the connection between the i-th neuron of the input layer and the j-th neuron of the first hidden layer. If we multiply (matrix multiplication) the 1 * 6 input matrix by the 6 * 4 weight matrix, we get the output of the first hidden layer, which is 1 * 4. This makes sense, because there are exactly 4 neurons in the first hidden layer.
These four outputs are represented in red in Fig. 2.1. Once we have these values, we send them through the activation function to introduce nonlinearity, and the resulting values are the actual output of the first hidden layer.
Now, we repeat the same steps for the second hidden layer, with a different weight matrix.
Here i1, i2, etc. are just the outputs of the previous layer; I am reusing the same variable names for ease of understanding. Similar to what we saw earlier, the 1 * 4 input matrix is multiplied by the 4 * 3 weight matrix (because the second hidden layer has 3 neurons), which produces a 1 * 3 matrix. Applying the activation function element-wise to that matrix gives the input for the next layer.
Take a guess at what the weight matrix will look like for the final layer.
Since the final layer contains only 1 neuron and the previous layer has 3 outputs, the weight matrix will be of size 3 * 1. This marks the end of forward propagation in a simple feed-forward neural network.
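The whole forward pass for the 6-4-3-1 network can be sketched with a few matrix multiplications. This is my own minimal illustration: the weights are random (as they would be before training), biases are omitted to match the note above, and ReLU is just one possible activation choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # ReLU as an illustrative activation; the text does not fix a choice
    return np.maximum(0, x)

# Layer sizes from Figure 1.2: 6 inputs -> 4 -> 3 -> 1 output
i  = rng.standard_normal((1, 6))   # 1 x 6 input matrix
W1 = rng.standard_normal((6, 4))   # input -> first hidden layer
W2 = rng.standard_normal((4, 3))   # first -> second hidden layer
W3 = rng.standard_normal((3, 1))   # second hidden -> output

h1 = relu(i @ W1)    # (1 x 6) @ (6 x 4) -> 1 x 4
h2 = relu(h1 @ W2)   # (1 x 4) @ (4 x 3) -> 1 x 3
out = h2 @ W3        # (1 x 3) @ (3 x 1) -> 1 x 1
print(h1.shape, h2.shape, out.shape)
```

Note how the shapes chain together exactly as described in the matrix walkthrough above: each layer's output width equals its neuron count.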
Why does this approach work?
We have already seen that what each neuron in the network does is not much different from linear regression; in addition, each neuron applies an activation function at the end and has its own weight vector. But why does this work?
We have now seen how the computation works, but my main goal with this blog is to shed some light on why this approach works. In theory, neural networks should be able to approximate any continuous function, however complex and nonlinear it may be. I will do my best to convince you, and to convince myself, that given the right parameters (weights and biases), the network should be able to learn anything in the way we saw above.
The importance of nonlinearity
Before we go any further, we need to understand the power of nonlinearity. When we add two or more linear objects such as lines, planes, or hyperplanes, the result is also a linear object: a line, plane, or hyperplane respectively. So, no matter in what proportions we add these linear objects, we still get a linear object.
But this is not the case for nonlinear objects. When we add two different curves, we will likely get a more complex curve, and if we can add different parts of these nonlinear curves in different proportions, we can shape the resulting curve almost however we like.
In addition to just adding nonlinear objects, or let's say "hyper-curves" (by analogy with "hyperplanes"), we also introduce a nonlinearity in each layer through the activation functions. That basically means we are applying a nonlinear function to an already nonlinear object, and by adjusting the weights and biases we can change the shape of the resulting curve or function.
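To make "adding nonlinear pieces in different proportions" concrete, here is a small illustration of my own: each shifted ReLU is just a ramp, but a weighted sum of three of them produces a new shape (a triangular bump) that none of the pieces has on its own. This is essentially what the hidden neurons of a network do.

```python
import numpy as np

relu = lambda x: np.maximum(0, x)

x = np.linspace(-2, 2, 9)

# Three shifted ReLUs combined in different proportions form a "bump":
# individually each is a piecewise-linear ramp, but their weighted sum
# is a genuinely new nonlinear shape.
bump = relu(x + 1) - 2 * relu(x) + relu(x - 1)
print(bump)  # zero outside [-1, 1], peaking at x = 0
```

Adding many such bumps, scaled and shifted, is one intuitive route to approximating an arbitrary continuous curve, which is the informal idea behind the approximation claim above.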
This is why more complex problems require more hidden layers and more hidden neurons, while less complex problems or relationships can be approximated with fewer layers and neurons. Each neuron solves its own small problem, and together they solve a larger one, which is usually to minimize the cost function. The apt phrase here is divide and conquer.
What if neural networks don’t use activation functions?
If a neural network does not use activation functions, it is just one big linear unit that could easily be replaced by a single linear regression model.
y = m * x + c
z = k * y + t => k * (m * x + c) + t => k * m * x + k * c + t => (k * m) * x + (k * c + t)
Here, z still depends linearly on x: k * m can be replaced by one new variable and k * c + t by another. Hence, without activation functions, no matter how many layers and how many neurons there are, all of them would be redundant.
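The same collapse happens with full weight matrices, not just the scalar example above. In this sketch of mine, two stacked linear layers with random weights produce exactly the same output as a single layer whose weight matrix is their product:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two "layers" with NO activation function between them
W1 = rng.standard_normal((6, 4))
W2 = rng.standard_normal((4, 3))

x = rng.standard_normal((1, 6))

# Passing through both layers in sequence...
two_layers = (x @ W1) @ W2

# ...is identical to one linear layer with weights W1 @ W2,
# so the extra layer adds no expressive power at all.
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layers, one_layer))  # True
```

This is just the associativity of matrix multiplication, and it is why the activation function between layers is essential.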
We have seen how neural networks calculate their output and why this method works. Simply put, the main reason neural networks can learn complex relationships is that in each layer we introduce nonlinearity and add different proportions of the resulting curves to get the desired output; that output again goes through an activation function, and the same process repeats to further shape the result. Every weight and bias in the network matters and can be adjusted to approximate the relationship. Although the weights of each neuron are initially random, they are learned by a special algorithm called backpropagation.