Non-Linearity in Neural Networks

Reeshabh Choudhary
7 min read · Nov 22, 2024

👷‍♂️ Software Architecture Series — Part 33.

We live in a multi-dimensional world. Let us pick a random entity, say a loaf of bread. We can associate multiple dimensions with it for analytics purposes. For example, we can note down the length, breadth, and thickness of the bread, which correspond to its size. Next, we can take note of the amount of wheat flour and emulsifiers used in the baking process, along with the temperature it has been baked at. The resulting loaves might vary in taste and color. However, none of the attributes mentioned earlier is linearly associated with the resulting properties of the bread. If given the task of predicting taste from these attributes, no linear algebraic function will capture the relationship. A linear function is one which produces straight-line outputs when plotted against the available inputs.

f(x) = W.x + b

Here, f(x) is a linear function where the relationship between the input attributes and the output is a straight line in multi-dimensional space. W represents the importance given to each input feature. Generally, (W.x) is called the weighted sum: it aggregates all inputs into a single value that reflects the combined contribution of all features for a given neuron. Since the weighted sum alone is constrained to pass through the origin (0, 0) of the input-output space, we add a bias, which allows the network to shift the decision boundary.
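As a minimal sketch of this computation (the variable names and values below are illustrative, not taken from the article), a single neuron's linear output can be written in a few lines of Python:

import numpy as np

x = np.array([2.0, 1.5, 0.5])   # input features, e.g., length, breadth, thickness
W = np.array([0.4, -0.2, 0.7])  # one weight per feature: its importance
b = 0.1                         # bias: shifts the output away from the origin

f_x = np.dot(W, x) + b          # weighted sum plus bias
print(f_x)                      # 0.95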

However, real-world data more often than not involves non-linear relationships. To make predictions based on real-world data, machine learning models based on neural networks have to utilize non-linearity to capture complex relationships. In mathematical terms, a linear model can only represent straight lines or planes:

f(x1, x2) = W1.x1 + W2.x2 + b

But to capture relations which involve intricate patterns, curves, or interactions between inputs, non-linearity has to be captured by the model. For example, in many binary classification problems, the classes cannot be divided by a straight line. In the XOR problem, the target output depends on the combination of inputs rather than on a direct relation between any single input and the output. More complex problems like image recognition require recognizing patterns such as edges, shapes, and textures, which are non-linearly related to the pixel values. Due to market complexities, stock market prediction based on historical data cannot be linear either. To capture non-linearity, neural networks make use of activation functions, represented as:
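To make the XOR example concrete, the following sketch (an illustration of my own, under the assumption that the best linear fit is found by least squares) shows that no choice of weights and bias can reproduce the XOR truth table; the optimal linear function simply predicts 0.5 everywhere:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
y = np.array([0, 1, 1, 0], dtype=float)                      # XOR targets

X_b = np.hstack([X, np.ones((4, 1))])  # constant column so the bias is fitted too
params, *_ = np.linalg.lstsq(X_b, y, rcond=None)  # best W1, W2, b in the least-squares sense
print(X_b @ params)                    # [0.5 0.5 0.5 0.5]: no line separates XOR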

f(x1, x2) = f’(W1.x1 + W2.x2 + b)

This activation function f’ can produce outputs which capture curves, kinks, etc. Depending upon the problem statement, the suitable activation function(s) is/are picked by data scientists, and more often than not it is a matter of trial and error. The sigmoid activation function is often used to squash the output of a neuron into a bounded range, specifically between 0 and 1. This property makes it well-suited for tasks where the network’s output needs to represent probabilities or binary classifications. By confining the output values to a limited range, the sigmoid function ensures that predictions remain meaningful and interpretable.

While the sigmoid activation function does introduce non-linearity, it is particularly prone to the vanishing gradient problem. The vanishing gradient problem occurs when gradients become very small during backpropagation, leading to negligible updates to the network’s parameters. This issue is particularly problematic for deep networks with many layers, as the gradients can diminish exponentially as they are propagated backward. When gradients are small, the network updates its parameters at a sluggish pace, effectively impeding the model’s ability to learn from the data. This can result in networks that fail to capture complex patterns and generalize poorly to new data. Due to the vanishing gradient problem, using the sigmoid activation function between hidden layers is generally not recommended, especially for deep networks. More modern activation functions, such as ReLU and its variants (Leaky ReLU, Parametric ReLU), have become the popular choice for hidden layers. These functions address the vanishing gradient issue more effectively and enable faster convergence.
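The vanishing gradient can be seen numerically. Backpropagation multiplies the local derivatives of each layer's activation; the sketch below (my own illustration, not code from the article) shows that even in the sigmoid's best case the product collapses across 20 layers, while ReLU's derivative stays at 1:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # peaks at 0.25, when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0    # exactly 1 for any positive input

layers = 20
print(sigmoid_grad(0.0) ** layers)  # 0.25^20 ~ 9.1e-13: the gradient vanishes
print(relu_grad(1.0) ** layers)     # 1.0^20 = 1.0: the gradient survives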

However, that does not mean linearity, or linear transformation, has no role to play in neural networks. A neural network first applies a linear transformation to the raw input data; its parameters, the weights and biases, are tuned to adapt to the data during the training stage. The activation function is then applied to the output of this transformation, allowing the network to learn complex relationships from the data.
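Put together, a layer is the linear transformation followed by the activation. A minimal sketch, with illustrative weights and a ReLU activation:

import numpy as np

def layer(x, W, b, activation):
    return activation(W @ x + b)    # linear transformation first, non-linearity second

relu = lambda z: np.maximum(0.0, z)

x = np.array([1.0, 2.0])
W = np.array([[0.5, 0.3], [-0.4, -0.8]])
b = np.array([0.1, 0.2])
print(layer(x, W, b, relu))         # [1.2 0. ]: one neuron fires, the other is clipped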

To understand this clearly, let us first understand what constitutes a neural network and how it functions. The foundational principle of a neural network mirrors the interconnectedness and information processing observed in human neurons. Instead of electrical impulses, each neuron in an artificial neural network stores a numerical value representing its strength and capacity to transmit information to neighbouring neurons. The architecture of a neural network is organized into layers.

A neural network is a stack of layers, broadly classified as the input layer, hidden layer(s), and output layer. It is the input layer where raw datasets are received. No computation takes place inside this layer; it simply serves as a gateway that receives the data and forwards it to the hidden layer(s). The process of passing information from the first layer forward to the last layer is referred to as a feedforward operation.

However, the picture is not that simple. We are missing a key element, the fundamental building block of neural networks, called the neuron or perceptron. Neurons in each layer are often fully connected to the neurons in the previous and subsequent layers. Information flows sequentially from the input layer through the intermediate layers to the output layer.

A single perceptron model

A perceptron, or neuron, receives multiple binary inputs, denoted as x1, x2, x3, …, where each can be either 0 or 1, and produces a single binary output. Additionally, there are corresponding weights w1, w2, w3, … associated with each input. The weights represent the importance or significance of each input in influencing the perceptron’s output.
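A minimal perceptron along these lines (the weights and bias below are illustrative):

import numpy as np

def perceptron(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0  # fires only when the weighted sum plus bias is positive

w = np.array([0.6, 0.4, 0.9])   # importance of each binary input
b = -1.0                        # a negative bias acts as a firing threshold

print(perceptron(np.array([1, 0, 1]), w, b))  # 1: 0.6 + 0.9 - 1.0 > 0
print(perceptron(np.array([1, 1, 0]), w, b))  # 0: 0.6 + 0.4 - 1.0 = 0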

Deciding how many neurons to include in each layer of a neural network is a non-trivial task which significantly affects a model’s performance. There is no one-size-fits-all answer; it usually depends on the complexity of the task, the computational resources available, the amount of data available, and other factors. Similarly, deciding how many hidden layers a neural network should have depends on the same factors, and usually a trial-and-error approach is taken before settling on the final number. Adding too many layers can result in overfitting, where the network memorizes the training data but performs poorly on new data in the validation or test datasets. On the contrary, if we keep only a few layers to keep the model simple, it may fail to capture complex relationships. The recommended approach is to start with a shallow model and gradually increase its size, guided by experimentation and domain knowledge.

The transmission of information occurs layer by layer, with data progressing from the input to the output layer. To represent it mathematically, we start from the first layer or the input layer.
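Each layer applies its own weights and bias to the previous layer's output and passes the result through the activation function. In the article's notation (the layer-wise form below is the standard formulation, filled in here for completeness):

a1 = f’(W1.x + b1)

a2 = f’(W2.a1 + b2)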

The process is repeated till the final layer, which has an additional softmax function to map the outputs into the desired format, for example, class probabilities.
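A compact feedforward sketch with a softmax output layer (the shapes and random weights are illustrative assumptions):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

def feedforward(x, layers):
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)   # hidden layers: linear step, then ReLU
    W, b = layers[-1]
    return softmax(W @ x + b)            # output layer: scores become probabilities

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),   # 3 inputs -> 4 hidden neurons
          (rng.normal(size=(2, 4)), np.zeros(2))]   # 4 hidden -> 2 output classes
print(feedforward(np.array([0.5, -1.0, 2.0]), layers))  # two probabilities summing to 1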

A multi-layer perceptron based Neural Network

In the training stage, the generated output prediction is compared with the desired output target. A loss or error metric is chosen to quantify the difference between the predicted output and the actual target. A loss function takes two inputs: the model’s prediction and the actual ground-truth value. It computes a measure of the discrepancy between the two, quantifying how well the model’s output aligns with reality. This measure serves as an indicator of the quality of the model’s predictions and guides the optimization process.

To improve the predictions, it is necessary to calculate the derivative of the loss function with respect to the network’s parameters. This gradient indicates how the loss changes as the parameters are adjusted, i.e., the direction and magnitude of the adjustments needed to minimize the error. Computing these gradients layer by layer, from the output back towards the input, is known as backward propagation. The gradients are then utilized in gradient descent, an optimization technique that systematically adjusts the parameters to minimize the loss function: during each iteration, the parameters are updated in the direction opposite to the gradient, which gradually steers the model towards more accurate predictions.
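The update step itself is simple. A toy sketch with a single weight and a squared-error loss (the values are illustrative), showing the loop of predict, measure, differentiate, and step opposite to the gradient:

x, y = 2.0, 6.0     # one training pair; the ideal weight is 3.0
w, lr = 0.0, 0.1    # initial parameter and learning rate

for _ in range(50):
    prediction = w * x
    grad = 2 * (prediction - y) * x   # derivative of (w*x - y)^2 with respect to w
    w = w - lr * grad                 # gradient descent: move against the gradient

print(round(w, 4))  # ~3.0: the loss has been driven towards its minimum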

This, in a nutshell, is how a neural network works and captures the non-linearity of complex relationships in real-world data.


Written by Reeshabh Choudhary

Software Architect and Developer | Author: Objects, Data & AI.
