
The perceptron learning algorithm is easy to understand: it uses an error signal, computed as the difference between the true class label and the predicted class label. This error signal determines by how much the weights should be updated, nudging them toward a set of weights that correctly predicts every sample in the dataset. The simplicity of the perceptron learning algorithm also has drawbacks. First, we assume that there exists a set of weights in the hypothesis space that adequately separates the classes. Second, we assume that the dataset is linearly separable into distinct classes. This is not always the case: it was famously demonstrated that the perceptron cannot learn the XOR gate, because the data distribution of an XOR gate is such that its samples cannot be separated by a single straight line.
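Both points can be seen directly in code. The sketch below (a minimal NumPy implementation, not from the original text) trains a perceptron with the update rule described above, `w += lr * (y - y_hat) * x`. On the linearly separable AND gate the error signal drives the weights to a perfect solution; on XOR the per-epoch error count never reaches zero, no matter how long we train.

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Perceptron learning rule: w += lr * (y - y_hat) * x."""
    # Append a bias column of ones so the threshold is learned as a weight.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(Xb, y):
            y_hat = 1 if x_i @ w > 0 else 0
            # Error signal: difference between true and predicted label.
            err = y_i - y_hat
            w += lr * err * x_i
            errors += abs(err)
        if errors == 0:  # a full pass with no mistakes: converged
            break
    return w, errors

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# AND is linearly separable: the final epoch has zero errors.
w_and, err_and = perceptron_train(X, np.array([0, 0, 0, 1]))

# XOR is not: the weights oscillate and errors stay above zero.
w_xor, err_xor = perceptron_train(X, np.array([0, 1, 1, 0]))
print(err_and, err_xor)
```

Note that a zero-error epoch implies a fixed weight vector classified all four samples correctly, which is exactly what no single straight line can do for XOR.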

The intuition to take away is that to model more complex functions, we need to make use of non-linear activation functions, because they are inherently capable of richer representations. In simple terms, they extend our hypothesis space so that we are more likely to find a set of parameters that solves the learning task as defined by the data.
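As a concrete illustration, the sketch below (hand-set weights chosen for this example, not learned) stacks two threshold units into a hidden layer and combines them with a non-linearity. One hidden unit acts like OR, the other like AND, and the output fires for "OR but not AND", which is exactly XOR. No single linear unit can do this, but two layers with a non-linear activation between them can. (A step function is used here for clarity; trained networks use differentiable activations so that gradients can flow.)

```python
import numpy as np

def step(z):
    # Hard-threshold non-linearity: 1 where z > 0, else 0.
    return (z > 0).astype(int)

# Hidden layer: two threshold units, each drawing its own line.
#   h1 fires when x1 + x2 > 0.5  (an OR-like unit)
#   h2 fires when x1 + x2 > 1.5  (an AND-like unit)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])

# Output unit: fires when h1 - h2 > 0.5, i.e. "OR but not AND" -> XOR.
w_out = np.array([1.0, -1.0])
b_out = -0.5

def xor_net(x):
    h = step(x @ W_hidden.T + b_hidden)
    return step(h @ w_out + b_out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_net(X))  # [0 1 1 0]
```

The non-linearity is essential: without `step`, the two layers collapse into a single linear map, and we are back to the perceptron's limitation.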