In this blog post, we will dive deep into the recently published paper "KAN: Kolmogorov–Arnold Networks" and compare KANs to the commonly used Multilayer Perceptron (MLP), which is currently the cornerstone of complex neural networks.
KANs could become the next state-of-the-art (SOTA) architecture, but they come with limitations. Before we jump into the technical details, you only need to keep in mind the following difference between KANs and MLPs: MLPs apply fixed (non-parametrized) activation functions (which introduce non-linearity) to the outputs of a set of weight matrices (layers), while KANs apply learnable (parametrized) activation functions directly to the inputs. Let's begin!
Kolmogorov-Arnold representation theorem
The Kolmogorov–Arnold representation theorem (or superposition theorem) states that every multivariate continuous function 𝑓: [0, 1]^𝑛 → 𝑅 can be represented as a superposition of continuous functions of a single variable and the binary operation of addition.
More specifically, for a smooth 𝑓: [0, 1]^𝑛 → 𝑅:
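In its classical form, the representation reads:

\[
f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
\]

where each inner function \phi_{q,p} : [0, 1] \to \mathbb{R} and each outer function \Phi_q : \mathbb{R} \to \mathbb{R} is a continuous function of a single variable.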
Let's see what this means exactly. Let's assume that we have a multivariate function where y depends on multiple variables, for example y = f(x1, x2, x3).
Based on the theorem, this can be decomposed into a "composition of univariate functions":
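For three variables (n = 3, so 2n + 1 = 7 outer terms) the decomposition takes the form:

\[
y = \sum_{q=1}^{7} \Phi_q\big( \phi_{q,1}(x_1) + \phi_{q,2}(x_2) + \phi_{q,3}(x_3) \big).
\]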
KAN Architecture
The previous formula is a single composition of univariate functions. In the past, the theorem was already used to build neural networks, but those networks were stuck at the depth-2, width-(2n+1) representation because of the techniques available at the time. In this paper, the idea is generalized to multiple compositions (allowing arbitrary widths and depths), which means that the function can be reconstructed as:
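Following the paper's layer notation, each output coordinate of a layer is a sum of learnable univariate functions of the previous layer's coordinates:

\[
x_{l+1,\,j} = \sum_{i=1}^{n_l} \phi_{l,\,j,\,i}\big(x_{l,\,i}\big), \qquad j = 1, \dots, n_{l+1},
\]

so stacking such layers gives arbitrary depth, and choosing the number of outputs n_{l+1} freely gives arbitrary width.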
Remember that in this case, the inner φ is not a weight but an activation function that operates on the input space X. The above formula can be described by a function matrix with n columns (one per input feature) and m rows (one per output):
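Concretely, a KAN layer with n inputs and m outputs can be written as an m × n matrix of univariate functions,

\[
\Phi =
\begin{pmatrix}
\phi_{1,1}(\cdot) & \phi_{1,2}(\cdot) & \cdots & \phi_{1,n}(\cdot) \\
\vdots & \vdots & & \vdots \\
\phi_{m,1}(\cdot) & \phi_{m,2}(\cdot) & \cdots & \phi_{m,n}(\cdot)
\end{pmatrix},
\]

where "applying" Φ to an input vector means evaluating φ_{j,i} on the i-th input and summing along each row.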
For the sake of simplicity, let's consider only 2 input variables (n = 2) and m = 4 outputs. A one-layer KAN would look like this:
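Written out, the four outputs of this single layer are:

\[
\Phi(x_1, x_2) =
\begin{pmatrix}
\phi_{1,1}(x_1) + \phi_{1,2}(x_2) \\
\phi_{2,1}(x_1) + \phi_{2,2}(x_2) \\
\phi_{3,1}(x_1) + \phi_{3,2}(x_2) \\
\phi_{4,1}(x_1) + \phi_{4,2}(x_2)
\end{pmatrix}.
\]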
Following this, we can keep stacking as many layers as we want and create a KAN model (multiple KAN layers). The end function describing the KAN model would then look like this:
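\[
\mathrm{KAN}(X) = \big( \Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0 \big)(X),
\]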
where L is the number of layers, Φl is the function matrix corresponding to the l-th KAN layer, and X is the input. The Φ functions are differentiable, which makes the KAN model trainable through back-propagation. Up to this point we have covered the basic architecture of KANs, but there is one missing piece: the model's parameters. In MLPs, we can describe the model in terms of its parameters as follows:
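\[
\mathrm{MLP}(X) = \big( W_{L-1} \circ \sigma \circ W_{L-2} \circ \sigma \circ \cdots \circ W_1 \circ \sigma \circ W_0 \big)(X).
\]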
In this case, it is visible that the W are the trainable weight matrices and σ is the fixed activation function. As stated at the beginning, in KANs the activation functions φ are trainable. The authors use splines (combined with a basis function) as these activations; let's briefly cover why. In the paper, φ is defined as follows (the c coefficients are the trainable parameters):
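\[
\phi(x) = w\big( b(x) + \mathrm{spline}(x) \big), \qquad b(x) = \mathrm{silu}(x) = \frac{x}{1 + e^{-x}}, \qquad \mathrm{spline}(x) = \sum_i c_i B_i(x),
\]

where the B_i are B-spline basis functions, b(x) is a fixed basis (residual-like) function, and the coefficients c_i (together with w) are trained.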
Splines
A spline function is a piecewise-defined polynomial function used in interpolation and approximation. The primary characteristics of a spline function include its continuity and smoothness at the points where the polynomial pieces connect, known as knots.
Why use a spline, you ask?
Smooth Interpolation: Spline functions provide a smooth curve that passes through or near a given set of data points, avoiding the oscillations that can occur with higher-degree polynomial interpolation.
Flexibility: Splines, especially cubic splines, offer a good balance between flexibility and smoothness, allowing for complex shapes and curves while maintaining continuity in the first and second derivatives.
Computational Efficiency: Evaluating and manipulating spline functions is computationally efficient compared to higher-degree polynomials.
Local Control: Changes to the data points or knots affect only the local segments of the spline, making it easier to modify and refine the curve without impacting the entire function.
On top of that, B-splines are continuous and differentiable, which is exactly what back-propagation needs.
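To make this concrete, here is a minimal sketch of a single KAN-style edge activation φ(x) = w(silu(x) + Σᵢ cᵢBᵢ(x)), written with NumPy/SciPy. This is not the authors' implementation; the grid size, spline degree, and random coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                   # cubic B-spline
grid = np.linspace(-1.0, 1.0, 8)        # coarse grid over the input range
# pad the knot vector so the spline covers the whole interval
knots = np.concatenate(([grid[0]] * k, grid, [grid[-1]] * k))
n_coef = len(knots) - k - 1             # number of B-spline basis functions B_i
c = 0.1 * np.random.randn(n_coef)       # the "trainable" coefficients c_i (random here)
w = 1.0                                 # overall edge weight

def silu(x):
    # the fixed basis function b(x) that accompanies the spline
    return x / (1.0 + np.exp(-x))

def phi(x):
    # phi(x) = w * (b(x) + sum_i c_i B_i(x)) -- one learnable activation
    spline = BSpline(knots, c, k, extrapolate=True)
    return w * (silu(x) + spline(x))

print(phi(np.linspace(-1.0, 1.0, 5)))   # applied element-wise to a few sample inputs
```

In a real KAN layer there is one such φ per (input, output) pair, and the c coefficients are updated by back-propagation instead of being sampled randomly.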
Now that we have covered all this information, we can appreciate the overview of the differences and characteristics of KANs versus MLPs given in the original paper.
Grid Extension
One of the selling points of KANs is their ability to use splines that can be made arbitrarily accurate to a target function, since the grid can be made arbitrarily fine-grained. Practically, that means we can fit a new fine-grained spline to an old coarse-grained spline.
By extending the grid, more accurate (finer-grained) fits are expected, and the authors also propose an initialization mechanism for the new parameters c′, as follows:
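Roughly, the new coefficients c′ are initialized by least-squares matching the fine-grained spline to the old coarse-grained one over the input distribution:

\[
\{c'_j\} = \arg\min_{\{c'_j\}} \; \mathbb{E}_{x \sim p(x)} \left( \sum_{j} c'_j B'_j(x) - \sum_{i} c_i B_i(x) \right)^{2},
\]

where the B_i are the basis functions on the coarse grid and the B'_j those on the fine grid. A minimal sketch of this idea (an assumed workflow using SciPy's least-squares spline fitting, not the authors' code) could look like this:

```python
import numpy as np
from scipy.interpolate import BSpline, make_lsq_spline

k = 3
coarse_grid = np.linspace(-1.0, 1.0, 6)    # coarse grid (few intervals)
fine_grid = np.linspace(-1.0, 1.0, 21)     # extended, finer grid

t_coarse = np.concatenate(([coarse_grid[0]] * k, coarse_grid, [coarse_grid[-1]] * k))
c_coarse = np.random.randn(len(t_coarse) - k - 1)   # pretend these were already learned
coarse_spline = BSpline(t_coarse, c_coarse, k)

# dense samples of the old spline stand in for the expectation over x ~ p(x)
x = np.linspace(-1.0, 1.0, 500)
y = coarse_spline(x)

t_fine = np.concatenate(([fine_grid[0]] * k, fine_grid, [fine_grid[-1]] * k))
fine_spline = make_lsq_spline(x, y, t_fine, k)      # least-squares fit -> new c'
print(np.max(np.abs(fine_spline(x) - y)))           # should be close to zero
```

After this initialization, training simply continues with the larger set of coefficients c′.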
Interpretability
One of my favorite characteristics of this model is that it can be made interpretable through a pruning technique (similar in spirit to pruning decision trees). To do this, the authors introduce a sparsification technique that combines an L1 norm, defined as an average of activation magnitudes, with an entropy-based regularizer. I'll skip the mathematical expressions here; check them in the original paper. The key point is that by regularizing the model, the most irrelevant nodes are filtered out, and the final function can be extracted directly from the remaining paths.
Catastrophic forgetting?
Catastrophic forgetting is a phenomenon observed in artificial neural networks where the network significantly forgets previously learned information upon learning new information. This problem arises because neural networks often update their weights to minimize error on the current task, which can inadvertently degrade performance on prior tasks.
KANs, on the other hand, can avoid this: they have local plasticity, because the locality of the splines means that a new sample only affects a few nearby spline coefficients while the rest of the function stays intact. Preliminary results in the paper showcase this ability.
Disadvantages
Although KANs look very promising, the idea has only just been revived, which means it will take some time until it is widely picked up. Don't worry: MLPs were invented in the last century and only became popular in the last couple of decades. Some key challenges that need to be addressed for this new technology are the following:
Optimization Challenges: Training KANs may face optimization difficulties. The functional forms and interdependencies between layers can lead to complex error surfaces, making gradient-based optimization methods less effective or slower.
Limited Practical Implementations: While theoretically sound, there are limited practical implementations and examples of Kolmogorov-Arnold networks being used effectively in real-world applications. This can make it difficult for practitioners to adopt and experiment with this approach.
Conclusion
In this blog post, we have seen how Kolmogorov-Arnold Networks (KANs) work, as well as some of their pros and cons. Since this technology came out only a couple of months ago, it is not yet optimized for existing hardware, but in time we may see tech companies picking it up (as has happened in the recent past with LSTMs, Transformers, and so on). For the moment, conventional neural network architectures remain the preferred choice due to their proven efficacy and ease of implementation.
** If you are interested in other ML use-cases, please contact me using the form (and also include a publicly available dataset for the case; I'm always curious to explore new problems).