A Healthy AI is more accurate
When things are under control, they run their natural course. When that control breaks down in the human body, cells form lumps; if they keep growing unchecked, they become cancer. In a garden, unwanted growth that overtakes the intended plants is called a weed. The same principle applies to neural network weights: uncontrolled growth or distortion leads to a misrepresentative model that fails to generalize.
Normalization is the process of scaling values into an acceptable range — for instance, to enable fair comparison or improve modeling — since extremely large or small values are often unhelpful.
§1 Introduction
If you think about it, the training of neural networks is analogous to how the brain learns: billions of biological neurons with complex, intertwined systems of electrical signals combine to form cognition. When the perceptron was introduced in the late 1950s, the idea was to make a network strong enough, or big enough, that it would eventually replicate the human brain. Over the years, several efforts were carried out, from Rosenblatt's first hardware implementation of the perceptron, to Minsky and Papert's XOR problem, Hinton's backpropagation, the Universal Approximation Theorem, the introduction of CNNs by LeCun, 2012's AlexNet, GANs by Goodfellow, and modern-day Transformers. Every step of the way brought improvements in architecture, optimization, and training techniques. As complexity increased with each new discovery, so did the methods for identifying improved techniques that extend the utility of these designs across domains. In neural network training, improvements to the feedforward network can come from the weights, the way we calculate gradients, the activations, and so on. With recent advances, optimizing at every step of the process has become a mandatory requirement; every opportunity to optimize should be taken when training models. Before going further, let us establish a few definitions.
Input Space: \(\mathcal{X} \subset \mathbb{R}^{n}\).
Linear Transformation: Since \(T = 1\), we move immediately to the final output calculation. For each \(x \in \mathcal{X}\), set \(h_0(x) := x\). The logit is computed as: \[z_1(x) := f_1(W_1, h_0(x)) + b_1\] Here, \(f_1: \mathbb{R}^{n \times n} \times \mathbb{R}^{n} \to \mathbb{R}^{n}\) is a bilinear map (specifically, matrix–vector multiplication), \(W_1 \in \mathbb{R}^{n \times n}\), and \(b_1 \in \mathbb{R}^{n}\).
Activation Function: The Perceptron specifically uses the Heaviside step function, applied element-wise, as its non-linearity \(\sigma: \mathbb{R}^{n} \to \mathbb{R}^{n}\): \[\sigma(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}\]
Final Output: The network function \(f: \Theta \times \mathcal{X} \to \mathcal{Y}\) is defined as: \[f(\theta, x) := \sigma(z_1(x))\]
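To make the definitions concrete, here is a minimal sketch of the perceptron forward pass in NumPy, following the notation above (the names `heaviside` and `perceptron_forward` are ours, for illustration):

```python
import numpy as np

def heaviside(z):
    # Element-wise Heaviside step: 1 where z > 0, else 0.
    return (z > 0).astype(float)

def perceptron_forward(W1, b1, x):
    # z_1(x) = W_1 h_0(x) + b_1 with h_0(x) = x, then f(theta, x) = sigma(z_1(x)).
    z1 = W1 @ x + b1
    return heaviside(z1)

# Example: a 2-dimensional input.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))
b1 = np.zeros(2)
print(perceptron_forward(W1, b1, np.array([1.0, -2.0])))
```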
Although the perceptron was revolutionary for its time, it had quite a few problems: it could only handle linearly separable spaces, it had no hidden layers, and so on. Although it sparked huge discussions in the scientific community, it could not compute even simple functions such as XOR. This limitation eventually led to feedforward neural networks: adding multiple layers and introducing non-linearity solved the problem, and culminated in a fundamental result, groundbreaking to say the least, the universal approximation theorem.
Feedforward Neural Networks
Shallow feedforward networks (one hidden layer) can approximate any continuous function on a compact (closed and bounded) subset of \(\mathbb{R}^n\) to any desired accuracy, provided they have enough neurons in that hidden layer. Given enough neurons, you can in principle model any continuous function. In practice, however, the theorem guarantees neither generalization to new data nor that the network can be trained easily; it only demonstrates the expressivity of neural networks. We now formalize the feedforward network before moving on to optimization.
Input Space: \(\mathcal{X} \subset \mathbb{R}^{n}\).
Recursion: For each \(x \in \mathcal{X}\), set \(h_0(x) := x\). For each layer \(t = 1, \dots, T+1\): \[ \begin{aligned} z_t(x) &:= f_t(W_t, h_{t-1}(x)) + b_t, \\ h_t(x) &:= \sigma(z_t(x)), \end{aligned} \] where \(f_t: \mathbb{R}^{d_t \times d_{t-1}} \times \mathbb{R}^{d_{t-1}} \to \mathbb{R}^{d_t}\) is a bilinear map (e.g., dense matrix multiplication, convolution, etc.), \(W_t \in \mathbb{R}^{d_t \times d_{t-1}}\) are the weights, and \(b_t \in \mathbb{R}^{d_t}\) the bias. The final output is taken as: \[ f(\theta, x) := z_{T+1}(x). \]
Activation Function: The nonlinearity \(\sigma: \mathbb{R}^{d_t} \to \mathbb{R}^{d_t}\) is applied element‑wise (e.g., ReLU, sigmoid, tanh) and is the same for all hidden layers \(t = 1, \dots, T\). The output layer \(t = T+1\) typically uses no activation (i.e., \(\sigma\) is the identity) unless otherwise specified.
Final Output: The network function \(f: \Theta \times \mathcal{X} \to \mathcal{Y}\) is given by: \[ f(\theta, x) = z_{T+1}(x), \] with the parameter set \(\theta = (W_t, b_t)_{1 \le t \le T+1}\).
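A minimal sketch of this recursion in NumPy (assuming ReLU hidden activations and an identity output layer; all names are ours):

```python
import numpy as np

def mlp_forward(weights, biases, x):
    # weights = [W_1, ..., W_{T+1}], biases = [b_1, ..., b_{T+1}].
    h = x  # h_0(x) = x
    for t, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b                        # z_t(x) = W_t h_{t-1}(x) + b_t
        last = t == len(weights) - 1
        h = z if last else np.maximum(z, 0)  # identity at the output, ReLU elsewhere
    return h  # f(theta, x) = z_{T+1}(x)
```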
ERM
ERM (empirical risk minimization) is the idea of picking parameters that fit the data you have, in the hope that they also work well on unseen data: since the true risk is unavailable, we minimize the empirical risk as a proxy. The problem with ERM is that it only tells us to fit the data well; it says nothing about what to do when there are infinitely many ways to achieve the minimum, nor about which trajectory to take (put simply, it is adequate for classical settings such as linear regression, but not for neural networks).
In the case of overparameterized models there are infinitely many global minima, but classical ERM does not provide any principle for selecting among them. In practice the choice is made implicitly: real optimizers like mini-batch SGD run for a finite amount of time, use early stopping, and operate under finite precision. All of this introduces some amount of bias into which minimum is selected.
One of the key problems we have witnessed across optimizers, both in terms of weights and outputs, is how much they change under slight perturbations. In the following section, we introduce some concepts and definitions that will be useful for studying this further.
Lipschitz Continuity
A function \(f\) is \(L\)-Lipschitz if \(\|f(x) - f(y)\| \le L \|x - y\|\) for all \(x, y\); the smallest such \(L\) is its Lipschitz constant. The Lipschitz constant allows us to study the sensitivity of the output to changes in the input. If a trained model has a small Lipschitz constant, then slight changes in the input space can only lead to correspondingly slight changes in the output space. This is particularly important for alignment: with an unconstrained model, a small misspecification of the input could lead to an unfair reward advantage. A small Lipschitz constant helps prevent reward hacking and specification gaming, and it encourages the model to generalize better.
Per Rademacher's theorem (Simon et al., 1983), an \(L\)-Lipschitz function is differentiable almost everywhere, and wherever the gradient exists it is bounded: \(\|\nabla f\| \leq L\). Reciprocally, differentiable functions with gradient bounded by \(L\) are \(L\)-Lipschitz.
Another tool that will come in handy relates to the convexity of a function.
Convex functions
Next, we move on to an important class of functions called convex functions. A convex function is essentially shaped like a valley. These functions come with several nice properties that are particularly useful for our purposes. First, the chord test: the line segment joining any two points on the graph stays above the graph (or touches it). There are no hidden valleys: the only lowest point is the global minimum. For differentiable convex functions, the tangent line always lies under the curve. Stretching or shifting a convex function preserves convexity. Because of these properties, convex functions are particularly easy to optimize, which makes them especially interesting and useful in the study of neural networks.
Furthermore, if \(f\) is differentiable, then its gradient \(\nabla_x f\) is cyclically monotone (Peyré et al., 2017). Finally, if \(f\) is twice-differentiable, its Hessian \(H_x f\) is a positive semidefinite matrix.
In the following, the notation \(\|\cdot\|_2\) will denote the common Euclidean norm for vectors, i.e.: \[ \|x\|_2 = \sqrt{\sum_i x_i^2}, \tag{1.5} \] whereas for linear operators \(x \mapsto W x\) it will denote the spectral norm: \[ \|W\|_2 = \max_{\|x\|_2 \leq 1} \|W x\|_2 = \sigma_{\max}, \tag{1.6} \] where \(\sigma_{\max}\) is the largest singular value in the SVD of \(W\).
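As a quick numeric illustration (a sketch; the example matrix is ours), both norms can be computed with NumPy, where the spectral norm is the largest singular value:

```python
import numpy as np

x = np.array([3.0, 4.0])
W = np.array([[2.0, 0.0],
              [0.0, 1.0]])

print(np.linalg.norm(x))         # Euclidean norm (1.5): 5.0
print(np.linalg.norm(W, ord=2))  # spectral norm (1.6) = sigma_max = 2.0
print(np.linalg.svd(W, compute_uv=False)[0])  # same value via the SVD
```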
Problem of optimal transport
The theory of optimal transport, pioneered by Gaspard Monge in 1781 and revolutionized by Leonid Kantorovich in the 1940s, reveals that the problem of finding the most economical way to transport mass from one distribution to another is intimately linked to the geometry of distance metrics and the analytic structure of convex functions. The optimal transport problem begins with a simple and intuitive premise: given two probability distributions \(P\) and \(Q\) over a space \(\mathbb{R}^m\), what is the minimal expected cost required to reshape \(P\) into \(Q\)? For a ground cost \(d\), this is the Wasserstein distance: \[ W_1(P, Q) := \inf_{\pi \in \Pi(P, Q)} \int_{\mathbb{R}^m \times \mathbb{R}^m} d(x, y) \, d\pi(x, y), \tag{1.7} \] where \(\Pi(P, Q)\) denotes the set of couplings of \(P\) and \(Q\), i.e., joint distributions with marginals \(P\) and \(Q\). The optimal transport problem thus provides a mathematically rigorous way to compare probability distributions while respecting the underlying geometry of the space, with profound applications.
The right‑hand side of (1.7) corresponds to a Kantorovich problem. It is often presented as a relaxation of the Monge problem: \[ \inf_{T \# P = Q} \int_{\mathbb{R}^m} d(x, T(x)) \, dP(x). \tag{1.9} \] In this context, the (optimal) \(T: \mathbb{R}^m \to \mathbb{R}^m\) is called the (optimal) Monge map. For arbitrary measures, the two problems are different. The Kantorovich problem always admits a solution. This is not the case with the Monge problem. For example, for discrete measures with supports of size \(|P| < |Q|\), it is impossible to fulfill the pushforward condition \(T\#P = Q\) since \(|T\#P| \le |P| \le |Q|\). Nonetheless, when both \(P\) and \(Q\) admit a density (w.r.t. Lebesgue measure) then the two problems happen to be equivalent. In this case, the Monge problem corresponds to a deterministic coupling \(\pi(x, \cdot) = \delta_{T(x)}\).
By Remark 6.3 of Villani (2008) the Wasserstein‑1 distance is the Kantorovich‑Rubinstein distance: \[ W_1(P, Q) = \sup_{f \in \text{Lip}_1(X, \mathbb{R})} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{z \sim Q}[f(z)]. \tag{1.10} \]
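For one-dimensional empirical distributions, \(W_1\) can be computed directly; here is a small sketch using SciPy (the sample data is made up):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
p_samples = rng.normal(loc=0.0, scale=1.0, size=1000)  # samples from P
q_samples = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples from Q

# W_1 between the two empirical distributions (close to the mean shift, 2.0).
print(wasserstein_distance(p_samples, q_samples))
```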
In our case we are working with neural networks that are Lipschitz w.r.t. \(\ell_2\) distance, so we have \(d(x, y) := \|x - y\|_2\). This chapter is one of the first motivations for the design of Lipschitz networks, and a strong inspiration for most of the work throughout the thesis, notably chapters...
Binary cross-entropy loss
Binary Cross-Entropy (BCE) Loss, the confidence-based scorer. Imagine you are teaching a child to tell cats from dogs. You show a picture and ask: "How sure are you that this is a cat?" The child answers with a number between "definitely a dog" and "definitely a cat." The BCE loss measures how surprised you should be by that answer.

In machine learning, a binary classifier usually produces a raw score (called a logit) and then squeezes it through a soft, S-shaped curve (the logistic function) to turn it into a probability between 0 and 1. The BCE loss then compares the true label (say, 1 for cat, 0 for dog) with the predicted probability. If the predicted probability matches the true label very well (e.g., 0.99 for "cat" when the picture really is a cat), the loss is tiny. If the prediction is wrong (e.g., 0.01 for "cat" when it is actually a cat), the loss becomes enormous; it grows faster and faster as the mistake gets worse.

There is an extra knob, the (inverse) temperature, often written τ, which scales the logit. Turning τ up makes the network more confident: even a small positive score gets pushed close to 0 or 1. Turning it down makes the predictions softer, more hesitant. Why does that matter? If the network is already constrained to change gently (a Lipschitz network), the temperature becomes a powerful tool to control how "sharp" or "smooth" the final decisions are. A softer setting often improves generalization: the network does not scream "I am 100% sure" on every training example, so it stays more flexible on new data.

BCE is loved because it gives smooth, well-behaved gradients (the directions in which to adjust the network) and naturally produces probabilities that can be interpreted as confidence. It is the standard choice for binary classification in almost all modern deep learning frameworks.
Hinge loss
Now think of a different teacher. Instead of asking "how sure are you?", they ask: "Is your decision separated by a clear border?" The Hinge loss is the brain behind Support Vector Machines, but it works equally well in neural networks.

The Hinge loss cares about a margin, a buffer zone around the decision boundary. For a given training example, the network produces a raw score (positive for one class, negative for the other). The loss checks whether that score is far enough away from zero in the correct direction. If the score is, say, +5 for a cat picture and the margin is set to 1, then the loss is zero: the example is safe and well on the correct side. If the score is +0.5, still correctly positive but inside the margin, the loss is a positive number proportional to how much we still need to push it outward. If the sign is wrong (negative for a cat), the loss gets even bigger.

The size of the margin (commonly written as m) is a hyper-parameter. A larger margin demands that examples be farther from the decision boundary. That is great for robustness: if a small perturbation to the input cannot move the score across zero because you have a thick margin, the classifier is inherently more stable. Lipschitz networks pair beautifully with Hinge loss because the Lipschitz constant tells you exactly how far a small input change can move the output. The margin divided by the Lipschitz constant gives a guaranteed radius of safety: no misclassification within that radius.

Unlike BCE, Hinge loss is piecewise linear. It is completely flat for examples that already satisfy the margin; such examples stop influencing training. That creates a sparse set of "support vectors" (only the violators matter), which can be efficient but sometimes leads to slower learning because gradient signals become sparse.
The Binary Cross‑Entropy (BCE) loss (also called log loss) is among the most popular choices of loss within the deep learning community. Let \(f: \mathbb{R}^n \to \mathbb{R}\) be a neural network. For an example \(x \in \mathbb{R}^n\) with label \(y \in \{-1, +1\}\), and \(\sigma(x) = \frac{1}{1 + \exp(-x)}\) the logistic function mapping logits to probabilities, the BCE is written: \[ L_{\tau}^{\text{bce}}(f(x), y) = -\log \sigma\bigl(y \tau f(x)\bigr), \] with (inverse) temperature / scaling parameter \(\tau > 0\). This hyper‑parameter of the loss defaults to \(\tau = 1\) in most frameworks such as TensorFlow or PyTorch. Note that \(L_{\tau}^{\text{bce}}(f(x), y) = L_{1}^{\text{bce}}(\tau f(x), y)\), so we can equivalently tune \(\tau\) or the Lipschitz constant \(L\). We show in Chapter 4 that for LipNet the temperature \(\tau\) allows us to control the generalization gap.
The Hinge loss \[ L_{m}^{\text{H}}(f(x), y) = \max\bigl(0,\; m - y f(x)\bigr) \] with margin \(m > 0\) is also of particular interest.
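A small sketch of both losses in NumPy, following the formulas above (labels in \(\{-1, +1\}\); function names are ours):

```python
import numpy as np

def bce_loss(logit, y, tau=1.0):
    # L^bce_tau(f(x), y) = -log sigma(y * tau * f(x)), sigma = logistic.
    # Uses -log sigma(t) = log(1 + exp(-t)), computed stably via logaddexp.
    return np.logaddexp(0.0, -y * tau * logit)

def hinge_loss(logit, y, m=1.0):
    # L^H_m(f(x), y) = max(0, m - y * f(x)).
    return np.maximum(0.0, m - y * logit)

print(bce_loss(2.0, +1))    # confident and correct: small loss
print(bce_loss(2.0, -1))    # confident and wrong: large loss
print(hinge_loss(0.5, +1))  # inside the margin: positive loss (0.5)
print(hinge_loss(5.0, +1))  # outside the margin: zero loss
```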
Applying the Lipschitz constant to classification settings
A classifier is defined as:
- Binary: \(h(x) = \operatorname{sign}(f(x))\)
- Multiclass: \(\hat{k}(x) = \arg\max_{k \in \{1,\dots,K\}} f_k(x)\)
Binary case: The network produces a single number. If that number is positive, the classifier says "class A"; if negative, it says "class B". The magnitude of the number (how far away from zero it is) acts as a confidence or margin.

Multiclass case: The network produces a list of numbers, one for each class (e.g., cat, dog, bird). The classifier picks the class with the highest number. The gap between the highest number and the second highest is the margin: a measure of how clearly the network prefers one class over the others.
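The margin and the Lipschitz constant combine into a certified robustness radius. A sketch under the usual assumptions (binary: radius \(|f(x)|/L\); multiclass: the top-two gap divided by \(\sqrt{2}\,L\), as in standard margin-based certificates for \(L\)-Lipschitz networks):

```python
import numpy as np

def certified_radius_binary(fx, L):
    # No perturbation with ||delta||_2 < |f(x)| / L can flip sign(f(x)).
    return abs(fx) / L

def certified_radius_multiclass(logits, L):
    # Top-two margin; the sqrt(2) factor accounts for two logits moving at once.
    top2 = np.sort(logits)[-2:]
    return (top2[1] - top2[0]) / (np.sqrt(2) * L)

print(certified_radius_binary(0.8, L=1.0))                       # 0.8
print(certified_radius_multiclass(np.array([2.0, 0.5, -1.0]), L=1.0))
```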
Lipschitz-constrained neural networks
In a Lipschitz-constrained neural network, each individual layer (whether it is an affine layer or an activation) has its own local Lipschitz constant: a number that tells you the maximum amplification of distances that this layer can produce. For example, if you move the input to that layer by one unit, the output can move at most by that layer's Lipschitz constant. For activation functions like ReLU or sigmoid, this constant is at most 1 (they do not stretch distances). For an affine layer (multiplying by a weight matrix and adding a bias), the Lipschitz constant equals the largest singular value of the weight matrix: essentially the maximum factor by which the layer can stretch any direction.
A Lipschitz-constrained neural network is a feed-forward network such that for each layer \(d \in \{1,\dots,D\}\), the mapping \(x_d \mapsto f_d(\theta_d, x_d)\) is \(L_d\)-Lipschitz for all \(\theta_d\), i.e., \[ \forall x_d, y_d \in \mathbb{R}^{n_d}, \quad \| f_d(\theta_d, x_d) - f_d(\theta_d, y_d) \|_2 \le L_d \| x_d - y_d \|_2. \] Then the overall function \(x \mapsto f(\theta, x)\) is \(L\)-Lipschitz with \[ L \le \prod_{d=1}^D L_d \qquad \forall \theta \in \Theta. \] Let \(\text{LipNet}_1\) denote the class of Lipschitz-constrained networks for which the global Lipschitz constant satisfies \(L = 1\). For simplicity, all networks in this manuscript are considered \(L\)-Lipschitz with respect to the Euclidean \(\ell_2\)-norm.
Lower bounds
In practice, the true Lipschitz constant of a network can be lower-bounded empirically over a finite sample \(S\) of inputs, using one of the following estimators (a code sketch of the pairwise estimator follows the list):
- Maximum gradient norm (for differentiable \(f\)): \[ \widehat{L}_{\nabla} = \max_{x \in S} \|\nabla_x f(x)\|_2. \]
- Maximum pairwise finite difference: \[ \widehat{L}_{\text{pair}} = \max_{\substack{x,y \in S \\ x \neq y}} \frac{\|f(x)-f(y)\|_2}{\|x-y\|_2}. \]
- Maximum over adversarial perturbations: \[ \widehat{L}_{\text{adv}} = \max_{x \in S} \frac{\|f(x)-f(x+\delta_x)\|_2}{\|\delta_x\|_2}, \] where \(\delta_x\) is an adversarial attack (e.g., projected gradient descent) maximizing \(\|f(x+\delta)-f(x)\|_2\) subject to \(\|\delta\|_2 \le \varepsilon\).
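A sketch of the pairwise estimator over a random sample (the model and sample here are placeholders):

```python
import numpy as np

def lipschitz_lower_bound_pairwise(f, S):
    # max over pairs of ||f(x) - f(y)|| / ||x - y||; a lower bound on L.
    best = 0.0
    for i in range(len(S)):
        for j in range(i + 1, len(S)):
            num = np.linalg.norm(f(S[i]) - f(S[j]))
            den = np.linalg.norm(S[i] - S[j])
            if den > 0:
                best = max(best, num / den)
    return best

# Example with a 1-Lipschitz map (the estimated bound is <= 1).
rng = np.random.default_rng(0)
S = list(rng.normal(size=(50, 3)))
print(lipschitz_lower_bound_pairwise(lambda x: np.clip(x, 0, None), S))
```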
Formal methods for upper bounds
AutoLip
Layer‑wise composition bound: spectral norm product → fast global Lipschitz estimate.
Linear Relaxation
Decomon / auto‑LiRPA: local Lipschitz via convex over‑approximations of activations.
LipSDP
Semidefinite programming relaxation – tight global Lipschitz for many activations.
Polyhedron Propagation
Forward propagation of convex polyhedra → max spectral norm over activation patterns.
LiPopt
Difference‑of‑convex programming: sequential convex optimization for tight upper bounds.
Next we move on to discussing how to enforce Lipschitz bounds in practice.
Enforcing Lipschitz Bounds
Let \(f: \mathbb{R}^m \to \mathbb{R}^n\) be a neural network. To enforce a global Lipschitz bound in practice, we use activations with Lipschitz constant \(l_d\) and apply a constraint \(\Pi : \mathbb{R}^p \to \Theta\) on the weights of affine layers. For an affine layer \(f_d(z) = W_d z + b_d\), the minimal Lipschitz constant is the spectral norm \[ l_d = \|W_d\|_2 := \max_{\|x\|_2 \le 1} \|W_d x\|_2, \] where \(\|x\|_2\) and \(\|W_d x\|_2\) are Euclidean (vector) norms. The constraint set is \[ \Theta = \{\, W_d \mid \|W_d\|_2 \le l_d \,\}. \]
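In practice, the spectral norm is typically estimated with power iteration during training; a minimal sketch (all names are ours):

```python
import numpy as np

def spectral_norm_power_iteration(W, num_iters=50):
    # Estimates sigma_max(W) = ||W||_2 by iterating v -> W^T W v.
    rng = np.random.default_rng(0)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = W @ v
        v = W.T @ u
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

W = np.array([[3.0, 0.0], [4.0, 5.0]])
sigma = spectral_norm_power_iteration(W)
W_normalized = W / sigma  # spectral normalization: ||W_normalized||_2 ~= 1
print(sigma, np.linalg.norm(W_normalized, ord=2))
```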
The seminal work of Anil et al. (2019) proved that universal approximation in the set of \(L\)-Lipschitz functions is achievable with certain weight constraints and the GroupSort activation function. GroupSort operates on pairs of consecutive neurons, not elementwise. For all coordinates \(i\) (with zero‑based indexing), it is defined as: \[ \operatorname{GroupSort}_2(x)_{2i,\,2i+1} = \bigl[ \min(x_{2i}, x_{2i+1}),\; \max(x_{2i}, x_{2i+1}) \bigr]. \]
The theorem of Anil et al. (2019) bounds the “two‑to‑infinity” norm \(\|\cdot\|_{2\to\infty}\) of the first layer and the \(\|\cdot\|_\infty\) norms of further layers to obtain universal approximation in \(\operatorname{Lip}_1(\mathcal{X}, \mathbb{R})\). These networks are \(1\)-Lipschitz by construction. Because \(\operatorname{Lip}_L(\mathcal{X}, \mathbb{R}^K) = \{ L f \mid f \in \operatorname{Lip}_1(\mathcal{X}, \mathbb{R}^K) \}\), the same architecture can approximate any \(L\)-Lipschitz function after scaling the output by \(L\).
As a worked example (implemented in the sketch after these bullets), the following 1-Lipschitz GroupSort network computes \(x \mapsto \tfrac{1}{2}|x|\) for a scalar input \(x\):
- First affine layer: \(W_1 = [0.5,\,-0.5,\,0.5,\,-0.5]^{\mathsf{T}}\), \(b_1=0\). Then \(\|W_1\|_2 = 1\) and \(\|W_1\|_{2\to\infty}\) is bounded.
- GroupSort on two pairs yields \(h = [-0.5|x|,\,0.5|x|,\,-0.5|x|,\,0.5|x|]^{\mathsf{T}}\).
- Second layer: \(W_2 = [0,1,0,0]\), \(b_2=0\). Its infinity norm is \(\|W_2\|_\infty = 1\).
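A sketch of the GroupSort₂ activation and of the worked example above in NumPy:

```python
import numpy as np

def group_sort_2(x):
    # Sorts each consecutive pair (x_{2i}, x_{2i+1}) into (min, max).
    pairs = x.reshape(-1, 2)
    return np.sort(pairs, axis=1).reshape(-1)

def abs_half_network(x):
    # The worked example: computes |x| / 2 with a 1-Lipschitz network.
    W1 = np.array([0.5, -0.5, 0.5, -0.5])
    h = group_sort_2(W1 * x)       # first affine layer, then GroupSort
    W2 = np.array([0.0, 1.0, 0.0, 0.0])
    return W2 @ h                  # second affine layer selects 0.5 |x|

for x in (-3.0, 0.0, 4.0):
    print(x, abs_half_network(x))  # prints 1.5, 0.0, 2.0
```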
Lipschitz Constraint Methods
| Method | Type | Guarantee | Key Property |
|---|---|---|---|
| Orthogonal weights (GNP) | Hard constraint | \(\|W\|_2 = 1\) | Gradient norm preserving |
| Spectral normalization | Hard constraint | \(\|W\|_2 = 1\) | Power iteration (lower bound) |
| Differentiable reparametrization | Hard constraint | \(\|W\|_2 \le L_d\) | Unconstrained optimization |
| Projected gradient descent | Hard constraint | \(\|W\|_2 \le L_d\) | Projection onto \(\Theta\) |
| Weight clipping | Hard constraint | \(\|W\|_2 \le c\sqrt{\min(n,m)}\) | Coefficient‑wise projection onto \([-c,c]\) |
| Frobenius normalization | Hard constraint | \(\|W\|_2 \le 1\) | Projection \(W \leftarrow W / \|W\|_F\) |
| Direct parametrization (dynamical) | Hard constraint | Whole block 1‑Lipschitz | \(z = x - \frac{2}{\|W\|_2^2} W^{\mathsf{T}}\sigma(Wx+b)\) |
| Sandwich layer (QP formulation) | Hard constraint | \(\| \text{block} \|_2 \le 1\) | \(x \mapsto W_2^{\mathsf{T}} \sigma(W_1 x + b)\) |
| Gradient penalty | Soft penalty | None (crude upper bound) | Nested differentiation |
| Spectral regularization | Soft penalty | None | Penalty on \(\|W_i\|_2^2\) |
One effective way to enforce structural constraints—such as orthogonality, low rank, or unit norm—on weight vectors is to map them onto a manifold. By definition, a manifold is a space that locally resembles Euclidean space but may have a more complex global geometry; constraining weights to lie on a specific manifold guarantees that they satisfy the desired properties at every step of training. To implement this idea in practice, we must first give a precise mathematical definition of a manifold and then choose an appropriate distance metric that measures how far two points are on that manifold, as well as an optimization method that respects the manifold’s geometry. Different manifolds call for different metrics and optimizers; we now formalize these concepts and highlight the most relevant combinations for typical machine learning tasks.
Mathematical definition of a manifold
- Locally Euclidean of dimension \(n\): For every point \(p \in M\), there exists an open neighbourhood \(U \subseteq M\) of \(p\), an open set \(\widetilde{U} \subseteq \mathbb{R}^n\), and a homeomorphism \(\varphi: U \to \widetilde{U}\). The pair \((U, \varphi)\) is called a chart (or coordinate chart) around \(p\). The homeomorphism \(\varphi\) is often written as \(\varphi(q) = (x^1(q), \dots, x^n(q))\), where \(x^i(q)\) are the local coordinates of \(q\).
- Hausdorff property: For any two distinct points \(p, q \in M\), there exist disjoint open sets \(U_p, U_q \subseteq M\) such that \(p \in U_p\) and \(q \in U_q\).
- Second-countability: The topology of \(M\) has a countable base. That is, there exists a countable collection \(\{B_i\}_{i \in \mathbb{N}}\) of open sets such that every open set in \(M\) can be expressed as a union of some of the \(B_i\).
- An atlas on \(M\) is a collection of charts \(\mathcal{A} = \{(U_\alpha, \varphi_\alpha)\}_{\alpha \in I}\) such that the domains cover \(M\): \(\bigcup_{\alpha \in I} U_\alpha = M\).
- Two charts \((U_1, \varphi_1)\) and \((U_2, \varphi_2)\) are called \(C^\infty\)-compatible (or smoothly compatible) if either \(U_1 \cap U_2 = \emptyset\) or the transition map \[ \varphi_2 \circ \varphi_1^{-1}: \varphi_1(U_1 \cap U_2) \longrightarrow \varphi_2(U_1 \cap U_2) \] is a smooth diffeomorphism (i.e., infinitely differentiable with a smooth inverse) between open subsets of \(\mathbb{R}^n\).
- A \(C^\infty\)-atlas is an atlas in which every two charts are \(C^\infty\)-compatible.
Matrix Manifolds for Weight Constraints
| Manifold | Type | Dimension | Key Property | Typical Optimizer |
|---|---|---|---|---|
| Stiefel \(\operatorname{St}(n,m)\) | Hard constraint | \(mn - n(n+1)/2\) | \(X^\top X = I_n\) (orthonormal columns) | Riemannian SGD, QR retraction |
| Grassmann \(\operatorname{Gr}(k,n)\) | Hard constraint | \(k(n-k)\) | Subspaces: \(X \sim XQ,\; Q\in O(k)\) | Riemannian CG, chordal distance |
| Oblique \(\operatorname{OB}(m,n)\) | Hard constraint | \(m(n-1)\) | Each row has unit norm | Row‑wise normalisation |
| Orthogonal \(O(n)\) | Hard constraint | \(n(n-1)/2\) | \(Q^\top Q = I_n\) | Cayley transform, exponential map |
| Fixed‑rank \(\mathcal{M}(m,n,r)\) | Hard constraint | \(r(m+n-r)\) | \(\operatorname{rank}(X)=r\) | Riemannian GD, SVD retraction |
| Flag \(\operatorname{Fl}(d_1,\dots,d_k,n)\) | Hard constraint | \(\frac12(n^2-\sum_i (d_i-d_{i-1})^2)\) | Nested subspaces \(V_1\subset\cdots\subset V_k\) | Hierarchical Riemannian CG |
| PSD cone \(\mathcal{S}_+^n\) | Hard constraint (interior) | \(n(n+1)/2\) | \(X \succeq 0\) (positive semidefinite) | Log‑Euclidean, Bures‑Wasserstein |
| Fixed‑rank PSD | Hard constraint | \(r(2n-r+1)/2\) | \(X\succeq 0,\; \operatorname{rank}(X)=r\) | Bures‑Wasserstein gradient |
| Finsler (spectral norm) | Hard constraint (norm) | \(mn\) (ambient) | \(\|W\|_2 = 1\) (non‑Riemannian) | Muon, Manifold Muon |
On a manifold, the choice of distance metric determines the geometry of the optimization landscape. The most fundamental is the Riemannian geodesic distance, defined as the infimum of lengths of all smooth curves connecting two points. For matrix manifolds, simpler alternatives are often used: the chordal distance on Grassmannians (based on projection matrices), the Procrustes distance on Stiefel (aligning bases via orthogonal transformations), and the Bures–Wasserstein distance for fixed‑rank positive semidefinite matrices. Optimizers on manifolds replace the Euclidean gradient with its Riemannian gradient (projection onto the tangent space) and use a retraction to map updates back to the manifold. Common optimizers include Riemannian SGD, Riemannian Adam, Cayley SGD (for orthogonal groups), Muon (steepest descent under the spectral norm, a Finsler geometry), Manifold Muon (extended to Stiefel), and the landing algorithm (soft constraint penalty). The lists below give these distances and optimizers alongside their associated manifolds, followed by a small code sketch.
- Geodesic distance – general Riemannian manifolds; computed via principal angles for Stiefel/Grassmann.
- Chordal distance – Grassmann manifold: \(d_{\text{ch}}([X],[Y]) = \frac{1}{\sqrt{2}}\|XX^\top - YY^\top\|_F\).
- Procrustes distance – Stiefel manifold: \(d_{\text{Proc}}(X,Y) = \min_{Q\in O(k)} \|X - YQ\|_F\).
- Bures–Wasserstein distance – fixed‑rank PSD matrices: \(d_{\text{BW}}^2(X,Y) = \operatorname{tr}(X)+\operatorname{tr}(Y)-2\operatorname{tr}\bigl((X^{1/2}YX^{1/2})^{1/2}\bigr)\).
- Log‑Euclidean distance – full‑rank SPD cone: \(d_{\text{LE}}(X,Y) = \|\log(X) - \log(Y)\|_F\).
- Riemannian SGD – generic; uses retraction (e.g., QR for Stiefel).
- Riemannian Adam – adaptive moment estimation on tangent spaces with vector transport.
- Cayley SGD – orthogonal group; replaces exponential map with cheaper Cayley transform.
- Muon – unconstrained matrix space under spectral norm; update \(-UV^\top\), where \(G = U\Sigma V^\top\) is the SVD of the gradient.
- Manifold Muon – Stiefel manifold: project gradient, compute top singular vectors, project back, retract.
- Landing algorithm – enforces constraint via increasing penalty; no explicit retraction.
- Riemannian trust‑region – second‑order method for matrix manifolds (e.g., Grassmann).
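To make the pattern concrete, here is a sketch of one Riemannian SGD step on the Stiefel manifold with a QR retraction (the tangent projection formula follows standard references; treat this as an illustration under those assumptions, not a definitive implementation):

```python
import numpy as np

def stiefel_project_tangent(X, G):
    # Riemannian gradient (Euclidean metric): remove the symmetric part of X^T G.
    sym = (X.T @ G + G.T @ X) / 2.0
    return G - X @ sym

def qr_retraction(Y):
    # Map an ambient point back to the Stiefel manifold via thin QR,
    # with column signs fixed so the factorization is unique.
    Q, R = np.linalg.qr(Y)
    return Q * np.sign(np.diag(R))

def riemannian_sgd_step(X, G, eta):
    xi = stiefel_project_tangent(X, G)   # project Euclidean gradient
    return qr_retraction(X - eta * xi)   # step, then retract

# Example: X in St(5, 2), random Euclidean gradient.
rng = np.random.default_rng(0)
X = qr_retraction(rng.normal(size=(5, 2)))
X_new = riemannian_sgd_step(X, rng.normal(size=(5, 2)), eta=0.1)
print(np.allclose(X_new.T @ X_new, np.eye(2)))  # True: columns stay orthonormal
```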
Spherical Manifold – Definition & Hyperspherical Descent
Definition: The sphere (hypersphere) is the set of unit‑norm vectors in \(\mathbb{R}^d\):
\[ S^{d-1} = \{ w \in \mathbb{R}^d \mid \|w\|_2 = 1 \}. \]
Tangent space at \(w\): \( T_w S^{d-1} = \{ v \in \mathbb{R}^d \mid v^\top w = 0 \} \).
Riemannian metric: \( g_w(v_1, v_2) = v_1^\top v_2 \).
Geodesic distance: \( d(w_1, w_2) = \arccos(w_1^\top w_2) \).
Retraction (normalisation): \( \operatorname{Retr}_w(v) = \frac{w + v}{\|w + v\|_2} \).
Optimization problem: Minimise loss \(f(w)\) subject to \(w \in S^{d-1}\). Given Euclidean gradient \(g = \nabla f(w)\), find update \(a \in T_w S^{d-1}\) that:
- Minimises \(a^\top g\) (linearised loss change),
- Satisfies \(\|a\|_2 = \eta\) (learning rate),
- Lies in tangent space: \(a^\top w = 0\).
Lagrange multiplier solution yields:
\[ a_{\text{opt}} = -\eta \; \frac{g - (w^\top g) w}{\|g - (w^\top g) w\|_2}. \]
Step‑by‑step hyperspherical descent algorithm:
- Initialise \(w = w_0 / \|w_0\|_2\).
- Compute Euclidean gradient \(g = \nabla f(w)\).
- Project \(\tilde{g} = g - (w^\top g) w\).
- If \(\|\tilde{g}\|_2 = 0\), stop (stationary point).
- Compute update direction \(a = -\eta \cdot \tilde{g} / \|\tilde{g}\|_2\).
- Tentative update \(w_{\text{temp}} = w + a\).
- Retract \(w \leftarrow w_{\text{temp}} / \|w_{\text{temp}}\|_2\) (equivalent to \(w \leftarrow \frac{w - \eta \tilde{g}/\|\tilde{g}\|_2}{\sqrt{1+\eta^2}}\)).
- Repeat from the gradient computation until convergence.
Python implementation:
```python
import numpy as np

def hyperspherical_descent(f, grad_f, w0, eta, num_iters):
    # f is kept in the signature for API symmetry; only its gradient is used.
    w = w0 / np.linalg.norm(w0)               # initialise on the sphere
    for _ in range(num_iters):
        g = grad_f(w)                         # Euclidean gradient
        g_tilde = g - np.dot(w, g) * w        # project onto tangent space at w
        norm_g_tilde = np.linalg.norm(g_tilde)
        if norm_g_tilde == 0:                 # stationary point
            break
        a = -eta * (g_tilde / norm_g_tilde)   # fixed-length tangent step
        w = (w + a) / np.linalg.norm(w + a)   # retraction: renormalise
    return w
```
Remarks: \(\eta\) is the Euclidean step length in the tangent space. For small \(\eta\), the retraction approximates Riemannian gradient descent. Convergence to a stationary point holds under standard Lipschitz gradient assumptions. The normalisation retraction is the simplest and most common.
Analytical Derivation – Optimal Update on the Hypersphere
We derive the optimal update from first principles using Lagrange multipliers, exactly as done for the Stiefel manifold in the Manifold Muon text.
Problem setup. Current point \(w \in S^{n-1} \subset \mathbb{R}^n\) with \(\|w\|_2 = 1\); Euclidean gradient \(g = \nabla f(w) \in \mathbb{R}^n\); learning rate \(\eta > 0\). We seek an update vector \(a \in \mathbb{R}^n\) that minimises the linearised loss change \(a^\top g\) subject to:
\[ \text{(i)}\; \|a\|_2 = \eta, \qquad \text{(ii)}\; a^\top w = 0. \]
1. Lagrangian formulation. Introduce multipliers \(\lambda\) (norm constraint) and \(\mu\) (orthogonality):
\[ \mathcal{L}(a,\lambda,\mu) = a^\top g + \frac{\lambda}{2}(a^\top a - \eta^2) + \mu (a^\top w). \]
Derivative w.r.t. \(a\): \(g + \lambda a + \mu w = 0 \;\Rightarrow\; a = -\frac{1}{\lambda}(g + \mu w).\)
2. Enforce constraints. From \(a^\top w = 0\):
\[ -\frac{1}{\lambda}(g^\top w + \mu (w^\top w)) = 0 \;\Rightarrow\; g^\top w + \mu = 0 \;\Rightarrow\; \mu = -g^\top w. \]
Thus \(a = -\frac{1}{\lambda}(g - (g^\top w) w)\). Define the tangent component:
\[ \tilde{g} = g - (g^\top w) w \in T_w S^{n-1}. \]
Now enforce \(\|a\|_2 = \eta\): \(\|-\frac{1}{\lambda} \tilde{g}\|_2 = \frac{\|\tilde{g}\|_2}{|\lambda|} = \eta \;\Rightarrow\; |\lambda| = \frac{\|\tilde{g}\|_2}{\eta}.\)
Choose \(\lambda > 0\) (descent direction) to obtain:
\[ a_{\text{opt}} = -\eta \,\frac{\tilde{g}}{\|\tilde{g}\|_2}. \]
3. Retraction (radial projection). After the update \(w' = w + a_{\text{opt}}\), the point leaves the sphere. Because \(a_{\text{opt}} \perp w\):
\[ \|w + a_{\text{opt}}\|_2^2 = \|w\|_2^2 + \|a_{\text{opt}}\|_2^2 = 1 + \eta^2. \]
Hence the normalised (retracted) point is
\[ w_{\text{new}} = \frac{w + a_{\text{opt}}}{\sqrt{1+\eta^2}}. \]
4. Full update rule. Substituting \(a_{\text{opt}}\):
\[ w_{\text{new}} = \frac{w - \eta \,\frac{\tilde{g}}{\|\tilde{g}\|_2}}{\sqrt{1+\eta^2}}, \qquad \tilde{g} = g - (g^\top w) w. \]
Equivalently, in a single line:
\[ \boxed{w \;\leftarrow\; \frac{w - \eta\,\frac{g - (w^\top g) w}{\|g - (w^\top g) w\|_2}}{\sqrt{1+\eta^2}}} \]
5. Comparison with Stiefel (Manifold Muon).
| Aspect | Sphere \(S^{n-1}\) | Stiefel \(\operatorname{St}(m,n)\) |
|---|---|---|
| Constraint | \(w^\top w = 1\) | \(W^\top W = I_n\) |
| Tangent condition | \(a^\top w = 0\) | \(A^\top W + W^\top A = 0\) |
| Norm used | Euclidean (\(\ell_2\)) | Spectral norm |
| Optimal update direction | \(-\eta \, \tilde{g} / \|\tilde{g}\|_2\) | \(-\eta \, \operatorname{msign}(G + 2W(\Lambda+\Lambda^\top))\) (dual ascent) |
| Retraction | \(w \leftarrow (w + a)/\sqrt{1+\eta^2}\) | \(W \leftarrow \operatorname{msign}(W)\) |
The spherical case is simpler because the tangent space is a linear subspace and the norm constraint is isotropic. The Stiefel case requires solving a dual problem (\(\Lambda\) ascent) due to the non‑trivial coupling of columns via the spectral norm.
6. Algorithm summary (hyperspherical descent).
Input: current \(w\) on sphere, gradient \(g\), learning rate \(\eta\).
Compute tangent gradient: \(\tilde{g} = g - (w^\top g) w\).
If \(\tilde{g} = 0\): stationary point → skip update.
Else set \(a = -\eta \, \tilde{g} / \|\tilde{g}\|_2\) and update \(w \leftarrow (w + a) / \sqrt{1+\eta^2}\).
Optionally renormalise for numerical stability.
[Figure: some example manifold shapes]