A Healthy AI is more accurate


When things are in control, they run their natural course. When that control breaks down in the human body, cells form lumps; if they keep growing unchecked, they become cancer. In a garden, unwanted growth that overtakes the intended plants is called a weed. The same principle applies to neural network weights: uncontrolled growth or distortion leads to a misrepresentative model that fails to generalize.

Normalization is the process of scaling values into an acceptable range — for instance, to enable fair comparison or improve modeling — since extremely large or small values are often unhelpful.

Foundational Callouts

Prerequisite — Linear algebra
Understanding how matrices act as linear maps, and the role of eigenvalues and singular values, is essential for analyzing how information propagates through a network.
Prerequisite — Real analysis
Concepts like continuity, differentiability, and the distinction between pointwise and uniform convergence appear throughout the theory of function approximation.
Prerequisite — Probability basics
Many empirical estimates involve expectations over data distributions; understanding variance and concentration helps interpret sample-based bounds.

§1 Introduction


The training of neural networks is often compared to how the brain learns: billions of biological neurons, connected through complex, intertwined systems of electrical signals, combine to form cognition. When the perceptron was introduced in the late 1950s, the hope was to build a network strong enough, or big enough, that it would eventually replicate the human brain. Over the years the field advanced step by step: Rosenblatt's first hardware implementation of the perceptron, Minsky and Papert's XOR critique, Hinton's popularization of backpropagation, the Universal Approximation Theorem, LeCun's introduction of CNNs, AlexNet in 2012, Goodfellow's GANs, and modern-day Transformers. Every step brought improvements in architecture, optimization, and training techniques. As complexity grew with each new discovery, so did the methods for identifying improved techniques that extend the utility of these designs across domains. In neural network training, improvements to the feedforward network can come from better weights, better gradient computation, better activations, and so on. With recent advances, the ability to optimize at every step of the process has become a practical necessity; every opportunity to optimize should be taken when training models. Before getting to the main topic, let us first establish a few definitions.

Definition 2 (The Perceptron)
The Perceptron is a feedforward neural network with depth \(T = 1\) and a specific non-linear activation \(\sigma\). Following the recursion from Definition 1, let \(T = 1\). Then:

Input Space: \(\mathcal{X} \subset \mathbb{R}^{n}\).

Linear Transformation: Since \(T = 1\), we move immediately to the final output calculation. For each \(x \in \mathcal{X}\), set \(h_0(x) := x\). The logit is computed as: \[z_1(x) := f_1(W_1, h_0(x)) + b_1\] Here, \(f_1: \mathbb{R}^{n \times n} \times \mathbb{R}^{n} \to \mathbb{R}^{n}\) is a bilinear map (here, matrix–vector multiplication), \(W_1 \in \mathbb{R}^{n \times n}\), and \(b_1 \in \mathbb{R}^{n}\).

Activation Function: The Perceptron specifically uses the Heaviside step function as its non-linearity \(\sigma: \mathbb{R}^{n} \to \mathbb{R}^{n}\): \[\sigma(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}\]

Final Output: The network function \(f: \Theta \times \mathcal{X} \to \mathcal{Y}\) is defined as: \[f(\theta, x) := \sigma(z_1(x))\]
[Figure: perceptron diagram. Inputs x₁, x₂, x₃ are weighted by w₁, w₂, w₃, summed with bias b, and passed through the step function to produce ŷ.]

Although the perceptron was revolutionary for its time, it had notable limitations: it can only separate classes that are linearly separable, and it has no hidden layers. Despite sparking huge discussions in the scientific community, it cannot compute even simple functions such as XOR. This limitation eventually led to feedforward neural networks: adding multiple layers, together with non-linear activations, solved the problem. From there came a fundamental result, groundbreaking to say the least: the Universal Approximation Theorem.

Theorem 1.2 (Universal Approximation Theorem)
Let σ be a non‑constant, bounded, and continuous activation function (e.g., sigmoid or tanh). For any continuous function f on a compact subset \(K \subset \mathbb{R}^n\) and any ε > 0, there exists a feedforward neural network with a single hidden layer, containing a finite number of neurons, such that the network function F satisfies \[ \max_{x \in K} |F(x) - f(x)| < \varepsilon. \] In other words, shallow feedforward networks are dense in the space of continuous functions over compact sets.
[Figure: visual intuition for the Universal Approximation Theorem. A single hidden layer of neurons σ(·) feeding a linear output Σ approximates a continuous target f(x) on a compact set K ⊂ ℝⁿ; as the number of hidden neurons grows, the approximation error shrinks (denseness in C(K)).]
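To make the theorem tangible, here is a minimal NumPy sketch: a single tanh hidden layer with random weights, whose linear output layer is fit by least squares to a continuous target on a compact interval. The target function, widths, and seed are illustrative choices, not prescribed by the theorem.

import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on the compact set K = [-3, 3].
X = np.linspace(-3, 3, 400)
f = np.sin(X) + 0.5 * np.cos(2 * X)

# One hidden layer with random weights; only the output layer is fitted.
# A crude but concrete instance of the theorem: with enough neurons,
# max |F(x) - f(x)| on K can be driven below any epsilon.
for width in (5, 50, 500):
    W = rng.normal(size=width)          # hidden weights (1-D input)
    b = rng.uniform(-3, 3, size=width)  # hidden biases
    H = np.tanh(np.outer(X, W) + b)     # hidden activations, shape (400, width)
    coef, *_ = np.linalg.lstsq(H, f, rcond=None)  # fit the linear output layer
    F = H @ coef
    print(f"width={width:4d}  max|F - f| = {np.abs(F - f).max():.4f}")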

Feedforward Neural Networks

Shallow feedforward networks (one hidden layer) can approximate any continuous function on a compact (closed and bounded) subset of ℝⁿ to any desired accuracy, provided they have enough neurons in that hidden layer. Given enough neurons, you can in principle model any continuous function. In practice, however, the theorem guarantees neither generalization to new data nor that such a network can be trained easily; it only establishes the expressivity of the architecture. With expressivity settled, we now move on to optimization.

Definition 1 (Feedforward Neural Network)
A feedforward neural network of depth \(T \geq 1\), with input space \(\mathcal{X} \subset \mathbb{R}^{n}\), output space \(\mathcal{Y} \subset \mathbb{R}^{K}\) (e.g., logits), and parameter space \(\Theta \subset \mathbb{R}^{p}\), is a parameterized function \(f: \Theta \times \mathcal{X} \to \mathcal{Y}\) defined by the following recursion:

Input Space: \(\mathcal{X} \subset \mathbb{R}^{n}\).

Recursion: For each \(x \in \mathcal{X}\), set \(h_0(x) := x\). For each layer \(t = 1, \dots, T+1\): \[ \begin{aligned} z_t(x) &:= f_t(W_t, h_{t-1}(x)) + b_t, \\ h_t(x) &:= \sigma(z_t(x)), \end{aligned} \] where \(f_t: \mathbb{R}^{d_t \times d_{t-1}} \times \mathbb{R}^{d_{t-1}} \to \mathbb{R}^{d_t}\) is a bilinear map (e.g., dense matrix multiplication, convolution, etc.), \(W_t \in \mathbb{R}^{d_t \times d_{t-1}}\) are the weights, and \(b_t \in \mathbb{R}^{d_t}\) the bias. The final output is taken as: \[ f(\theta, x) := z_{T+1}(x). \]

Activation Function: The nonlinearity \(\sigma: \mathbb{R}^{d_t} \to \mathbb{R}^{d_t}\) is applied element‑wise (e.g., ReLU, sigmoid, tanh) and is the same for all hidden layers \(t = 1, \dots, T\). The output layer \(t = T+1\) typically uses no activation (i.e., \(\sigma\) is the identity) unless otherwise specified.

Final Output: The network function \(f: \Theta \times \mathcal{X} \to \mathcal{Y}\) is given by: \[ f(\theta, x) = z_{T+1}(x), \] with the parameter set \(\theta = (W_t, b_t)_{1 \le t \le T+1}\).
[Figure: a feedforward network with an input layer (x₁, x₂, x₃), one hidden layer of units (Σ, σ) with weights w₁₁, w₂₂, w₃₃, and a linear output layer with weights v₁, v₂, v₃ producing ŷ.]
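As a concrete (and deliberately tiny) instance of Definition 1, the following NumPy sketch implements the recursion with ReLU hidden layers and a linear output layer. The dimensions and random parameters are arbitrary illustrations.

import numpy as np

def forward(theta, x, sigma=lambda z: np.maximum(z, 0.0)):
    """Forward pass of Definition 1: theta is a list of (W_t, b_t) pairs.

    Hidden layers t = 1..T use the element-wise nonlinearity sigma (ReLU
    here); the final layer T+1 returns the raw logits z_{T+1}(x).
    """
    h = x                                  # h_0(x) := x
    for W, b in theta[:-1]:                # layers 1..T
        h = sigma(W @ h + b)               # h_t = sigma(W_t h_{t-1} + b_t)
    W, b = theta[-1]                       # output layer T+1 (identity)
    return W @ h + b                       # f(theta, x) = z_{T+1}(x)

# Tiny example: n = 3 inputs, one hidden layer of width 4, K = 2 logits.
rng = np.random.default_rng(0)
theta = [(rng.normal(size=(4, 3)), np.zeros(4)),
         (rng.normal(size=(2, 4)), np.zeros(2))]
print(forward(theta, rng.normal(size=3)))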

ERM

ERM is the idea of picking parameters that fit the data you have, in the hope that the model also performs well on unseen data (we minimize the empirical risk because the true risk is unavailable). The problem with ERM is that it only tells us to fit the data well; it says nothing about which solution to pick when there are infinitely many ways to achieve the minimum, or about the trajectory we should take to get there (loosely put, it is adequate for classical regression but underdetermined for neural networks).

Paradigm 1.2 (Empirical Risk Minimization — Classical)
\[ \theta^* \in \arg\min_{\theta} \; \mathbb{E}_{z \sim \mathcal{D}}[L_\theta(z)] \] where \( \mathcal{D} \) is the empirical distribution of a finite dataset of size \( n \), and \( L_\theta(z) \) is a loss parametrized by \( \theta \). The goal is to find any global minimizer of the empirical risk.

In the case of overparameterized models there are infinitely many ways to reach a global minimum, but classical ERM provides no principle for selecting among them. This is why we introduce the following paradigm: real optimizers such as mini‑batch SGD run for a finite amount of time, use early stopping, and operate under finite precision, all of which introduce a bias in which minimum gets selected.

Remark
For overparameterized neural networks, the empirical risk landscape has many global minima that all achieve zero training loss. The optimizer (not just the loss) determines which minimum is found — and thus whether the model generalizes.
Empirically, this bias often favours flatter minima, thereby selecting a specific global minimum. Implicit regularization is not a replacement for ERM but a necessary refinement. Even so, several problems remain. The bias lacks a precise characterization because of its heavy dependence on step size, batch size, initialization, and similar choices, and it is highly sensitive to the algorithm and its hyperparameters: switching optimizers, or altering the learning rate or batch order, can drastically change which global minimum is selected. Numerical precision issues can also favor some solutions over others, and in certain cases lead to poor generalization. Moreover, we cannot precisely study the effect of every setting, since finding minima is itself computationally expensive. Most of these issues stem from the intrinsic algorithm-dependent bias, from the tractability of the problem, or from the heavy dependence on myriad hyperparameter choices.

Paradigm 1.3 (Implicit Regularization of Stochastic Optimization)
\[ \theta_T \sim \mathcal{A}\bigl( \theta \mapsto \nabla_\theta \mathbb{E}_{z \sim \mathcal{D}}[L_\theta(z)] \bigr) \] Here \( \mathcal{A} \) is a stochastic algorithm (e.g., SGD with mini‑batches) that runs for finite time \( T \) and returns \( \theta_T \). The optimizer’s trajectory, early stopping, and numerical precision induce an implicit bias, selecting a specific global minimum among infinitely many.
⚠ Important Note
The implicit bias depends crucially on algorithmic details: momentum, adaptive learning rates (Adam), gradient clipping, and even floating‑point precision (float16 vs float32) act as regularizers.
[Figure: the conceptual shift from ERM to implicit regularization. Classical ERM seeks any θ* ∈ arg min 𝔼[L] among infinitely many global minima that all fit the training data, leaving the role of the optimizer unexplained; implicit regularization views θ_T ∼ 𝒜(θ ↦ ∇𝔼[L]) as the product of the optimizer's dynamics (finite steps, stochastic gradients, early stopping, low precision), which bias the trajectory toward a specific, well-generalizing minimum.]

A key problem observed across all optimizers, in both weights and outputs, is how sensitive they are to slight perturbations. In the following section, we introduce some concepts and definitions that will be useful for studying this further.

Lipschitz Continuity

The Lipschitz constant lets us quantify the sensitivity of a model's output to changes in its input. If a trained model has a small Lipschitz constant, then even noticeable changes in the input space can only produce correspondingly small changes in the output space. This is particularly significant for alignment: without such control, a small misspecification in the input could yield an unfair reward advantage. Bounding the Lipschitz constant helps prevent reward hacking and specification gaming, and it also encourages the model to generalize better.

Definition 2 (Lipschitz Constant)
The function \(f: \mathbb{R}^m \to \mathbb{R}^n\) is said to be \(L\)-Lipschitz for the \(\ell_2\) norm if for every \(x, y \in \mathbb{R}^m\) we have: \[ \| f(x) - f(y) \|_2 \leq L \| x - y \|_2. \tag{1.3} \]
By Rademacher's theorem (Simon et al., 1983), an \(L\)-Lipschitz function is differentiable almost everywhere, and wherever the gradient exists it is bounded: \(\|\nabla f\| \leq L\). Reciprocally, continuous functions with gradient bounded by \(L\) are \(L\)-Lipschitz.

Another tool that will come in handy relates to the convexity of a function.
[Figure: visualizing the Lipschitz condition. Left: for any pair x, y, the output change ‖f(x)−f(y)‖ is at most L times the input change ‖x−y‖. Right: the graph of an L-Lipschitz f must stay inside the cone of slopes ±L at every point.]

Convex functions

Next, we move on to an important class of functions called convex functions. A convex function is essentially shaped like a valley, and this shape brings several properties that are particularly useful for our purposes. First, the chord test: the line segment joining any two points on the graph stays above the graph (or touches it). There are no hidden valleys: the only lowest point is the global minimum. The tangent line always lies below the curve. The function keeps its "U" shape, and stretching or shifting preserves convexity. These properties make convex functions especially tractable to optimize, which is what makes them interesting and useful in the study of neural networks.

Definition 3 (Convex Function)
The function \(f: \mathbb{R}^m \to \mathbb{R}\) is said to be convex if and only if for all \(x, y \in \mathbb{R}^m\), for all \(\lambda \in [0, 1]\), the following inequality holds: \[ f(\lambda x + (1 - \lambda)y) \leq \lambda f(x) + (1 - \lambda)f(y). \tag{1.4} \]
Furthermore, if \(f\) is differentiable, then its gradient \(\nabla_x f\) is cyclically monotone (Peyré et al., 2017). Finally, if \(f\) is twice-differentiable its Hessian \(H_x f\) is a positive semidefinite matrix.

In the following, the notation \(\|\cdot\|_2\) will denote the common Euclidean norm for vectors, i.e.: \[ \|x\|_2 = \sqrt{\sum_i x_i^2}, \tag{1.5} \] whereas for linear operators \(x \mapsto W x\) it will denote the spectral norm: \[ \|W\|_2 = \max_{\|x\|_2 \leq 1} \|W x\|_2 = \sigma_{\max}, \tag{1.6} \] where \(\sigma_{\max}\) is the largest singular value in the SVD of \(W\).
[Figure: a convex function. The chord between any two points (x, f(x)) and (y, f(y)) lies above the graph, so f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y); if f is differentiable, ∇f is cyclically monotone, and if twice-differentiable, its Hessian is positive semidefinite.]
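A quick NumPy sanity check of equation (1.6): the spectral norm returned by the SVD matches the variational definition over the unit ball, approximated here by random sampling. The matrix and sample count are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

# Spectral norm = largest singular value of W (equation 1.6).
sigma_max = np.linalg.svd(W, compute_uv=False)[0]
print("sigma_max:", sigma_max)

# Check against the variational definition max_{||x|| <= 1} ||W x|| by
# sampling random unit vectors: a lower estimate that approaches
# sigma_max as the number of samples grows.
xs = rng.normal(size=(3, 100_000))
xs /= np.linalg.norm(xs, axis=0)
print("sampled max ||Wx||:", np.linalg.norm(W @ xs, axis=0).max())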

The problem of optimal transport

The theory of optimal transport, pioneered by Gaspard Monge in 1781 and revolutionized by Leonid Kantorovich in the 1940s, reveals that finding the most economical way to transport mass from one distribution to another is intimately linked to the geometry of distance metrics and the analytic structure of convex functions. The optimal transport problem begins with a simple, intuitive premise: given two probability distributions \(P\) and \(Q\) over a space \(\mathbb{R}^m\), what is the minimal expected cost required to reshape \(P\) into \(Q\)? It thus provides a mathematically rigorous way to compare probability distributions while respecting the underlying geometry of the space.

Definition 4 (Wasserstein‑p Distance)
Let \(d: \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}\) be a metric. For any two measures \(P\) and \(Q\) on \(\mathbb{R}^m\) the Wasserstein‑p distance is defined by the following optimization problem: \[ W_p^p(P, Q) := \inf_{\pi \in \Pi(P, Q)} \int_{\mathbb{R}^m \times \mathbb{R}^m} \bigl(d(x, y)\bigr)^p \, d\pi(x, y) \tag{1.7} \] where \(\Pi(P, Q)\) denotes the set of measures on \(\mathbb{R}^m \times \mathbb{R}^m\) whose marginals are \(P\) and \(Q\) respectively. Equivalently, for \(p = 1\), we can write: \[ W_1(P, Q) := \inf_{\substack{\text{Law}(X)=P \\ \text{Law}(Y)=Q}} \mathbb{E}[d(X, Y)]. \tag{1.8} \]
The right‑hand side of (1.7) corresponds to a Kantorovich problem. It is often presented as a relaxation of the Monge problem: \[ \inf_{T \# P = Q} \int_{\mathbb{R}^m} d(x, T(x)) \, dP(x). \tag{1.9} \] In this context, the (optimal) \(T: \mathbb{R}^m \to \mathbb{R}^m\) is called the (optimal) Monge map. For arbitrary measures, the two problems are different. The Kantorovich problem always admits a solution. This is not the case with the Monge problem. For example, for discrete measures with supports of size \(|P| < |Q|\), it is impossible to fulfill the pushforward condition \(T\#P = Q\) since \(|T\#P| \le |P| < |Q|\). Nonetheless, when both \(P\) and \(Q\) admit a density (w.r.t. Lebesgue measure) then the two problems happen to be equivalent. In this case, the Monge problem corresponds to a deterministic coupling \(\pi(x, \cdot) = \delta_{T(x)}\).

By Remark 6.3 of Villani (2008) the Wasserstein‑1 distance is the Kantorovich‑Rubinstein distance: \[ W_1(P, Q) = \sup_{f \in \text{Lip}_1(X, \mathbb{R})} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{z \sim Q}[f(z)]. \tag{1.10} \]
In our case we work with neural networks that are Lipschitz w.r.t. the \(\ell_2\) distance, so we take \(d(x, y) := \|x - y\|_2\). This connection is one of the first motivations for the design of Lipschitz networks, and a strong inspiration for most of the work throughout the thesis, notably chapters...
[Figure: the Wasserstein‑1 distance as optimal transport, the minimal cost of moving probability mass from a source P to a target Q. The Kantorovich problem (relaxed) always has a solution; when P and Q have densities it is equivalent to the Monge problem with a deterministic transport map T. By Kantorovich–Rubinstein duality, \(W_1(P,Q) = \sup_{\|f\|_{\text{Lip}}\le 1} \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f]\).]
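For one-dimensional distributions, the Wasserstein‑1 distance of Definition 4 can be computed directly. The sketch below uses SciPy's wasserstein_distance (assuming SciPy is available) on samples from two shifted Gaussians, where the true W₁ equals the shift.

import numpy as np
from scipy.stats import wasserstein_distance  # 1-D W_1

rng = np.random.default_rng(0)
p_samples = rng.normal(loc=0.0, scale=1.0, size=10_000)  # P = N(0, 1)
q_samples = rng.normal(loc=2.0, scale=1.0, size=10_000)  # Q = N(2, 1)

# For two Gaussians differing only by a shift, W_1(P, Q) equals the
# shift |mu_P - mu_Q| = 2, so the empirical value should be close to 2.
print("empirical W_1:", wasserstein_distance(p_samples, q_samples))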
Definition 5 (Dual Formulations of Optimal Transport)
We can also mention a useful duality result related to Wasserstein‑2 distance (see Peyré et al. (2017) or Korotin et al. (2021) for example): \[ W_2^2(P, Q) = \sup_{f \in \text{Convex}(X, \mathbb{R})} \mathbb{E}_{x \sim P}[f(x)] + \mathbb{E}_{z \sim Q}[f^c(z)], \tag{1.11} \] where \(f^c\) is the \(c\)-transform defined as: \[ f^c(x) := \min_{y} \; c(x, y) - f(y) \tag{1.12} \] for \(c(x, y) = \|x - y\|\). This result will not be used as‑is, but it shows the connections between optimization over convex functions on one side and optimal transport on the other side. This is explored further in chapter 7.

Binary cross-entropy loss

Binary Cross‑Entropy (BCE) Loss – the confidence‑based scorer. Imagine you are teaching a child to tell cats from dogs. You show a picture and ask: "How sure are you that this is a cat?" The child answers with a number between "definitely a dog" and "definitely a cat." The BCE loss measures how surprised you should be by that answer.

In machine learning, a binary classifier usually produces a raw score (called a logit) and then squeezes it through a soft, S‑shaped curve (the logistic function) to turn it into a probability between 0 and 1. The BCE loss then looks at the true label (say, 1 for cat, 0 for dog) and the predicted probability. If the predicted probability matches the true label very well (e.g., 0.99 for "cat" when the picture really is a cat), the loss is tiny. If the prediction is wrong (e.g., 0.01 for "cat" when it is actually a cat), the loss becomes enormous: it grows faster and faster as the mistake gets worse.

There is an extra knob called temperature (often written as τ). Turning up the temperature makes the network more confident: even a small positive score gets pushed close to 0 or 1. Turning it down makes the predictions softer, more hesitant. Why does that matter? If the network is already constrained to change gently (a Lipschitz network), the temperature becomes a powerful tool to control how "sharp" or "smooth" the final decisions are. A lower temperature often improves generalization: the network does not scream "I am 100% sure" on every training example, so it stays more flexible on new data.

BCE is loved because it gives smooth, well‑behaved gradients (the directions in which to adjust the network) and naturally produces probabilities that can be interpreted as confidence. It is the standard choice for binary classification in almost all modern deep learning frameworks.

Hinge loss

Now think of a different teacher. Instead of asking "how sure are you?", they ask: "Is your decision separated by a clear border?" The Hinge loss is the brain behind Support Vector Machines, but it works equally well in neural networks.

The Hinge loss cares about a margin: a buffer zone around the decision boundary. For a given training example, the network produces a raw score (positive for one class, negative for the other). The loss checks whether that score is far enough away from zero in the correct direction. If the score is, say, +5 for a cat picture and the margin is set to 1, then the loss is zero: the example is safe and well on the correct side. If the score is +0.5, still correctly positive but inside the margin, the loss is a positive number proportional to how much we still need to push it outward. If the sign is wrong (negative for a cat), the loss is larger still.

The size of the margin (commonly written as m) is a hyper‑parameter. A larger margin demands that examples be farther from the decision boundary. That is great for robustness: if a small perturbation to the input cannot move the score across zero because you have a thick margin, the classifier is inherently more stable. Lipschitz networks pair beautifully with Hinge loss because the Lipschitz constant tells you exactly how far a small input change can move the output: the margin divided by the Lipschitz constant gives a guaranteed radius of safety, with no misclassification within that radius.

Unlike BCE, Hinge loss is piecewise linear. It is completely flat for examples that already satisfy the margin, so such examples stop influencing training. That creates a sparse set of "support vectors" (only the violators matter), which can be efficient but sometimes leads to slower learning because gradient signals become sparse.

Definition 6 (Binary Cross‑Entropy and Hinge Losses)
We detail below the losses (and activations) that will be used in this work.

The Binary Cross‑Entropy (BCE) loss (also called log loss) is among the most popular choices of loss within the deep learning community. Let \(f: \mathbb{R}^n \to \mathbb{R}\) be a neural network. For an example \(x \in \mathbb{R}^n\) with label \(y \in \mathcal{Y}\), and \(\sigma(x) = \frac{1}{1 + \exp(-x)}\) the logistic function mapping logits to probabilities, the BCE is written: \[ L_{\tau}^{\text{bce}}(f(x), y) = -\log \sigma\bigl(y \tau f(x)\bigr), \] with (inverse) temperature / scaling parameter \(\tau > 0\). This hyper‑parameter of the loss defaults to \(\tau = 1\) in most frameworks such as TensorFlow or PyTorch. Note that \(L_{\tau}^{\text{bce}}(f(x), y) = L_{1}^{\text{bce}}(\tau f(x), y)\), so we can equivalently tune \(\tau\) or the Lipschitz constant \(L\). We show in Chapter 4 that for LipNet the temperature \(\tau\) allows us to control the generalization gap.

The Hinge loss \[ L_{m}^{\text{H}}(f(x), y) = \max\bigl(0,\; m - y f(x)\bigr) \] with margin \(m > 0\) is also of particular interest.
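A minimal NumPy sketch of both losses as just defined, with labels y ∈ {−1, +1}; the sample logits are arbitrary. The BCE is written via logaddexp for numerical stability.

import numpy as np

def bce_loss(logit, y, tau=1.0):
    """L_tau^bce(f(x), y) = -log sigmoid(y * tau * f(x)), labels y in {-1, +1}.

    Numerically stable form: -log sigmoid(z) = log(1 + exp(-z)).
    """
    return np.logaddexp(0.0, -y * tau * logit)

def hinge_loss(logit, y, m=1.0):
    """L_m^H(f(x), y) = max(0, m - y * f(x)) with margin m > 0."""
    return np.maximum(0.0, m - y * logit)

logits = np.array([3.0, 0.5, -2.0])
labels = np.array([1, 1, 1])
print("BCE  :", bce_loss(logits, labels))    # small, moderate, large
print("hinge:", hinge_loss(logits, labels))  # 0 (outside margin), 0.5, 3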

Applying the Lipschitz constant to classification settings

Definition (Lipschitz Constant and Classifier)
Let \(\mathcal{X} \subseteq \mathbb{R}^d\) be the input space. For a function \(f: \mathcal{X} \to \mathbb{R}\) (binary case) or \(f: \mathcal{X} \to \mathbb{R}^K\) (multiclass case), define the Lipschitz constant \[ L = \sup_{x \neq y} \frac{\|f(x) - f(y)\|_2}{\|x - y\|_2}; \] the same definition covers the multiclass case, with \(\|\cdot\|_2\) taken on the output vectors.

A classifier is defined as:
  • Binary: \(h(x) = \operatorname{sign}(f(x))\)
  • Multiclass: \(\hat{k}(x) = \arg\max_{k \in \{1,\dots,K\}} f_k(x)\)
The robustness radius at an input \(x\) (ignoring ground truth) is \[ r(x) = \inf_{\delta: \hat{k}(x+\delta) \neq \hat{k}(x)} \|\delta\|_2. \]

Binary case: The network produces a single number. If that number is positive, the classifier says “class A”; if negative, it says “class B”. The magnitude of the number (how far away from zero it is) is like a confidence or margin. Multiclass case: The network produces a list of numbers, one for each class (e.g., cat, dog, bird). The classifier picks the class with the highest number. The gap between the highest number and the second highest is the margin – a measure of how clearly the network prefers one class over the others.

Note. The robustness radius at an input is the smallest distance (measured in the usual Euclidean sense) by which you must change that input for the classifier's answer to flip. If the radius is large, the classifier is robust: the input must change a lot before the answer changes. If the radius is very small, the classifier is fragile: a tiny, imperceptible perturbation can fool it.
Property 1 (Binary Local Robustness Certificate)
If \(f \in \operatorname{Lip}_L(\mathcal{X}, \mathbb{R})\), then for any \(x \in \mathcal{X}\), \[ r(x) \;\ge\; \frac{|f(x)|}{L}. \] That is, any perturbation \(\delta\) with \(\|\delta\|_2 < \frac{|f(x)|}{L}\) cannot change the sign of \(f\).
Property 2 (Multiclass Local Robustness Certificate)
If \(f \in \operatorname{Lip}_L(\mathcal{X}, \mathbb{R}^K)\) with respect to the Euclidean norm on both input and output spaces, then for any \(x \in \mathcal{X}\) with predicted class \(\hat{k} = \arg\max_k f_k(x)\), \[ r(x) \;\ge\; \frac{f_{\hat{k}}(x) \;-\; \max_{i \neq \hat{k}} f_i(x)}{\sqrt{2}\, L}. \] Define the multiclass margin at \(x\): \[ m(x) = f_{\hat{k}}(x) - \max_{i \neq \hat{k}} f_i(x) \qquad (>0 \text{ for correctly classified points}). \] Then \[ r(x) \ge \frac{m(x)}{\sqrt{2}\, L}. \] If \(x\) is misclassified (\(m(x) < 0\)), the certificate gives a negative lower bound; in practice the robustness radius is taken as \(0\) for misclassified examples.
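Both certificates are one-liners once the logits and a Lipschitz bound L are known. The following NumPy sketch, with made-up logits and L, evaluates Properties 1 and 2.

import numpy as np

def binary_radius(fx, L):
    """Property 1: r(x) >= |f(x)| / L for an L-Lipschitz binary classifier."""
    return abs(fx) / L

def multiclass_radius(logits, L):
    """Property 2: r(x) >= (top logit - runner-up) / (sqrt(2) * L),
    clamped to 0 so misclassified points get a zero certificate."""
    ordered = np.sort(logits)
    margin = ordered[-1] - ordered[-2]
    return max(margin, 0.0) / (np.sqrt(2) * L)

print(binary_radius(1.5, L=2.0))                            # 0.75
print(multiclass_radius(np.array([4.0, 1.0, 3.2]), L=2.0))  # 0.8 / (2 sqrt(2))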

Lipschitz-constrained neural networks

In a Lipschitz-constrained neural network, each individual layer (whether an affine layer or an activation) has its own local Lipschitz constant: a number giving the maximum amplification of distances that the layer can produce. If you move the layer's input by one unit, its output can move at most by that layer's Lipschitz constant. For activation functions like ReLU or sigmoid, this constant is at most 1 (they do not stretch distances). For an affine layer (multiplying by a weight matrix and adding a bias), the Lipschitz constant equals the largest singular value of the weight matrix: essentially the maximum factor by which the layer can stretch any direction.

Definition 6 (Lipschitz constrained feed-forward neural network)
Let \(\mathcal{X} \subset \mathbb{R}^n\) be the input space, \(\mathcal{Y} \subset \mathbb{R}^K\) the output space, and \(\Theta \subset \mathbb{R}^p\) the parameter space. A feed-forward neural network of depth \(D \in \mathbb{N}\) is a parameterized function \(f: \Theta \times \mathcal{X} \to \mathcal{Y}\) defined by the sequential composition of layers \(f_1, \dots, f_D\): \[ f(\theta, x) = \bigl( f_D(\theta_D) \circ f_{D-1}(\theta_{D-1}) \circ \cdots \circ f_1(\theta_1) \bigr)(x), \] where \(\theta = (\theta_1, \dots, \theta_D) \in \Theta\) are the layer parameters. For an affine layer, \(\theta_d = (W_d, b_d)\) consists of a weight matrix and bias vector; for an activation function, \(\theta_d = \emptyset\) (no parameters).

A Lipschitz-constrained neural network is a feed-forward network such that for each layer \(d \in \{1,\dots,D\}\), the mapping \(x_d \mapsto f_d(\theta_d, x_d)\) is \(L_d\)-Lipschitz for all \(\theta_d\), i.e., \[ \forall x_d, y_d \in \mathbb{R}^{n_d}, \quad \| f_d(\theta_d, x_d) - f_d(\theta_d, y_d) \|_2 \le L_d \| x_d - y_d \|_2. \] Then the overall function \(x \mapsto f(\theta, x)\) is \(L\)-Lipschitz with \[ L \le \prod_{d=1}^D L_d \qquad \forall \theta \in \Theta. \] Let \(\text{LipNet}_1\) denote the class of Lipschitz-constrained networks for which the global Lipschitz constant satisfies \(L = 1\). For simplicity, all networks in this manuscript are considered \(L\)-Lipschitz with respect to the Euclidean \(\ell_2\)-norm.

Lower bounds

Definition 7 (Empirical lower bound for Lipschitz constant)
Let \(f: \mathbb{R}^m \to \mathbb{R}^n\) be a neural network and let \(S \subset \mathbb{R}^m\) be a finite set of sample points. The true (global) Lipschitz constant is \[ L = \sup_{x \neq y \in \mathbb{R}^m} \frac{\|f(x)-f(y)\|_2}{\|x-y\|_2}. \] Any estimate computed exclusively from \(S\) yields a lower bound \(\widehat{L}_{\text{emp}} \le L\). Three common empirical proxies are:
  • Maximum gradient norm (for differentiable \(f\)): \[ \widehat{L}_{\nabla} = \max_{x \in S} \|\nabla_x f(x)\|_2. \]
  • Maximum pairwise finite difference: \[ \widehat{L}_{\text{pair}} = \max_{\substack{x,y \in S \\ x \neq y}} \frac{\|f(x)-f(y)\|_2}{\|x-y\|_2}. \]
  • Maximum over adversarial perturbations: \[ \widehat{L}_{\text{adv}} = \max_{x \in S} \frac{\|f(x)-f(x+\delta_x)\|_2}{\|\delta_x\|_2}, \] where \(\delta_x\) is an adversarial attack (e.g., projected gradient descent) maximizing \(\|f(x+\delta)-f(x)\|_2\) subject to \(\|\delta\|_2 \le \varepsilon\).
Because the supremum is taken over the whole domain, any such estimate from a finite subset is a lower bound: \[ \widehat{L}_{\text{emp}} \le L. \] These estimates are not suitable for certified robustness but can provide insights into training dynamics and generalization.
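As an illustration of the pairwise proxy \(\widehat{L}_{\text{pair}}\), here is a NumPy sketch on a tiny random two-layer ReLU network; the architecture and sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(16, 4)), rng.normal(size=(1, 16))
f = lambda x: W2 @ np.maximum(W1 @ x, 0.0)   # tiny 2-layer ReLU net

# Maximum pairwise finite-difference quotient over a finite sample S;
# by the argument above, this is always a lower bound on the true L.
S = rng.normal(size=(200, 4))
ratios = [np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
          for i, x in enumerate(S) for y in S[i + 1:]]
print("empirical lower bound L_pair:", max(ratios))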

Formal methods for Upper bounds

  • AutoLip (upper bound · global): layer‑wise composition bound; the product of per‑layer spectral norms gives a fast global Lipschitz estimate.
  • Linear Relaxation (local bound · linear): Decomon / auto‑LiRPA; local Lipschitz bounds via convex over‑approximations of activations.
  • LipSDP (SDP · global): semidefinite programming relaxation; tight global Lipschitz bounds for many activations.
  • Polyhedron Propagation (ReLU · exact): forward propagation of convex polyhedra; maximum spectral norm over activation patterns.
  • LiPopt (DC programming): difference‑of‑convex programming; sequential convex optimization for tight upper bounds.



Next we move on to discussing how to enforce Lipschitz bounds in practice.

Enforcing Lipschitz Bounds


Let \(f: \mathbb{R}^m \to \mathbb{R}^n\) be a neural network. To enforce a global Lipschitz bound in practice, we use activations with Lipschitz constant \(l_d\) and apply a constraint \(\Pi : \mathbb{R}^p \to \Theta\) on the weights of affine layers. For an affine layer \(f_d(z) = W_d z + b_d\), the minimal Lipschitz constant is the spectral norm \[ l_d = \|W_d\|_2 := \max_{\|x\|_2 \le 1} \|W_d x\|_2, \] where \(\|x\|_2\) and \(\|W_d x\|_2\) are Euclidean (vector) norms. The constraint set is \[ \Theta = \{\, W_d \mid \|W_d\|_2 \le l_d \,\}. \]

The seminal work of Anil et al. (2019) proved that universal approximation in the set of \(L\)-Lipschitz functions is achievable with certain weight constraints and the GroupSort activation function. GroupSort operates on pairs of consecutive neurons, not elementwise. For all coordinates \(i\) (with zero‑based indexing), it is defined as: \[ \operatorname{GroupSort}_2(x)_{2i,\,2i+1} = \bigl[ \min(x_{2i}, x_{2i+1}),\; \max(x_{2i}, x_{2i+1}) \bigr]. \]
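A direct NumPy transcription of GroupSort₂ as defined above; since each pair is merely permuted, the activation preserves norms (and is thus 1-Lipschitz). The test vector is arbitrary.

import numpy as np

def groupsort2(x):
    """GroupSort with group size 2: sort each pair (x_{2i}, x_{2i+1}).

    Each pair is only permuted, so the map is norm-preserving and
    1-Lipschitz; it is the activation used by Anil et al. (2019).
    Assumes len(x) is even.
    """
    pairs = x.reshape(-1, 2)
    return np.sort(pairs, axis=1).reshape(x.shape)

x = np.array([3.0, -1.0, 0.5, 2.0])
print(groupsort2(x))                              # [-1.  3.  0.5 2. ]
# Norm preservation: ||GroupSort(x)|| == ||x||.
print(np.linalg.norm(groupsort2(x)), np.linalg.norm(x))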

The theorem of Anil et al. (2019) bounds the “two‑to‑infinity” norm \(\|\cdot\|_{2\to\infty}\) of the first layer and the \(\|\cdot\|_\infty\) norms of further layers to obtain universal approximation in \(\operatorname{Lip}_1(\mathcal{X}, \mathbb{R})\). These networks are \(1\)-Lipschitz by construction. Because \(\operatorname{Lip}_L(\mathcal{X}, \mathbb{R}^K) = \{ L f \mid f \in \operatorname{Lip}_1(\mathcal{X}, \mathbb{R}^K) \}\), the same architecture can approximate any \(L\)-Lipschitz function after scaling the output by \(L\).

Example (Approximating \(|x|\) on \([-1,1]\)). Consider the target \(f^*(x)=|x|\). Construct a two‑layer network:
  • First affine layer: \(W_1 = [0.5,\,-0.5,\,0.5,\,-0.5]^{\mathsf{T}}\), \(b_1=0\). Then \(\|W_1\|_2 = 1\) and \(\|W_1\|_{2\to\infty}\) is bounded.
  • GroupSort on two pairs yields \(h = [-0.5|x|,\,0.5|x|,\,-0.5|x|,\,0.5|x|]^{\mathsf{T}}\).
  • Second layer: \(W_2 = [0,1,0,0]\), \(b_2=0\). Its infinity norm is \(\|W_2\|_\infty = 1\).
The network output is \(f(x)=0.5|x|\), which is 1‑Lipschitz and approximates \(f^*\) with factor \(0.5\). By using more neurons and deeper layers (e.g., residual refinement or parallel blocks with a final linear combination respecting \(\|W\|_\infty\le 1\)), the factor can be made arbitrarily close to 1, demonstrating universal approximation.
Output after \(k\) refinement steps: \(f_k(x) = \bigl(1 - 2^{-k}\bigr)|x|\) achieving \(\sup|f_k - f^*| = 2^{-k}\).

Lipschitz Constraint Methods

Method | Type | Guarantee | Key Property
Orthogonal weights (GNP) | Hard constraint | \(\|W\|_2 = 1\) | Gradient norm preserving
Spectral normalization | Hard constraint | \(\|W\|_2 = 1\) | Power iteration (lower bound)
Differentiable reparametrization | Hard constraint | \(\|W\|_2 \le L_d\) | Unconstrained optimization
Projected gradient descent | Hard constraint | \(\|W\|_2 \le L_d\) | Projection onto \(\Theta\)
Weight clipping | Hard constraint | \(\|W\|_2 \le c\sqrt{\min(n,m)}\) | Coefficient‑wise projection onto \([-c,c]\)
Frobenius normalization | Hard constraint | \(\|W\|_2 \le 1\) | Projection \(W \leftarrow W / \|W\|_F\)
Direct parametrization (dynamical) | Hard constraint | Whole block 1‑Lipschitz | \(z = x - \frac{2}{\|W\|_2^2} W^{\mathsf{T}}\sigma(Wx+b)\)
Sandwich layer (QP formulation) | Hard constraint | \(\| \text{block} \|_2 \le 1\) | \(x \mapsto W_2^{\mathsf{T}} \sigma(W_1 x + b)\)
Gradient penalty | Soft penalty | None (crude upper bound) | Nested differentiation
Spectral regularization | Soft penalty | None | Penalty on \(\|W_i\|_2^2\)
Orthogonal weights & Gradient Norm Preserving (GNP) networks

A weight matrix \(W \in \mathbb{R}^{n \times m}\) is orthogonal if \(W^{\mathsf{T}}W = I\) (when \(n \ge m\)) or \(WW^{\mathsf{T}} = I\) (when \(m \ge n\)). Then \(\|W\|_2 = 1\) and \(\|Wx\|_2 = \|x\|_2\) for all \(x\).

When all layers (including 1‑Lipschitz activations like GroupSort) have orthogonal Jacobians almost everywhere, the whole network satisfies

\[ \|\nabla_x f(x)\|_2 = 1 \quad \text{for almost every } x. \]

Such networks are called Gradient Norm Preserving (GNP) or Eikonal networks. They avoid vanishing/exploding gradients because the gradient norm is exactly 1. Examples: orthogonal weight matrices, Householder activations, GroupSort, orthogonal convolutions.

Open problem. Universal approximation with purely orthogonal networks (all weights orthogonal) is not yet proven.
Spectral normalization (Miyato et al., 2018)

Project each weight matrix onto the set \(\Theta = \{ W \mid \|W\|_2 = 1 \}\) via

\[ \Pi_{\text{spec}}(W) = \frac{W}{\|W\|_2}. \]

The spectral norm \(\|W\|_2\) is the largest singular value. It is estimated using power iteration:

\[ v^{(k)} = \frac{W^{\mathsf{T}} u^{(k-1)}}{\|W^{\mathsf{T}} u^{(k-1)}\|_2}, \quad u^{(k)} = \frac{W v^{(k)}}{\|W v^{(k)}\|_2}, \quad \sigma_{\max} \approx (u^{(K)})^{\mathsf{T}} W v^{(K)}. \]

Power iteration converges from below, giving a lower estimate – problematic for certified upper bounds. Recent work (Delattre et al., 2023b) provides upper bounds using only matrix‑vector products.
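The power iteration above translates almost line for line into NumPy; comparing against the exact SVD shows the estimate approaching σ_max from below. The step count and test matrix are arbitrary.

import numpy as np

def power_iteration_sigma_max(W, n_steps=50, seed=0):
    """Estimate the largest singular value of W by power iteration
    (a lower estimate that converges from below, as noted above)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_steps):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v        # sigma_max ~ u^T W v

W = np.random.default_rng(1).normal(size=(64, 32))
print(power_iteration_sigma_max(W))
print(np.linalg.svd(W, compute_uv=False)[0])   # exact value for comparison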

Differentiable reparametrization (trivialization)

Let \(\theta \in \mathbb{R}^p\) be unconstrained parameters. Define \(\widetilde{W} = \Pi(\theta) \in \Theta\) where \(\Pi\) is a differentiable map onto the feasible set (e.g., spectral normalisation). The forward pass uses \(\widetilde{W}\). Gradients are backpropagated through \(\Pi\) to \(\theta\):

\[ \nabla_\theta \mathcal{L} = \left( \frac{\partial \mathcal{L}}{\partial \widetilde{W}} \right)^{\mathsf{T}} \cdot \frac{\partial \widetilde{W}}{\partial \theta}. \]

This turns training into an unconstrained optimisation problem on \(\mathcal{L} \circ f \circ \Pi\). Examples: Miyato et al. (2018), Anil et al. (2019).

Projected gradient descent (PGD)

Take a gradient step then project onto the feasible set \(\Theta = \{ W \mid \|W\|_2 \le L_d \}\):

\[ W_{t+1/2} = W_t - \eta \nabla_W \mathcal{L}(f_{W_t}), \qquad W_{t+1} = \Pi_\Theta(W_{t+1/2}). \]

The projection \(\Pi_\Theta\) for spectral norm constraints is done by clipping singular values to \(L_d\). This is simple but may require many iterations. Early example: weight clipping (Arjovsky et al., 2017) which clips each coefficient to \([-c,c]\).

⚠️
Weight clipping yields a very loose bound (\(\|W\|_2 \le c\sqrt{\min(n,m)}\)) and may cause vanishing gradients.
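For the spectral-norm ball, the projection Π_Θ mentioned above can be written explicitly by clipping singular values; a NumPy sketch with an arbitrary test matrix:

import numpy as np

def project_spectral(W, L_d=1.0):
    """Projection onto {W : ||W||_2 <= L_d} by clipping singular values.

    This is the Euclidean (Frobenius) projection onto the spectral-norm
    ball, used as Pi_Theta in the PGD update above.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(s, L_d)) @ Vt

W = np.random.default_rng(0).normal(size=(8, 8)) * 3.0
print(np.linalg.svd(project_spectral(W), compute_uv=False)[0])  # <= 1.0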
Weight clipping (Arjovsky et al., 2017)

Project each weight coefficient onto the interval \([-c, c]\):

\[ \Pi_{\text{clip}}(W)_{ij} = \max(-c, \min(c, W_{ij})). \]

Because all norms are equivalent in finite dimension, this guarantees

\[ \|W\|_2 \le c \sqrt{\min(n,m)}. \]
⚠️
The bound is very loose; other singular values may collapse to zero, causing vanishing gradients when stacking layers.
Frobenius normalization (Salimans & Kingma, 2016)

Project onto the Frobenius sphere:

\[ \Pi_{\text{F}}(W) = \frac{W}{\|W\|_F}, \quad \|W\|_F = \sqrt{\sum_{i,j} W_{ij}^2}. \]

Since \(\|W\|_2 \le \|W\|_F\), we obtain \(\|\Pi_{\text{F}}(W)\|_2 \le 1\). This is tighter than weight clipping but still not optimal – a matrix with \(\|W\|_F = 1\) can have \(\|W\|_2 \ll 1\).

Direct parametrization – dynamical systems (Meunier et al., 2022b; Araujo et al., 2022)

Interpret the forward pass as a discretisation of a continuous dynamical system:

\[ \frac{\partial x_t}{\partial t} = -\nabla f_t(x_t) + A_t x_t, \] with \(f_t\) convex and \(A_t\) skew‑symmetric. This guarantees \(\|x_t - z_t\|_2\) is non‑increasing. Using the mid‑point Euler discretisation and a suitable parametrisation of \(f_t\), the update becomes

\[ z = x - \frac{2}{\|W\|_2^2} \, W^{\mathsf{T}} \sigma(W x + b), \] where \(\sigma\) is 1‑Lipschitz. The whole block is 1‑Lipschitz by construction. Special cases include Cayley convolutions and Householder activations.
Open problem. Universal approximation results for these architectures are still lacking.
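A NumPy sketch of the mid-point-Euler block z = x − (2/‖W‖₂²) Wᵀσ(Wx + b) with σ = ReLU; the random weights are illustrative. The final print checks the defining non-expansiveness property on a random pair of inputs.

import numpy as np

def lipschitz_residual_block(x, W, b):
    """One 1-Lipschitz block: z = x - (2 / ||W||_2^2) W^T sigma(W x + b),
    with sigma = ReLU (1-Lipschitz); 1-Lipschitz by construction."""
    sigma_max = np.linalg.svd(W, compute_uv=False)[0]
    return x - (2.0 / sigma_max**2) * W.T @ np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(8, 4)), rng.normal(size=8)
x, y = rng.normal(size=4), rng.normal(size=4)
fx, fy = lipschitz_residual_block(x, W, b), lipschitz_residual_block(y, W, b)
# Distances never expand: ||f(x) - f(y)|| <= ||x - y||.
print(np.linalg.norm(fx - fy), "<=", np.linalg.norm(x - y))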
Sandwich layer – quadratic program formulation (Wang & Manchester, 2023)

The Lipschitz constant of a linear operator satisfies

\[ \|W\|_2^2 = \max_{\|x\|_2=1} x^{\mathsf{T}} W^{\mathsf{T}} W x. \]

By controlling the coefficients and structure of this quadratic program, one can enforce an explicit upper bound on the Lipschitz constant. For a clever choice of structure, the implementation takes the form of a “sandwich” layer:

\[ x \mapsto W_2^{\mathsf{T}} \, \sigma(W_1 x + b), \] with constraints on \(W_1, W_2\) (e.g., spectral norm bounds) that guarantee the whole composition is 1‑Lipschitz. This often yields tighter bounds than products of individual spectral norms.

Gradient penalty (Gulrajani et al., 2017)

A soft regularization term that encourages a 1‑Lipschitz constraint:

\[ R_{\text{GP}} = \bigl( \|\nabla_x f(x)\|_2 - 1 \bigr)^2. \]

Disadvantages:

  • No formal guarantee – the Lipschitz constant may still blow up.
  • Requires nested differentiation (gradient of gradient), computationally expensive.

Used in WGAN-GP and Lipschitz GANs (Zhou et al., 2019).
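A sketch of the penalty using PyTorch autograd (assumed available), since the nested differentiation mentioned above requires create_graph=True; the small network and batch are placeholders, not the WGAN-GP setup itself.

import torch

def gradient_penalty(f, x):
    """R_GP = (||grad_x f(x)||_2 - 1)^2, averaged over the batch.

    create_graph=True keeps the graph so the penalty itself can be
    differentiated: the "nested differentiation" cost noted above.
    """
    x = x.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(f(x).sum(), x, create_graph=True)
    return ((grad.flatten(1).norm(dim=1) - 1.0) ** 2).mean()

f = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(),
                        torch.nn.Linear(16, 1))
x = torch.randn(32, 4)
loss = gradient_penalty(f, x)
loss.backward()   # penalty gradients flow into f's parameters
print(loss.item())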

Spectral regularization (Yoshida & Miyato, 2017)

Adds a penalty on the squared spectral norm of each weight matrix:

\[ R_{\text{spec}} = \lambda \sum_{i=1}^D \|W_i\|_2^2. \]

Unlike gradient penalty, it operates directly on the weights. However, it only encourages small Lipschitz constants without a hard guarantee – the optimizer may trade off between the loss and the penalty, potentially allowing a large local Lipschitz constant.

Note. Regularization approaches in general do not give formal guarantees, only a crude upper bound. For certified robustness, use hard constraints (spectral normalisation, orthogonality, direct parametrizations).

One effective way to enforce structural constraints—such as orthogonality, low rank, or unit norm—on weight vectors is to map them onto a manifold. By definition, a manifold is a space that locally resembles Euclidean space but may have a more complex global geometry; constraining weights to lie on a specific manifold guarantees that they satisfy the desired properties at every step of training. To implement this idea in practice, we must first give a precise mathematical definition of a manifold and then choose an appropriate distance metric that measures how far two points are on that manifold, as well as an optimization method that respects the manifold’s geometry. Different manifolds call for different metrics and optimizers; we now formalize these concepts and highlight the most relevant combinations for typical machine learning tasks.

Mathematical definition of a manifold

Definition 1 (Topological Manifold)
Let \(M\) be a topological space. \(M\) is called a topological manifold of dimension \(n\) (denoted \(n\)-manifold) if it satisfies the following three axioms:
  1. Locally Euclidean of dimension \(n\): For every point \(p \in M\), there exists an open neighbourhood \(U \subseteq M\) of \(p\), an open set \(\widetilde{U} \subseteq \mathbb{R}^n\), and a homeomorphism \(\varphi: U \to \widetilde{U}\). The pair \((U, \varphi)\) is called a chart (or coordinate chart) around \(p\). The homeomorphism \(\varphi\) is often written as \(\varphi(q) = (x^1(q), \dots, x^n(q))\), where \(x^i(q)\) are the local coordinates of \(q\).
  2. Hausdorff property: For any two distinct points \(p, q \in M\), there exist disjoint open sets \(U_p, U_q \subseteq M\) such that \(p \in U_p\) and \(q \in U_q\).
  3. Second-countability: The topology of \(M\) has a countable base. That is, there exists a countable collection \(\{B_i\}_{i \in \mathbb{N}}\) of open sets such that every open set in \(M\) can be expressed as a union of some of the \(B_i\).
The integer \(n\) is the dimension of the manifold. If the dimension is not required to be constant, one speaks of a manifold of possibly variable dimension, but typically each connected component has a fixed dimension.
Definition 2 (Atlas and Compatibility of Charts)
Let \(M\) be a topological \(n\)-manifold.
  • An atlas on \(M\) is a collection of charts \(\mathcal{A} = \{(U_\alpha, \varphi_\alpha)\}_{\alpha \in I}\) such that the domains cover \(M\): \(\bigcup_{\alpha \in I} U_\alpha = M\).
  • Two charts \((U_1, \varphi_1)\) and \((U_2, \varphi_2)\) are called \(C^\infty\)-compatible (or smoothly compatible) if either \(U_1 \cap U_2 = \emptyset\) or the transition map \[ \varphi_2 \circ \varphi_1^{-1}: \varphi_1(U_1 \cap U_2) \longrightarrow \varphi_2(U_1 \cap U_2) \] is a smooth diffeomorphism (i.e., infinitely differentiable with a smooth inverse) between open subsets of \(\mathbb{R}^n\).
  • A \(C^\infty\)-atlas is an atlas in which every two charts are \(C^\infty\)-compatible.

Matrix Manifolds for Weight Constraints

Manifold | Type | Dimension | Key Property | Typical Optimizer
Stiefel \(\operatorname{St}(n,m)\) | Hard constraint | \(mn - n(n+1)/2\) | \(X^\top X = I_n\) (orthonormal columns) | Riemannian SGD, QR retraction
Grassmann \(\operatorname{Gr}(k,n)\) | Hard constraint | \(k(n-k)\) | Subspaces: \(X \sim XQ,\; Q\in O(k)\) | Riemannian CG, chordal distance
Oblique \(\operatorname{OB}(m,n)\) | Hard constraint | \(m(n-1)\) | Each row has unit norm | Row‑wise normalisation
Orthogonal \(O(n)\) | Hard constraint | \(n(n-1)/2\) | \(Q^\top Q = I_n\) | Cayley transform, exponential map
Fixed‑rank \(\mathcal{M}(m,n,r)\) | Hard constraint | \(r(m+n-r)\) | \(\operatorname{rank}(X)=r\) | Riemannian GD, SVD retraction
Flag \(\operatorname{Fl}(d_1,\dots,d_k,n)\) | Hard constraint | \(\frac12(n^2-\sum_i (d_i-d_{i-1})^2)\) | Nested subspaces \(V_1\subset\cdots\subset V_k\) | Hierarchical Riemannian CG
PSD cone \(\mathcal{S}_+^n\) | Hard constraint (interior) | \(n(n+1)/2\) | \(X \succeq 0\) (positive semidefinite) | Log‑Euclidean, Bures‑Wasserstein
Fixed‑rank PSD | Hard constraint | \(r(2n-r+1)/2\) | \(X\succeq 0,\; \operatorname{rank}(X)=r\) | Bures‑Wasserstein gradient
Finsler (spectral norm) | Hard constraint (norm) | \(mn\) (ambient) | \(\|W\|_2 = 1\) (non‑Riemannian) | Muon, Manifold Muon
Stiefel manifold \(\operatorname{St}(n,m)\)

\[ \operatorname{St}(n,m) = \{ X \in \mathbb{R}^{m \times n} : X^\top X = I_n \},\quad m \ge n. \]

Tangent space: \(T_X \operatorname{St} = \{ \Delta \in \mathbb{R}^{m\times n} : X^\top \Delta + \Delta^\top X = 0 \}\), i.e. \(X^\top\Delta\) is skew‑symmetric.

Riemannian metric: Induced Euclidean metric \(\langle \Delta_1,\Delta_2\rangle = \operatorname{tr}(\Delta_1^\top\Delta_2)\).

Retraction (QR): \(\operatorname{Retr}_X(\Delta) = \operatorname{qf}(X+\Delta)\), where \(\operatorname{qf}\) returns the orthogonal factor of the QR decomposition.

Geodesic distance: \(d_g(X,Y) = \sqrt{\sum_{i=1}^n \theta_i^2}\), with \(\theta_i\) the principal angles between the subspaces spanned by \(X\) and \(Y\).

Application. Orthogonal layers in neural networks (GNP), PCA, independent component analysis.
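A NumPy sketch of one Riemannian descent step on St(3, 6): project an ambient gradient onto the tangent space, then retract with QR. The sizes, step length, and test gradient are arbitrary.

import numpy as np

def stiefel_project_tangent(X, G):
    """Project an ambient gradient G onto T_X St: subtract X sym(X^T G)
    so that X^T Delta becomes skew-symmetric."""
    XtG = X.T @ G
    return G - X @ (XtG + XtG.T) / 2.0

def qr_retract(X, Delta):
    """QR retraction: orthogonal factor of X + Delta, with column signs
    fixed so the diagonal of R is positive (making qf well defined)."""
    Q, R = np.linalg.qr(X + Delta)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(6, 3)))    # a point on St(3, 6)
G = rng.normal(size=(6, 3))                     # an ambient "gradient"
X_new = qr_retract(X, -0.1 * stiefel_project_tangent(X, G))
print(np.allclose(X_new.T @ X_new, np.eye(3)))  # True: still on the manifold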
Grassmann manifold \(\operatorname{Gr}(k,n)\)

\[ \operatorname{Gr}(k,n) = \operatorname{St}(k,n) / O(k) = \{ [X] : X \in \mathbb{R}^{n\times k},\ X^\top X = I_k \}, \] where \([X] = \{ XQ : Q \in O(k) \}\) is the equivalence class of orthonormal bases spanning the same \(k\)-dimensional subspace.

Tangent space (horizontal lift): \(\mathcal{H}_X = \{ \Delta \in \mathbb{R}^{n\times k} : X^\top \Delta = 0 \}\).

Chordal distance: \(d_{\text{ch}}([X],[Y]) = \frac{1}{\sqrt{2}}\|XX^\top - YY^\top\|_F = \sqrt{\sum_{i=1}^k \sin^2\theta_i}\).

Riemannian gradient: For a function \(f([X])\) lifted to \(\tilde f(X)\) on Stiefel, \(\operatorname{grad} f([X]) = (I - XX^\top)\nabla\tilde f(X)\).

Application. Subspace clustering, robust PCA, dimensionality reduction, Grassmannian deep networks.
Oblique manifold \(\operatorname{OB}(m,n)\)

\[ \operatorname{OB}(m,n) = \{ X \in \mathbb{R}^{m\times n} : \|X_{i,:}\|_2 = 1,\ i=1,\dots,m \}, \] i.e. each row has unit Euclidean norm. No orthogonality between rows.

Geometry: Product of \(m\) unit spheres \((S^{n-1})^m\). Tangent space at \(X\): \(T_X\operatorname{OB} = \{ \Delta : X_{i,:}^\top \Delta_{i,:} = 0,\ \forall i \}\).

Retraction: Row‑wise normalisation: \(\operatorname{Retr}_X(\Delta)_{i,:} = (X_{i,:}+\Delta_{i,:}) / \|X_{i,:}+\Delta_{i,:}\|_2\).

Geodesic distance: \(d_{\text{OB}}(X,Y) = \sqrt{\sum_{i=1}^m \arccos^2(\langle X_{i,:}, Y_{i,:}\rangle)}\).

Application. Weight normalization, simplex constraints (after softmax), independent component analysis.
Orthogonal group \(O(n)\) and \(SO(n)\)

\[ O(n) = \{ Q \in \mathbb{R}^{n\times n} : Q^\top Q = I_n \},\quad SO(n) = \{ Q \in O(n) : \det Q = +1 \}. \]

Tangent space at \(Q\): \(T_Q O(n) = \{ Q \Omega : \Omega^\top = -\Omega \}\) (skew‑symmetric matrices).

Exponential map: \(\exp_Q(\Delta) = Q \exp(Q^\top \Delta)\), where \(\exp\) is the matrix exponential.

Cayley retraction: \(\operatorname{Cayley}_Q(\Delta) = Q (I - A)^{-1}(I + A)\) with \(A = Q^\top\Delta\) skew‑symmetric (cheaper than exponential).

Application. Orthogonal weight matrices in RNNs, whitening layers, equivariant networks.
Fixed‑rank manifold \(\mathcal{M}(m,n,r)\)

\[ \mathcal{M}(m,n,r) = \{ X \in \mathbb{R}^{m\times n} : \operatorname{rank}(X) = r \},\quad 0 < r \le \min(m,n). \]

Tangent space at \(X = U\Sigma V^\top\) (compact SVD): \[ T_X\mathcal{M} = \{ U M V^\top + U_\perp A V^\top + U B V_\perp^\top : M \in \mathbb{R}^{r\times r},\ A\in\mathbb{R}^{(m-r)\times r},\ B\in\mathbb{R}^{r\times (n-r)} \}. \]

Retraction (SVD truncation): Let \(X_+ = X + \Delta\), compute its compact SVD \(U_+\Sigma_+V_+^\top\) and keep only the largest \(r\) singular values (set the rest to zero).

Bures‑Wasserstein distance (for PSD case): \(d_{\text{BW}}^2(X,Y) = \operatorname{tr}(X)+\operatorname{tr}(Y)-2\operatorname{tr}\bigl((X^{1/2}YX^{1/2})^{1/2}\bigr)\).

Application. Low‑rank matrix completion, LoRA (Low‑Rank Adaptation) in LLMs, reduced‑order modeling.
Flag manifold \(\operatorname{Fl}(d_1,\dots,d_k,n)\)

\[ \operatorname{Fl}(d_1,\dots,d_k,n) = \{ (V_1,\dots,V_k) : V_1 \subset V_2 \subset \cdots \subset V_k = \mathbb{R}^n,\ \dim V_i = d_i \}. \]

Representation: Choose an orthonormal matrix \(X \in O(n)\) such that the first \(d_1\) columns span \(V_1\), the next \(d_2-d_1\) columns extend to \(V_2\), etc. Then \(\operatorname{Fl} \cong O(n) / (O(d_1)\times O(d_2-d_1)\times\cdots\times O(n-d_k))\).

Tangent space: Matrices with block structure determined by the flag (certain blocks are skew‑symmetric, others arbitrary).

Distance (chordal average): \(d_{\text{Fl}}^2 = \sum_{i=1}^k \|P_{V_i} - P_{W_i}\|_F^2\), where \(P_{V_i}\) is the orthogonal projector onto \(V_i\).

Application. Hierarchical subspace learning, robust PCA with nested structure, multi‑scale data analysis.
Positive semidefinite cone \(\mathcal{S}_+^n\) (interior)

\[ \mathcal{S}_+^n = \{ X \in \mathbb{R}^{n\times n} : X = X^\top,\ X \succeq 0 \}. \] The interior \(\mathcal{S}_{++}^n\) is an open convex cone, a smooth manifold of dimension \(n(n+1)/2\).

Log‑Euclidean metric: \(g_X(\Delta_1,\Delta_2) = \langle \log'(X)[\Delta_1], \log'(X)[\Delta_2] \rangle\) leading to distance \(d_{\text{LE}}(X,Y) = \|\log(X) - \log(Y)\|_F\).

Affine‑invariant metric: \(g_X(\Delta_1,\Delta_2) = \operatorname{tr}(X^{-1}\Delta_1 X^{-1}\Delta_2)\). Geodesic: \(X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2}\).

Application. Covariance estimation, metric learning, SPD neural networks.
Fixed‑rank PSD manifold

\[ \mathcal{P}(n,r) = \{ X \in \mathbb{R}^{n\times n} : X = X^\top,\ X \succeq 0,\ \operatorname{rank}(X) = r \}. \] This is a smooth submanifold of \(\mathcal{S}_+^n\) (the boundary of the cone).

Factor representation: \(X = LL^\top\) with \(L \in \mathbb{R}^{n\times r}\). This representation has a gauge freedom: \(L \sim LQ\) for any \(Q \in O(r)\).

Bures‑Wasserstein gradient: For a function \(f(X)\), the Riemannian gradient in factor coordinates is \(\nabla_{L} f = 2 \nabla f(X) L\).

Application. Low‑rank covariance estimation, matrix completion with PSD constraints, variational inference.
Finsler manifold induced by the spectral norm

Let \(M = \mathbb{R}^{m\times n}\) (or a submanifold). Define the Finsler metric \(F(W, \Delta) = \|\Delta\|_2\) (the spectral norm), which is independent of the base point. This is a symmetric, non‑Riemannian norm because its unit ball (matrices with \(\|\Delta\|_2 \le 1\)) is not an ellipsoid (except for \(1\times 1\) matrices).

Steepest descent direction: For a gradient \(G = \nabla f(W)\) with reduced SVD \(G = U \Sigma V^\top\), the direction that minimises \(\langle G, D \rangle\) subject to \(\|D\|_2 = 1\) is \(D = -U V^\top = -\operatorname{msign}(G)\), built from all left and right singular vectors of \(G\), not just the top pair.

Muon optimizer: Update \(W \leftarrow W - \eta \, U V^\top\).

Manifold Muon: When restricting to a submanifold (e.g., Stiefel), project the Euclidean gradient onto the tangent space, compute its top singular vectors, project that direction back to the tangent space, then retract.

Application. Training neural networks with spectral‑norm constraints, GNP layers, Muon and Manifold Muon.
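A NumPy sketch of the Muon update via an explicit SVD-based msign; in practice msign is usually approximated with Newton–Schulz iterations for speed, but the SVD version makes the definition transparent. The matrices and step size are arbitrary.

import numpy as np

def msign(G):
    """Matrix sign via reduced SVD: U V^T, the steepest-descent direction
    (up to sign) under the spectral-norm geometry."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def muon_step(W, G, eta=0.1):
    """One Muon-style update: W <- W - eta * msign(G)."""
    return W - eta * msign(G)

rng = np.random.default_rng(0)
W, G = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
print(np.linalg.svd(msign(G), compute_uv=False))  # all singular values = 1
print(muon_step(W, G).shape)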

On a manifold, the choice of distance metric determines the geometry of the optimization landscape. The most fundamental is the Riemannian geodesic distance, defined as the infimum of lengths of all smooth curves connecting two points. For matrix manifolds, simpler alternatives are often used: the chordal distance on Grassmannians (based on projection matrices), the Procrustes distance on Stiefel (aligning bases via orthogonal transformations), and the Bures–Wasserstein distance for fixed‑rank positive semidefinite matrices. Optimizers on manifolds replace the Euclidean gradient with its Riemannian gradient (projection onto the tangent space) and use a retraction to map updates back to the manifold. Common optimizers include Riemannian SGD, Riemannian Adam, Cayley SGD (for orthogonal groups), Muon (steepest descent under the spectral norm, a Finsler geometry), Manifold Muon (extended to Stiefel), and the landing algorithm (soft constraint penalty). The table above lists these distances and optimizers alongside their associated manifolds.

Spherical Manifold – Definition & Hyperspherical Descent

Definition: The sphere (hypersphere) is the set of unit‑norm vectors in \(\mathbb{R}^d\):

\[ S^{d-1} = \{ w \in \mathbb{R}^d \mid \|w\|_2 = 1 \}. \]

Tangent space at \(w\): \( T_w S^{d-1} = \{ v \in \mathbb{R}^d \mid v^\top w = 0 \} \).
Riemannian metric: \( g_w(v_1, v_2) = v_1^\top v_2 \).
Geodesic distance: \( d(w_1, w_2) = \arccos(w_1^\top w_2) \).
Retraction (normalisation): \( \operatorname{Retr}_w(v) = \frac{w + v}{\|w + v\|_2} \).

Optimization problem: Minimise loss \(f(w)\) subject to \(w \in S^{d-1}\). Given Euclidean gradient \(g = \nabla f(w)\), find update \(a \in T_w S^{d-1}\) that:

  1. Minimises \(a^\top g\) (linearised loss change),
  2. Satisfies \(\|a\|_2 = \eta\) (learning rate),
  3. Lies in tangent space: \(a^\top w = 0\).

Lagrange multiplier solution yields:

\[ a_{\text{opt}} = -\eta \; \frac{g - (w^\top g) w}{\|g - (w^\top g) w\|_2}. \]

Step‑by‑step hyperspherical descent algorithm:

  1. Initialise \(w = w_0 / \|w_0\|_2\).
  2. Compute Euclidean gradient \(g = \nabla f(w)\).
  3. Project \(\tilde{g} = g - (w^\top g) w\).
  4. If \(\|\tilde{g}\|_2 = 0\), stop (stationary point).
  5. Compute update direction \(a = -\eta \cdot \tilde{g} / \|\tilde{g}\|_2\).
  6. Tentative update \(w_{\text{temp}} = w + a\).
  7. Retract \(w \leftarrow w_{\text{temp}} / \|w_{\text{temp}}\|_2\) (equivalent to \(w \leftarrow \frac{w - \eta \tilde{g}/\|\tilde{g}\|_2}{\sqrt{1+\eta^2}}\)).
  8. Repeat from step 2 until convergence.

Python implementation:

import numpy as np

def hyperspherical_descent(grad_f, w0, eta, num_iters):
    """Minimise f over the unit sphere via projected (Riemannian) descent.

    grad_f: callable returning the Euclidean gradient of f at w.
    w0: initial point (any nonzero vector; it is normalised onto the sphere).
    eta: step length taken in the tangent space.
    """
    w = w0 / np.linalg.norm(w0)              # step 1: start on the sphere
    for _ in range(num_iters):
        g = grad_f(w)                        # step 2: Euclidean gradient
        g_tilde = g - np.dot(w, g) * w       # step 3: tangent projection
        norm_g_tilde = np.linalg.norm(g_tilde)
        if norm_g_tilde == 0:
            break                            # step 4: stationary point
        a = -eta * (g_tilde / norm_g_tilde)  # step 5: unit-speed tangent step
        # steps 6-7: retraction by renormalisation; since a is orthogonal
        # to w with ||a|| = eta, this equals dividing by sqrt(1 + eta^2).
        w = (w + a) / np.linalg.norm(w + a)
    return w

Remarks: \(\eta\) is the Euclidean step length in the tangent space. For small \(\eta\), the retraction approximates Riemannian gradient descent. Convergence to a stationary point holds under standard Lipschitz gradient assumptions. The normalisation retraction is the simplest and most common.


Analytical Derivation – Optimal Update on the Hypersphere

We derive the optimal update from first principles using Lagrange multipliers, exactly as done for the Stiefel manifold in the Manifold Muon text.

Problem setup. Current point \(w \in S^{n-1} \subset \mathbb{R}^n\) with \(\|w\|_2 = 1\); Euclidean gradient \(g = \nabla f(w) \in \mathbb{R}^n\); learning rate \(\eta > 0\). We seek an update vector \(a \in \mathbb{R}^n\) that minimises the linearised loss change \(a^\top g\) subject to:

\[ \text{(i)}\; \|a\|_2 = \eta, \qquad \text{(ii)}\; a^\top w = 0. \]

1. Lagrangian formulation. Introduce multipliers \(\lambda\) (norm constraint) and \(\mu\) (orthogonality):

\[ \mathcal{L}(a,\lambda,\mu) = a^\top g + \frac{\lambda}{2}(a^\top a - \eta^2) + \mu (a^\top w). \]

Derivative w.r.t. \(a\): \(g + \lambda a + \mu w = 0 \;\Rightarrow\; a = -\frac{1}{\lambda}(g + \mu w).\)

2. Enforce constraints. From \(a^\top w = 0\):

\[ -\frac{1}{\lambda}(g^\top w + \mu (w^\top w)) = 0 \;\Rightarrow\; g^\top w + \mu = 0 \;\Rightarrow\; \mu = -g^\top w. \]

Thus \(a = -\frac{1}{\lambda}(g - (g^\top w) w)\). Define the tangent component:

\[ \tilde{g} = g - (g^\top w) w \in T_w S^{n-1}. \]

Now enforce \(\|a\|_2 = \eta\): \(\|-\frac{1}{\lambda} \tilde{g}\|_2 = \frac{\|\tilde{g}\|_2}{|\lambda|} = \eta \;\Rightarrow\; |\lambda| = \frac{\|\tilde{g}\|_2}{\eta}.\)
Choose \(\lambda > 0\) (descent direction) to obtain:

\[ a_{\text{opt}} = -\eta \,\frac{\tilde{g}}{\|\tilde{g}\|_2}. \]

3. Retraction (radial projection). After the update \(w' = w + a_{\text{opt}}\), the point leaves the sphere. Because \(a_{\text{opt}} \perp w\):

\[ \|w + a_{\text{opt}}\|_2^2 = \|w\|_2^2 + \|a_{\text{opt}}\|_2^2 = 1 + \eta^2. \]

Hence the normalised (retracted) point is

\[ w_{\text{new}} = \frac{w + a_{\text{opt}}}{\sqrt{1+\eta^2}}. \]

4. Full update rule. Substituting \(a_{\text{opt}}\):

\[ w_{\text{new}} = \frac{w - \eta \,\frac{\tilde{g}}{\|\tilde{g}\|_2}}{\sqrt{1+\eta^2}}, \qquad \tilde{g} = g - (g^\top w) w. \]

Equivalently, in a single line:

\[ \boxed{w \;\leftarrow\; \frac{w - \eta\,\frac{g - (w^\top g) w}{\|g - (w^\top g) w\|_2}}{\sqrt{1+\eta^2}}} \]
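
For concreteness, a small worked example with assumed values: take \(w = (1,0,0)^\top\), \(g = (1,2,0)^\top\), and \(\eta = 0.1\). Then

\[ \tilde{g} = g - (w^\top g)\,w = (0,2,0)^\top, \qquad \frac{\tilde{g}}{\|\tilde{g}\|_2} = (0,1,0)^\top, \]

\[ w_{\text{new}} = \frac{(1,0,0)^\top - 0.1\,(0,1,0)^\top}{\sqrt{1.01}} = \frac{(1,\,-0.1,\,0)^\top}{\sqrt{1.01}}, \]

and indeed \(\|w_{\text{new}}\|_2 = \sqrt{1.01}/\sqrt{1.01} = 1\), so the retracted point lies on the sphere.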

5. Comparison with Stiefel (Manifold Muon).

| Aspect | Sphere \(S^{n-1}\) | Stiefel \(\operatorname{St}(m,n)\) |
|---|---|---|
| Constraint | \(w^\top w = 1\) | \(W^\top W = I_n\) |
| Tangent condition | \(a^\top w = 0\) | \(A^\top W + W^\top A = 0\) |
| Norm used | Euclidean (\(\ell_2\)) | Spectral norm |
| Optimal update direction | \(-\eta \, \tilde{g} / \|\tilde{g}\|_2\) | \(-\eta \, \operatorname{msign}(G + 2W(\Lambda+\Lambda^\top))\) (dual ascent) |
| Retraction | \(w \leftarrow (w + a)/\sqrt{1+\eta^2}\) | \(W \leftarrow \operatorname{msign}(W + A)\) |

The spherical case is simpler because the norm constraint is isotropic and the tangency condition is a single scalar equation, so the multipliers can be eliminated in closed form. The Stiefel case requires solving a dual problem (ascent on \(\Lambda\)) because the spectral norm couples the columns non-trivially.
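
For reference, \(\operatorname{msign}\) here denotes the semi-orthogonal polar factor: if \(M = U\Sigma V^\top\) is a thin SVD, then \(\operatorname{msign}(M) = UV^\top\). Below is a minimal sketch of the corresponding Stiefel retraction under this convention; the helper names are illustrative, not taken from the blog.

import numpy as np

# Sketch of a Stiefel retraction via the polar factor, assuming the convention
# msign(M) = U @ V^T, where M = U @ diag(s) @ V^T is the thin SVD of M.
def msign(M):
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt  # nearest matrix with orthonormal columns (W^T W = I)

def stiefel_retract(W, A):
    # Retract the tentatively updated point W + A back onto the Stiefel manifold.
    return msign(W + A)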

6. Algorithm summary (hyperspherical descent).

Input: current \(w\) on sphere, gradient \(g\), learning rate \(\eta\).
Compute tangent gradient: \(\tilde{g} = g - (w^\top g) w\).
If \(\tilde{g} = 0\): stationary point → skip update.
Else set \(a = -\eta \, \tilde{g} / \|\tilde{g}\|_2\) and update \(w \leftarrow (w + a) / \sqrt{1+\eta^2}\).
Optionally renormalise for numerical stability.






[Figure: some example manifold shapes, showing the Oblique manifold and the Stiefel manifold.]

📚 References & Further Reading

Modular Manifolds (Thinking Machines AI blog)
Discusses weight constraints, Lipschitz robustness, the Stiefel-constrained Muon optimizer, and the notion of modular manifolds for scalable training.
Lipschitz Guarantees & Manifolds for Neural Networks (HAL thesis)
⚠️ Access restricted (an institutional login may be required). The thesis studies manifold constraints for Lipschitz properties in deep learning.
mHC: Manifold-Constrained Hyper-Connections (DeepSeek, arXiv 2026)
Projects residual mapping matrices onto a Stiefel manifold, ensuring gradient stability; reported to scale to 27B parameters with roughly 6.7% overhead. Tangent condition: \( X^\top Z + Z^\top X = 0 \).
📌 Related: the Stiefel manifold appears in both mHC and modular Muon; the spherical case derived above is a special case.


Definition 8 – Layer-wise composition bound (AutoLip)
Let the neural network be a composition of layers \(f = f_D \circ \cdots \circ f_1\). If each layer \(f_d\) is \(L_d\)-Lipschitz with respect to the \(\ell_2\)-norm, then the overall network satisfies
\[ \forall x,y \in \mathbb{R}^m, \quad \|f(x)-f(y)\|_2 \le \left( \prod_{d=1}^D L_d \right) \|x-y\|_2. \]
Hence an upper bound on the global Lipschitz constant is
\[ \widehat{L}_{\text{product}} = \prod_{d=1}^D L_d. \]
For an affine layer \(f_d(z) = W_d z + b_d\), the minimal Lipschitz constant is the spectral norm:
\[ L_d = \|W_d\|_2 = \sqrt{\lambda_{\max}(W_d^{\mathsf{T}}W_d)}. \]
Standard activations such as ReLU and tanh are 1-Lipschitz and contribute \(L_d = 1\); the sigmoid is \(1/4\)-Lipschitz. The AutoLip algorithm computes these per-layer bounds via automatic differentiation and combines them to obtain a global upper bound.
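As a minimal sketch (the three weight matrices below are hypothetical), the product bound for an MLP with 1-Lipschitz activations is simply the product of the spectral norms of its affine layers:

import numpy as np

# Product (AutoLip-style) upper bound for f(x) = W3 relu(W2 relu(W1 x)),
# with assumed random weights; relu is 1-Lipschitz, so only the W_d count.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 16))
W2 = rng.standard_normal((32, 64))
W3 = rng.standard_normal((1, 32))
L_hat_product = np.prod([np.linalg.norm(W, 2) for W in (W1, W2, W3)])  # spectral norms
print(L_hat_product)
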
Definition 9 – Linear relaxation (Decomon / auto‑LiRPA)
Linear relaxation methods (e.g., Decomon, auto-LiRPA) compute a local Lipschitz constant over a compact input region \(\mathcal{C} \subset \mathbb{R}^m\) (typically an \(\ell_\infty\)-ball). For a neuron with pre-activation \(z\) and activation \(\sigma\), the method replaces \(\sigma(z)\) by linear lower and upper bounding functions:
\[ \alpha z + \beta \le \sigma(z) \le \gamma z + \delta \quad \forall z \in [z_{\min}, z_{\max}], \]
where the coefficients \(\alpha,\beta,\gamma,\delta\) depend on the input bounds. Propagating such linear relaxations through the network yields convex bounds on the output difference:
\[ \|f(x)-f(y)\|_2 \le \widehat{L}_{\text{lin}} \|x-y\|_2 \quad \forall x,y \in \mathcal{C}. \]
The computed \(\widehat{L}_{\text{lin}}\) is an upper bound on the local Lipschitz constant over \(\mathcal{C}\). These methods are often faster than global SDP approaches and can be scaled to larger networks.
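As an illustrative sketch (the function name and the slope heuristic are assumptions, following the common "triangle" relaxation): for a ReLU neuron with pre-activation bounds \(l < 0 < u\), the chord through \((l, 0)\) and \((u, u)\) is a valid upper line, and any slope \(\alpha \in [0,1]\) through the origin is a valid lower line.

# "Triangle" relaxation of relu(z) for z in [l, u] with l < 0 < u:
#   lower:  alpha * z + 0        (any alpha in [0, 1])
#   upper:  gamma * z + delta    (the chord through (l, 0) and (u, u))
def relu_relaxation(l, u):
    gamma = u / (u - l)
    delta = -l * u / (u - l)
    alpha = 1.0 if u >= -l else 0.0  # common heuristic: keep the tighter slope
    return alpha, 0.0, gamma, delta  # (alpha, beta, gamma, delta) of Definition 9

print(relu_relaxation(-1.0, 2.0))  # -> (1.0, 0.0, 0.666..., 0.666...)
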
Definition 10 – LipSDP (semidefinite programming)
The LipSDP framework (Fazlyab et al., 2019) formulates the estimation of the global Lipschitz constant as a semidefinite program (SDP). Consider a network \(f(x) = W_D \,\sigma(\,\cdots W_1 x + b_1 \cdots) + b_D\) with layer-wise weight matrices \(W_i\) and an activation \(\sigma\) satisfying the incremental quadratic constraint
\[ \begin{bmatrix} \sigma(z) - \sigma(y) \\ z - y \end{bmatrix}^{\mathsf{T}} \! M \begin{bmatrix} \sigma(z) - \sigma(y) \\ z - y \end{bmatrix} \ge 0 \quad \forall z,y \in \mathbb{R}. \]
The Lipschitz constant \(L\) is the smallest number such that
\[ \|f(x)-f(y)\|_2^2 \le L^2 \|x-y\|_2^2 \quad \forall x,y. \]
Combining this inequality with the quadratic constraints on the activations leads to a matrix inequality of the form
\[ \Delta^{\mathsf{T}} \begin{bmatrix} W_D^{\mathsf{T}}W_D & 0 \\ 0 & -L^2 I \end{bmatrix} \Delta \le 0 \quad \forall \Delta \in \mathbb{R}^{n_1+\cdots+n_D}, \]
which can be relaxed to an SDP. The optimal value of the SDP yields an upper bound:
\[ \widehat{L}_{\text{SDP}} \ge L. \]
LipSDP is expressive enough to handle many activations (ReLU, tanh, GroupSort) by choosing appropriate symmetric matrices \(M\).
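Since \(\widehat{L}_{\text{SDP}}\) is an upper bound, it is useful to sandwich it against an empirical lower bound. The sampling sketch below (function name and defaults are assumptions for illustration) estimates \(\max \|f(x)-f(y)\|_2 / \|x-y\|_2\) over random pairs; any sampled ratio is a valid lower bound on \(L\), so it must never exceed a certified upper bound.

import numpy as np

# Monte-Carlo lower bound on the Lipschitz constant: every sampled ratio
# ||f(x) - f(y)|| / ||x - y|| lower-bounds L, so max over samples <= L_hat_SDP.
def empirical_lipschitz_lower_bound(f, dim, num_pairs=10000, scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(num_pairs):
        x = scale * rng.standard_normal(dim)
        y = scale * rng.standard_normal(dim)
        den = np.linalg.norm(x - y)
        if den > 0:
            best = max(best, np.linalg.norm(f(x) - f(y)) / den)
    return best
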
Definition 11 – Polyhedron propagation (ReLU networks)
For networks with ReLU activations, an exact Lipschitz bound over a convex input polytope \(\mathcal{P}_{\text{in}}\) can be obtained by forward propagation of polyhedra. Each ReLU neuron splits the input space into two regions (active or inactive), and the set of all possible activation patterns defines a partition. For a fixed pattern, the network becomes an affine map:
\[ f(x) = A_{\text{pattern}} x + b_{\text{pattern}}. \]
The Lipschitz constant over that pattern is \(\|A_{\text{pattern}}\|_2\), so the Lipschitz constant over \(\mathcal{P}_{\text{in}}\) is the supremum over all patterns reachable from \(\mathcal{P}_{\text{in}}\):
\[ \widehat{L}_{\text{poly}} = \max_{\text{patterns reachable}} \|A_{\text{pattern}}\|_2. \]
Computing this exactly is combinatorial, but efficient algorithms (e.g., Weng et al., 2018a) propagate convex polyhedra layer by layer, maintaining outer approximations that yield valid upper bounds.
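The combinatorial flavour can be seen on a toy one-hidden-layer ReLU network (the weights below are assumed for illustration): enumerating every 0/1 activation mask \(D\) and taking \(\max \|W_2 D W_1\|_2\) upper-bounds the maximum over the reachable patterns, since those are a subset of all \(2^h\) masks.

import numpy as np
from itertools import product

# Enumerate all 2^h activation patterns of f(x) = W2 relu(W1 x); for each fixed
# pattern the network is affine with Jacobian A = W2 @ D @ W1.
def pattern_lipschitz_bound(W1, W2):
    h = W1.shape[0]
    best = 0.0
    for mask in product([0.0, 1.0], repeat=h):
        A = W2 @ np.diag(mask) @ W1
        best = max(best, np.linalg.norm(A, 2))
    return best

W1 = np.array([[1.0, -2.0], [0.5, 1.0], [-1.0, 0.0]])  # 3 hidden units, 2 inputs
W2 = np.array([[2.0, -1.0, 0.5]])
print(pattern_lipschitz_bound(W1, W2))
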
Definition 12 – LiPopt (difference‑of‑convex)
LiPopt (Latorre et al., 2019) formulates Lipschitz constant estimation as a difference-of-convex programming problem. For a network with ReLU activations, the function can be expressed as a difference of two convex functions:
\[ f(x) = \varphi(x) - \psi(x), \]
where \(\varphi\) and \(\psi\) are convex. The Lipschitz constant is then bounded by the maximum norm of the subgradients:
\[ L \le \sup_x \max\bigl\{ \|\partial \varphi(x)\|_2,\; \|\partial \psi(x)\|_2 \bigr\}. \]
This leads to a sequence of convex optimization problems whose optimal values converge to an upper bound. LiPopt typically provides tighter bounds than simple product rules but is more computationally expensive.