Orthogonal gradient updates have emerged as a promising direction in optimization for machine learning. However, traditional approaches such as SVD/QR decomposition incur prohibitive computational costs of \(\mathcal{O}(n^{3})\) and underperform compared to well-tuned SGD with momentum, since momentum is applied only after strict orthogonalization. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and producing semi-orthogonal matrices via Newton–Schulz iterations, reducing complexity to \(\mathcal{O}(n^{2})\). Nevertheless, quadratic costs remain a bottleneck. In this work, we study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region, preserving directional information without requiring explicit semi-orthogonalization. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices, while still preserving structural alignment and reconditioning ill-posed updates. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared to Newton–Schulz methods. We further introduce a hybrid variant (Hybrid-AuON) that applies a single Newton–Schulz iteration. Experiments across vision and language benchmarks show that AuON and its hybrid variant achieve performance comparable to state-of-the-art optimizers. Code: github.com/ryyzn9/AuON.
“If you want to achieve extraordinary progress in AI, you should enhance the optimizer, as it fundamentally determines how models learn.”
Optimization in deep neural networks remains a central challenge, particularly due to the ill-conditioning of gradient and momentum updates. Empirically, these updates often exhibit a high condition number, with most of the energy concentrated in a few dominant directions. In practical terms, the update vectors are nearly low-rank: a handful of directions dictate the optimization trajectory while many potentially informative directions may be suppressed. This imbalance reminds us of a squashed ball that can only roll efficiently along a single axis, ignoring other pathways that may be equally important for generalization and representation learning. One solution is to make all update directions unit length; recent work has proposed orthogonalization of gradients and momentum updates to achieve this property.
By orthogonalizing an update matrix, we effectively discard the scaling information encoded in the singular values and modify each direction to enforce perpendicularity, redistributing the update length into unit vectors along different directions. In this sense, the resulting update behaves as a unit-norm update in the spectral domain, emphasizing the geometric structure of the optimization landscape rather than the raw gradient magnitudes. In simple terms, orthogonalization amplifies “rare directions” with small magnitude in the update but which are nevertheless important for learning. This perspective highlights how orthogonalization can prioritize exploration across all relevant directions, mitigating the dominance of a few high-energy components and facilitating more balanced learning dynamics.
Tuddenham et al. (2022) proposed an approach for neural network optimization in which the gradient is first orthogonalized via singular value decomposition (SVD), followed by the application of momentum, and then the resulting momentum term is used as the update. They refer to this method as Orthogonal-SGDM. In their experiments, they observed that even in the best-performing configuration, Orthogonal-SGDM was outperformed by a well-tuned standard SGD with momentum. This is because applying momentum after strict orthogonalization damages the momentum mechanism: orthogonalizing gradients before accumulation prevents momentum from effectively reducing variance and maintaining beneficial directional information. Moreover, strict orthogonality erases singular-value magnitudes and over-constrains the step, collapsing its singular-value structure to an isometry. In effect, the update becomes a spectral-norm–constrained move that discards useful magnitude information, wiping out correlations between update directions. Making all updates unit length may also increase harmful alignment effects.
Recent advances, such as Muon (Jordan, 2024), improve efficiency and performance by producing a semi-orthogonal matrix using Newton–Schulz iterations rather than a full orthogonal matrix using SVD, and by reordering the momentum update to occur before semi-orthogonalization. This reduces complexity to \(\mathcal{O}(n^2)\), but quadratic costs still remain a bottleneck.
In this paper, we focus on developing an alternative approach to bound updates with high condition numbers under a unit-norm constraint. Our goal is to achieve strong performance with \(\mathcal{O}(n)\) time complexity, without compromising efficiency or speed. Empirically, we find that normalization followed by a hyperbolic-cosine scaling transformation yields promising results.
By orthogonalizing an update matrix \(G \in \mathbb{R}^{m \times n}\) with singular value decomposition
\[
G = U \Sigma V^{\top},
\]
the update is replaced by its orthogonal polar factor
\[
Q = U V^{\top}.
\]
This satisfies
\[
Q^{\top} Q = I \quad (\text{for } m \ge n),
\]
thereby discarding the scaling information carried by the singular values \(\Sigma\) while preserving the directional subspaces encoded by the left and right singular vectors \(U\) and \(V\).
In this sense, the resulting update behaves as unit-norm in the spectral domain, with a flat singular spectrum, emphasizing the geometric structure of the optimization landscape rather than the raw gradient magnitudes.
Intuitively, this equalizes per-direction gain: directions that originally had small singular values (“rare directions”) are relatively amplified while dominant directions are relatively attenuated, promoting exploration across all relevant directions and mitigating the dominance of a few high-energy modes.
In practice, orientation and step size can be decoupled by using
\[
\Delta W = -\eta \, U V^{\top},
\]
so that scale is controlled externally by the learning rate \(\eta\) while orthogonalization enforces well-conditioned, balanced updates, yielding more stable and equitable learning dynamics compared to conventional gradient-descent steps.
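As a toy illustration (an example added here for exposition, not drawn from the paper's experiments), take a diagonal update whose two singular values differ by a factor of 30:
\[
G = \begin{pmatrix} 3 & 0 \\ 0 & 0.1 \end{pmatrix} = U \Sigma V^{\top},
\qquad U = V = I, \quad \Sigma = \mathrm{diag}(3,\, 0.1),
\qquad U V^{\top} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
A raw gradient step moves 30 times further along the first direction than the second; the orthogonalized step moves equally along both, and the overall magnitude is supplied by the external scale \(\eta\).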
Given \(G \in \mathbb{R}^{m \times n}\) with singular value decomposition \(G = U \Sigma V^{\top}\), strict orthogonalization replaces \(G\) by its polar/Stiefel projection
\[
Q = U V^{\top} = \arg\min_{W^{\top} W = I} \|W - G\|_{F},
\]
collapsing the singular spectrum to \(\sigma_i(Q) = 1\) on the update subspace and making \(Q\) an isometry with \(Q^{\top} Q = I\), i.e., the Frobenius-nearest semi-orthogonal matrix that removes the amplitude information in \(\Sigma\).
In the singular basis, the update \(G = \sum_i \sigma_i \, u_i v_i^{\top}\) becomes
\[
Q = \sum_i u_i v_i^{\top}.
\]
Geometrically, for Muon’s RMS-to-RMS operator norm, we have
\[
\arg\max_{\|W\|_{\mathrm{RMS}\to\mathrm{RMS}} \le 1} \langle G, W \rangle \;\propto\; U V^{\top},
\]
which is the linear minimization oracle (LMO) of a conditional-gradient step. Hence, the singular values are flattened; by contrast, on the nuclear-norm ball the LMO yields the rank-1 solution \(u_{1} v_{1}^{\top}\).
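As a quick sanity check (an illustrative NumPy snippet, not taken from the AuON codebase), one can verify numerically that the polar factor has a flat singular spectrum and attains the nuclear norm \(\sum_i \sigma_i(G)\) as the maximal inner product over the unit spectral-norm ball:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))

# Polar factor via reduced SVD: Q = U V^T.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
Q = U @ Vt

print(np.linalg.svd(Q, compute_uv=False))   # flat spectrum: [1. 1. 1. 1.]
print(np.sum(G * Q), s.sum())               # <G, Q> equals the nuclear norm of G
```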
To avoid overconstraint, semi-orthogonal schemes such as Muon orthogonalize only the momentum \(M_t\) to
\[
Q_t \approx U_t V_t^{\top}, \qquad M_t = U_t \Sigma_t V_t^{\top},
\]
and decouple scale via an RMS-to-RMS factor \(\alpha\), giving
\[
W_{t+1} = W_t - \eta \, \alpha \, Q_t.
\]
In practice, \(Q_t\) is computed efficiently via a low-order Newton–Schulz iteration, and \(\alpha\) is chosen to match update RMS across shapes. Semi-orthogonalization stabilizes training by bounding spectral energy, equalizing directional gains, preventing overshoot, and enabling larger learning rates by decoupling orientation from scale.
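For reference, a minimal PyTorch-style sketch of such a low-order Newton–Schulz orthogonalization is shown below. The quintic coefficients (3.4445, -4.7750, 2.0315), the Frobenius pre-scaling, and the default of five steps follow the commonly cited public Muon implementation and should be read as assumptions rather than a specification of AuON.

```python
import torch

def newton_schulz(M: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the polar factor U V^T of M via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic coefficients (assumed from the public Muon code)
    X = M / (M.norm() + eps)                 # pre-scale so the spectral norm is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # X <- a X + b (XX^T) X + c (XX^T)^2 X
    return X.T if transposed else X
```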
Recent advances demonstrate that orthogonalized momentum in deep learning optimizers, particularly the Muon optimizer, admits a principled interpretation as the solution to a non-Euclidean trust-region subproblem under a spectral-norm constraint. The core update rule can be formulated as
\[
x_{k+1} = x_k - \eta_k \, \mathrm{Orth}(m_k),
\]
where \(\mathrm{Orth}(\cdot)\) denotes the SVD-based orthogonalization operator that computes
\[
\mathrm{Orth}(M) = U V^{\top} \quad \text{for} \quad M = U \Sigma V^{\top},
\]
yielding the steepest-descent direction under the spectral-norm metric.
Momentum Integration.
The momentum component follows the exponential moving average
\[
m_k = \beta \, m_{k-1} + (1 - \beta)\, g(x_k; \xi_k),
\]
where \(g(x_k;\xi_k)\) represents an unbiased stochastic gradient estimate. The orthogonalized update then solves the trust-region subproblem
\[
\min_{\Delta}\; \langle m_k, \Delta \rangle \quad \text{s.t.} \quad \|\Delta\|_{2} \le \eta_k,
\]
whose solution is \(\Delta^{\star} = -\eta_k\, \mathrm{Orth}(m_k)\).
This formulation explicitly constrains parameter updates within a trust region while ensuring the search direction maintains unit spectral norm.
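For completeness, the closed-form solution of this subproblem follows from von Neumann's trace inequality; the short derivation below is a standard argument restated here for the reader:
\[
\min_{\|\Delta\|_{2} \le \eta_k} \langle m_k, \Delta \rangle
= -\,\eta_k \max_{\|D\|_{2} \le 1} \operatorname{tr}\!\left(m_k^{\top} D\right)
\ge -\,\eta_k \sum_i \sigma_i(m_k),
\]
using \(\operatorname{tr}(A^{\top}B) \le \sum_i \sigma_i(A)\,\sigma_i(B)\) together with \(\sigma_i(D) \le 1\); the bound is attained at \(\Delta^{\star} = -\eta_k\, U V^{\top}\) for \(m_k = U \Sigma V^{\top}\), i.e., exactly the orthogonalized momentum step.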
Theoretical Advantages.
Applying orthogonalization to the accumulated momentum, rather than orthogonalizing gradients before accumulation, provides superior variance reduction compared to the alternative ordering. Because the momentum buffer is orthogonalized only at the moment it is used for the parameter update, the method preserves accumulated directional information while eliminating scale-dependent instabilities. This design enjoys theoretical convergence guarantees and yields empirical improvements in training stability across diverse architectures.
We hypothesize that forcing all update directions to unit length can be problematic, as not all directions contribute equally to optimization progress—some may be harmful (having a negative impact) or irrelevant to loss reduction. Our goal is to develop an alternative method that removes the harmful directions or alignments and preserves the beneficial properties of near semi-orthogonalization while selectively scaling directions under unit-norm: decrease the scales of rare update directions relative to dominant ones and keep them all under a unit-norm trust region. One solution is to apply a temperature-scaled softmax update matrix, followed by L2-renormalization, to bound the step under a trust region. But computing softmax may be problematic as it introduces computational bottlenecks, and it does not preserve the semi-orthogonal property that is needed for an optimizer like Muon.
We empirically find that normalization with the hyperbolic cosine (\(\cosh\)) helps us achieve a spectral-norm trust region and preserve near semi-orthogonal properties, yielding more stable and equitable learning dynamics compared to conventional gradient-descent steps. It stabilizes training by bounding spectral energy, equalizing directional gains, preventing overshoot along sharp curvature directions, reducing oscillations, and enabling larger learning rates by decoupling orientation from scale.
Our main goal is to keep all update directions under a unit spectral norm and remove the harmful directions. We empirically find that dividing the update matrix by the RMS magnitude of \(\cosh(\cdot)\) bounds dominant update directions and preserves near semi-orthogonal-like properties:
\[
U = \frac{\mathrm{update}}{\mathrm{rms} + 10^{-8}},
\qquad \text{where} \qquad
\mathrm{rms} = \frac{\|\cosh(\mathrm{update})\|_{F}}{\sqrt{N}},
\]
and \(N\) is the number of entries in the update matrix.
For large \(|z|\), \(\cosh(z)\) grows exponentially, while for small \(z\), \(\cosh(z) \approx 1 + z^{2}/2\). Thus, \(\cosh\) magnifies meaningful deviations while remaining symmetric and smooth. This encourages a spread of update energy across directions (diversity) without enforcing strict orthogonality.
Effect of the Hyperbolic Cosine RMS Magnitude.
Define \(\mathrm{rms} := \|\cosh(\mathrm{update})\|_{F}/\sqrt{N}\). Because \(\cosh\) is even and rapidly increasing in \(|x|\), heavy tails inflate \(\mathrm{rms}\), which reduces the overall step size when forming \(U := \mathrm{update}/(\mathrm{rms}+10^{-8})\). Crucially, \(\cosh\) is not applied to the propagated vector: \(U\) is a uniform rescaling of \(\mathrm{update}\), so the signs and all relative component ratios are preserved. This yields scale invariance with tail-aware damping, without per-coordinate reweighting.
Layman’s terms. First, fix the raw step’s size; then gauge how “spiky” it is using \(\cosh\); finally, shrink the whole step more if it looks spiky. The direction and proportions stay the same.
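A minimal PyTorch sketch of this transform is given below. The function name and the \(10^{-8}\) stabilizer mirror the definitions above; how the transform is wired into a full optimizer step (e.g., applied to the momentum buffer) is left out and should be treated as an assumption.

```python
import torch

def cosh_rms_scale(update: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Uniformly rescale `update` by the RMS magnitude of cosh(update).

    cosh is only used to *measure* how heavy-tailed the update is; it is never
    applied to the values that get propagated, so signs and relative component
    ratios are preserved exactly.
    """
    n = update.numel()                               # N = m * n
    rms = torch.cosh(update).norm() / n ** 0.5       # ||cosh(update)||_F / sqrt(N)
    return update / (rms + eps)                      # U = update / (rms + 1e-8)
```

In a Muon-style training loop, one could apply this to the accumulated momentum in place of the Newton–Schulz call, e.g. `p.add_(cosh_rms_scale(momentum), alpha=-lr)`; the exact integration is an assumption, not something the text above pins down.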
Near–Semi–Orthogonality.
Exact orthogonality. A matrix \(W \in \mathbb{R}^{m \times n}\) is orthogonal (semi-orthogonal if \(m \neq n\)) when
\[
W^{\top} W = I_{n} \quad (m \ge n) \qquad \text{or} \qquad W W^{\top} = I_{m} \quad (m \le n).
\]
Method equations. Let \(G \in \mathbb{R}^{m \times n}\) and \(N = mn\). Define
\[
\mathrm{rms} = \frac{\|\cosh(G)\|_{F}}{\sqrt{N}},
\qquad
U = \frac{G}{\mathrm{rms} + 10^{-8}}.
\]
Immediate implications: \(U\) is a positive scalar multiple of \(G\), so it shares the singular vectors of \(G\), its singular values are uniformly scaled to \(\sigma_i(G)/(\mathrm{rms} + 10^{-8})\), and the signs and relative component ratios of the update are preserved.
Norms and “balanced sphere.” Since \(U\) is a uniform rescaling of \(G\), its Frobenius norm is \(\|U\|_{F} = \|G\|_{F}/(\mathrm{rms} + 10^{-8})\), which is not fixed to any particular value. Thus, there is no unit-L2 or unit-RMS constraint; the overall step length decreases as the tail-sensitive scalar \(\mathrm{rms}\) increases.
Relation to near semi-orthogonality.
Off-diagonal correlations \(M_{ij} = \langle U_{:i}, U_{:j} \rangle\) are not explicitly zeroed, so the mapping promotes approximate isotropy rather than exact semi-orthogonality.
Practical implication. The update is scale-invariant and tail-aware: heavy tails trigger stronger shrinkage via \(\mathrm{rms}\), helping prevent blow-ups while preserving direction and internal proportions. When approximate isotropy is desired, pair with a lightweight correlation-suppressing operation.
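One simple way to quantify this approximate isotropy in practice (an illustrative diagnostic added here, not part of the method itself) is to inspect the off-diagonal entries of the Gram matrix \(U^{\top}U\) described above:

```python
import torch

def offdiag_energy(U: torch.Tensor) -> float:
    """Mean |<U_:i, U_:j>| over column pairs i != j, relative to the mean diagonal of U^T U.

    Returns 0 for an exactly (semi-)orthogonal matrix; small values indicate
    approximate isotropy of the scaled update.
    """
    M = U.T @ U                                  # M[i, j] = <U_:i, U_:j>
    n = M.shape[0]
    off = M - torch.diag(torch.diag(M))          # zero out the diagonal
    return (off.abs().sum() / (n * (n - 1)) / M.diag().mean()).item()
```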
The hybrid approach combines a single Newton–Schulz iteration with nonlinear reshaping via hyperbolic-cosine RMS scaling, improving performance while requiring only one iteration. This achieves near semi-orthogonality and bounds the updates under a spectral-norm trust region with far less computation than the five Newton–Schulz iterations used in Muon, while maintaining competitive performance compared to AdamW and Muon.
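Putting the two pieces together, a hedged sketch of the hybrid update is shown below. It reuses `newton_schulz` and `cosh_rms_scale` from the earlier sketches, and the ordering of the two operations is an assumption based on the description above, not a statement of the released implementation.

```python
import torch

def hybrid_auon_update(momentum: torch.Tensor) -> torch.Tensor:
    """Single Newton-Schulz iteration followed by cosh-RMS scaling (ordering assumed)."""
    X = newton_schulz(momentum, steps=1)   # one iteration instead of Muon's five
    return cosh_rms_scale(X)               # then the nonlinear RMS reshaping from AuON
```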
We evaluate our approach on 4×L4 GPUs using the SmolLM-Corpus dataset (500k tokens). The underlying model is a nanoGPT with FlashAttention-2, rotary position embeddings (RoPE), RMSNorm, and SwiGLU activations. For the Small configuration, we use a hidden size of 512, 6 layers, 8 attention heads, and a feed-forward dimension of 1536. Training is conducted for 6000 steps with a global batch size of 128.
We compare AuON, AdamW, Hybrid-AuON, and Muon under comparable conditions, with learning rates tuned separately: \(\eta_{\text{AdamW}} = 0.003\), \(\eta_{\text{AuON}} = 0.055\), \(\eta_{\text{Muon}} = 0.01\).
Optimizer | Total Params | Optimizer Params | Time (s) | Loss | Accuracy | PPL
---|---|---|---|---|---|---
AuON | 40,901,120 | 15,728,640 | 1919.2 | 0.4305 | 0.8667 | 1.54
AdamW | 40,901,120 | 40,901,120 | 1918.9 | 0.0686 | 0.9846 | 1.07
Hybrid-AuON | 40,901,120 | 15,728,640 | 2285.4 | 0.0422 | 0.9908 | 1.04
Muon | 40,901,120 | 15,728,640 | 2303.6 | 0.0375 | 0.9919 | 1.04
We evaluated AdamW and the proposed AuON optimizer on the CIFAR-10 dataset under a reduced-scale training protocol. The dataset was split into 15,000 training, 1,500 validation, and 5,000 test samples. A batch size of 32 was used initially.
Training configuration: 100 epochs, learning rate = \(1 \times 10^{-3}\), Muon LR = 0.055, weight decay = \(1 \times 10^{-4}\), momentum \((\beta_{1}, \beta_{2}) = (0.9, 0.99)\). The network contained 19.90M parameters.
Results (Test Accuracy):
Increasing the batch size from 32 to 128 improves performance:
In this paper, we developed a linear-time optimizer that enforces a unit spectral norm and leverages semi-orthogonal properties to stabilize training without requiring full semi-orthogonalization. Our experiments suggest that AuON and its hybrid variant may suffer from exploding attention logits on large-parameter models. Techniques such as QK-clipping may help mitigate this issue.
Empirically, scaling model parameters increased AuON’s accuracy to 92%, with further improvements on downstream tasks. As future work, we plan to evaluate AuON on larger-scale setups (e.g., NanoGPT on H100 GPUs) to assess its performance in practical training scenarios.
Maity, Dipan (2025). AuON: A Survey For Linear-time Orthogonal Optimizer. Zenodo. DOI: 10.5281/zenodo.17176620
@misc{maity2025auon,
  author    = {Maity, Dipan},
  title     = {AuON: A Survey For Linear-time Orthogonal Optimizer},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17176620},
  url       = {https://doi.org/10.5281/zenodo.17176620}
}
Tip: paste the BibTeX into your `references.bib` (or equivalent) and cite with \cite{maity2025auon}.