Orthogonal gradient updates have emerged as a promising direction in optimization for machine learning. However, traditional approaches such as SVD/QR decomposition incur prohibitive computational costs of \(\mathcal{O}(n^{3})\) and underperform compared to well-tuned SGD with momentum, since momentum is applied only after strict orthogonalization. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and producing semi-orthogonal matrices via Newton–Schulz iterations, reducing complexity to \(\mathcal{O}(n^{2})\). Nevertheless, quadratic costs remain a bottleneck. In this work, we study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region, preserving directional information without requiring explicit semi-orthogonalization. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices, while still preserving structural alignment and reconditioning ill-posed updates. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared to Newton–Schulz methods. We further introduce a hybrid variant (Hybrid-AuON) that applies a single Newton–Schulz iteration. Experiments across vision and language benchmarks show that AuON and its hybrid variant achieve performance comparable to state-of-the-art optimizers. Code: github.com/ryyzn9/AuON.
“If you want to achieve extraordinary progress in AI, you should enhance the optimizer, as it fundamentally determines how models learn.”
Optimization in deep neural networks remains a central challenge, particularly due to the ill-conditioning of gradient and momentum updates. Empirically, these updates often exhibit a high condition number, with most of the energy concentrated in a few dominant directions. In practical terms, the update vectors are nearly low-rank: a handful of directions dictate the optimization trajectory while many potentially informative directions may be suppressed. This imbalance reminds us of a squashed ball that can only roll efficiently along a single axis, ignoring other pathways that may be equally important for generalization and representation learning. One solution is to make all update directions unit length; recent work has proposed orthogonalization of gradients and momentum updates to achieve this property.
By orthogonalizing an update matrix, we effectively discard the scaling information encoded in the singular values and modify each direction to enforce perpendicularity, redistributing the update length into unit vectors along different directions. In this sense, the resulting update behaves as a unit-norm update in the spectral domain, emphasizing the geometric structure of the optimization landscape rather than the raw gradient magnitudes. In simple terms, orthogonalization amplifies “rare directions” with small magnitude in the update but which are nevertheless important for learning. This perspective highlights how orthogonalization can prioritize exploration across all relevant directions, mitigating the dominance of a few high-energy components and facilitating more balanced learning dynamics.
Tuddenham et al. (2022) proposed an approach for neural network optimization in which the gradient is first orthogonalized via singular value decomposition (SVD), followed by the application of momentum, and then the resulting momentum term is used as the update. They refer to this method as Orthogonal-SGDM. In their experiments, they observed that even in the best-performing configuration, Orthogonal-SGDM was outperformed by a well-tuned standard SGD with momentum. This is because applying momentum after strict orthogonalization damages the momentum mechanism: orthogonalizing gradients before accumulation prevents momentum from effectively reducing variance and maintaining beneficial directional information. Moreover, strict orthogonality erases singular-value magnitudes and over-constrains the step, collapsing its singular-value structure to an isometry. In effect, the update becomes a spectral-norm–constrained move that discards useful magnitude information, wiping out correlations between update directions. Making all updates unit length may also increase harmful alignment effects.
Recent advances, such as Muon (Jordan, 2024), improve efficiency and performance by producing a semi-orthogonal matrix using Newton–Schulz iterations rather than a full orthogonal matrix using SVD, and by reordering the momentum update to occur before semi-orthogonalization. This reduces complexity to \(\mathcal{O}(n^2)\), but quadratic costs still remain a bottleneck.
In this paper, we focus on developing an alternative approach to bound updates with high condition numbers under a unit-norm constraint. Our goal is to achieve strong performance with \(\mathcal{O}(n)\) time complexity, without compromising efficiency or speed. Empirically, we find that normalization followed by a hyperbolic-cosine scaling transformation yields promising results.
By orthogonalizing an update matrix \(G \in \mathbb{R}^{m \times n}\) with singular value decomposition
\[
G = U \Sigma V^{\top},
\]
the update is replaced by its orthogonal polar factor
\[
Q = U V^{\top}.
\]
This satisfies
\[
Q^{\top} Q = I \quad (\text{for } m \ge n),
\]
thereby discarding the scaling information carried by the singular values \(\Sigma\) while preserving the directional subspaces encoded by the left and right singular vectors \(U\) and \(V\).
In this sense, the resulting update behaves as unit-norm in the spectral domain, with a flat singular spectrum, emphasizing the geometric structure of the optimization landscape rather than the raw gradient magnitudes.
Intuitively, this equalizes per-direction gain: directions that originally had small singular values (“rare directions”) are relatively amplified while dominant directions are relatively attenuated, promoting exploration across all relevant directions and mitigating the dominance of a few high-energy modes.
In practice, orientation and step size can be decoupled by using
\[
\Delta W = -\eta \, U V^{\top},
\]
so that scale is controlled externally by the learning rate \(\eta\) while orthogonalization enforces well-conditioned, balanced updates, yielding more stable and equitable learning dynamics compared to conventional gradient-descent steps.
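As a toy illustration (an example added here for exposition, not drawn from the paper's experiments), take a diagonal update whose two singular values differ by a factor of 30:
\[
G = \begin{pmatrix} 3 & 0 \\ 0 & 0.1 \end{pmatrix} = U \Sigma V^{\top},
\qquad U = V = I, \quad \Sigma = \mathrm{diag}(3,\, 0.1),
\qquad U V^{\top} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
\]
A raw gradient step moves 30 times further along the first direction than the second; the orthogonalized step moves equally along both, and the overall magnitude is supplied by the external scale \(\eta\).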
Given \(G \in \mathbb{R}^{m \times n}\) with singular value decomposition \(G = U \Sigma V^{\top}\), strict orthogonalization replaces \(G\) by its polar/Stiefel projection
\[
Q = U V^{\top} = \arg\min_{W^{\top} W = I} \|W - G\|_{F},
\]
collapsing the singular spectrum to \(\sigma_i(Q) = 1\) on the update subspace and making \(Q\) an isometry with \(Q^{\top} Q = I\), i.e., the Frobenius-nearest semi-orthogonal matrix that removes the amplitude information in \(\Sigma\).
In the singular basis, the update \(G = \sum_i \sigma_i \, u_i v_i^{\top}\) becomes
\[
Q = \sum_i u_i v_i^{\top}.
\]
Geometrically, for Muon’s RMS-to-RMS operator norm, we have
\[
\arg\max_{\|W\|_{\mathrm{RMS}\to\mathrm{RMS}} \le 1} \langle G, W \rangle \;\propto\; U V^{\top},
\]
which is the linear minimization oracle (LMO) of a conditional-gradient step. Hence, the singular values are flattened; by contrast, on the nuclear-norm ball the LMO yields the rank-1 solution \(u_{1} v_{1}^{\top}\).
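As a quick sanity check (an illustrative NumPy snippet, not taken from the AuON codebase), one can verify numerically that the polar factor has a flat singular spectrum and attains the nuclear norm \(\sum_i \sigma_i(G)\) as the maximal inner product over the unit spectral-norm ball:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))

# Polar factor via reduced SVD: Q = U V^T.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
Q = U @ Vt

print(np.linalg.svd(Q, compute_uv=False))   # flat spectrum: [1. 1. 1. 1.]
print(np.sum(G * Q), s.sum())               # <G, Q> equals the nuclear norm of G
```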
To avoid overconstraint, semi-orthogonal schemes such as Muon orthogonalize only the momentum \(M_t\) to
\[
Q_t \approx U_t V_t^{\top}, \qquad M_t = U_t \Sigma_t V_t^{\top},
\]
and decouple scale via an RMS-to-RMS factor \(\alpha\), giving
\[
W_{t+1} = W_t - \eta \, \alpha \, Q_t.
\]
In practice, \(Q_t\) is computed efficiently via a low-order Newton–Schulz iteration, and \(\alpha\) is chosen to match update RMS across shapes. Semi-orthogonalization stabilizes training by bounding spectral energy, equalizing directional gains, preventing overshoot, and enabling larger learning rates by decoupling orientation from scale.
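For reference, a minimal PyTorch-style sketch of such a low-order Newton–Schulz orthogonalization is shown below. The quintic coefficients (3.4445, -4.7750, 2.0315), the Frobenius pre-scaling, and the default of five steps follow the commonly cited public Muon implementation and should be read as assumptions rather than a specification of AuON.

```python
import torch

def newton_schulz(M: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the polar factor U V^T of M via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic coefficients (assumed from the public Muon code)
    X = M / (M.norm() + eps)                 # pre-scale so the spectral norm is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # X <- a X + b (XX^T) X + c (XX^T)^2 X
    return X.T if transposed else X
```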
Recent advances demonstrate that orthogonalized momentum in deep learning optimizers, particularly the Muon optimizer, admits a principled interpretation as the solution to a non-Euclidean trust-region subproblem under a spectral-norm constraint. The core update rule can be formulated as
\[
x_{k+1} = x_k - \eta_k \, \mathrm{Orth}(m_k),
\]
where \(\mathrm{Orth}(\cdot)\) denotes the SVD-based orthogonalization operator that computes
\[
\mathrm{Orth}(M) = U V^{\top} \quad \text{for} \quad M = U \Sigma V^{\top},
\]
yielding the steepest-descent direction under the spectral-norm metric.
Momentum Integration.
The momentum component follows the exponential moving average
\[
m_k = \beta \, m_{k-1} + (1 - \beta)\, g(x_k; \xi_k),
\]
where \(g(x_k;\xi_k)\) represents an unbiased stochastic gradient estimate. The orthogonalized update then solves the trust-region subproblem
\[
\min_{\Delta}\; \langle m_k, \Delta \rangle \quad \text{s.t.} \quad \|\Delta\|_{2} \le \eta_k,
\]
whose solution is \(\Delta^{\star} = -\eta_k\, \mathrm{Orth}(m_k)\).
This formulation explicitly constrains parameter updates within a trust region while ensuring the search direction maintains unit spectral norm.
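For completeness, the closed-form solution of this subproblem follows from von Neumann's trace inequality; the short derivation below is a standard argument restated here for the reader:
\[
\min_{\|\Delta\|_{2} \le \eta_k} \langle m_k, \Delta \rangle
= -\,\eta_k \max_{\|D\|_{2} \le 1} \operatorname{tr}\!\left(m_k^{\top} D\right)
\ge -\,\eta_k \sum_i \sigma_i(m_k),
\]
using \(\operatorname{tr}(A^{\top}B) \le \sum_i \sigma_i(A)\,\sigma_i(B)\) together with \(\sigma_i(D) \le 1\); the bound is attained at \(\Delta^{\star} = -\eta_k\, U V^{\top}\) for \(m_k = U \Sigma V^{\top}\), i.e., exactly the orthogonalized momentum step.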
Theoretical Advantages.
Applying orthogonalization to the accumulated momentum, rather than orthogonalizing gradients before accumulation, provides superior variance reduction compared to the alternative ordering. Because the momentum buffer is orthogonalized only at the moment it is used for the parameter update, the method preserves accumulated directional information while eliminating scale-dependent instabilities. This design enjoys theoretical convergence guarantees and yields empirical improvements in training stability across diverse architectures.
We hypothesize that forcing all update directions to unit length can be problematic, as not all directions contribute equally to optimization progress—some may be harmful (having a negative impact) or irrelevant to loss reduction. Our goal is to develop an alternative method that removes the harmful directions or alignments and preserves the beneficial properties of near semi-orthogonalization while selectively scaling directions under unit-norm: decrease the scales of rare update directions relative to dominant ones and keep them all under a unit-norm trust region. One solution is to apply a temperature-scaled softmax update matrix, followed by L2-renormalization, to bound the step under a trust region. But computing softmax may be problematic as it introduces computational bottlenecks, and it does not preserve the semi-orthogonal property that is needed for an optimizer like Muon.
We empirically find that normalization with the hyperbolic cosine (\(\cosh\)) helps us achieve a spectral-norm trust region and preserve near semi-orthogonal properties, yielding more stable and equitable learning dynamics compared to conventional gradient-descent steps. It stabilizes training by bounding spectral energy, equalizing directional gains, preventing overshoot along sharp curvature directions, reducing oscillations, and enabling larger learning rates by decoupling orientation from scale.
Our main goal is to keep all update directions under a unit spectral norm and remove the harmful directions. We empirically find that dividing the update matrix by the RMS magnitude of \(\cosh(\cdot)\) bounds dominant update directions and preserves near semi-orthogonal-like properties:
\[
U = \frac{\mathrm{update}}{\mathrm{rms} + 10^{-8}},
\qquad \text{where} \qquad
\mathrm{rms} = \frac{\|\cosh(\mathrm{update})\|_{F}}{\sqrt{N}},
\]
and \(N\) is the number of entries in the update matrix.
For large \(|z|\), \(\cosh(z)\) grows exponentially, while for small \(z\), \(\cosh(z) \approx 1 + z^{2}/2\). Thus, \(\cosh\) magnifies meaningful deviations while remaining symmetric and smooth. This encourages a spread of update energy across directions (diversity) without enforcing strict orthogonality.
Effect of the Hyperbolic Cosine RMS Magnitude.
Define \(\mathrm{rms} := \|\cosh(\mathrm{update})\|_{F}/\sqrt{N}\). Because \(\cosh\) is even and rapidly increasing in \(|x|\), heavy tails inflate \(\mathrm{rms}\), which reduces the overall step size when forming \(U := \mathrm{update}/(\mathrm{rms}+10^{-8})\). Crucially, \(\cosh\) is not applied to the propagated vector: \(U\) is a uniform rescaling of \(\mathrm{update}\), so the signs and all relative component ratios are preserved. This yields scale invariance with tail-aware damping, without per-coordinate reweighting.
Layman’s terms. First, fix the raw step’s size; then gauge how “spiky” it is using \(\cosh\); finally, shrink the whole step more if it looks spiky. The direction and proportions stay the same.
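A minimal PyTorch sketch of this transform is given below. The function name and the \(10^{-8}\) stabilizer mirror the definitions above; how the transform is wired into a full optimizer step (e.g., applied to the momentum buffer) is left out and should be treated as an assumption.

```python
import torch

def cosh_rms_scale(update: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Uniformly rescale `update` by the RMS magnitude of cosh(update).

    cosh is only used to *measure* how heavy-tailed the update is; it is never
    applied to the values that get propagated, so signs and relative component
    ratios are preserved exactly.
    """
    n = update.numel()                               # N = m * n
    rms = torch.cosh(update).norm() / n ** 0.5       # ||cosh(update)||_F / sqrt(N)
    return update / (rms + eps)                      # U = update / (rms + 1e-8)
```

In a Muon-style training loop, one could apply this to the accumulated momentum in place of the Newton–Schulz call, e.g. `p.add_(cosh_rms_scale(momentum), alpha=-lr)`; the exact integration is an assumption, not something the text above pins down.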
Near–Semi–Orthogonality.
Exact orthogonality. A matrix \(W \in \mathbb{R}^{m \times n}\) is orthogonal (semi-orthogonal if \(m \neq n\)) when
\[
W^{\top} W = I_{n} \quad (m \ge n) \qquad \text{or} \qquad W W^{\top} = I_{m} \quad (m \le n).
\]
Method equations. Let \(G \in \mathbb{R}^{m \times n}\) and \(N = mn\). Define
\[
\mathrm{rms} = \frac{\|\cosh(G)\|_{F}}{\sqrt{N}},
\qquad
U = \frac{G}{\mathrm{rms} + 10^{-8}}.
\]
Immediate implications: \(U\) is a positive scalar multiple of \(G\), so it shares the singular vectors of \(G\), its singular values are uniformly scaled to \(\sigma_i(G)/(\mathrm{rms} + 10^{-8})\), and the signs and relative component ratios of the update are preserved.
Norms and “balanced sphere.” Since \(U\) is a uniform rescaling of \(G\), its Frobenius norm is \(\|U\|_{F} = \|G\|_{F}/(\mathrm{rms} + 10^{-8})\), which is not fixed to any particular value. Thus, there is no unit-L2 or unit-RMS constraint; the overall step length decreases as the tail-sensitive scalar \(\mathrm{rms}\) increases.
Relation to near semi-orthogonality.
Off-diagonal correlations \(M_{ij} = \langle U_{:i}, U_{:j} \rangle\) are not explicitly zeroed, so the mapping promotes approximate isotropy rather than exact semi-orthogonality.
Practical implication. The update is scale-invariant and tail-aware: heavy tails trigger stronger shrinkage via \(\mathrm{rms}\), helping prevent blow-ups while preserving direction and internal proportions. When approximate isotropy is desired, pair with a lightweight correlation-suppressing operation.
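One simple way to quantify this approximate isotropy in practice (an illustrative diagnostic added here, not part of the method itself) is to inspect the off-diagonal entries of the Gram matrix \(U^{\top}U\) described above:

```python
import torch

def offdiag_energy(U: torch.Tensor) -> float:
    """Mean |<U_:i, U_:j>| over column pairs i != j, relative to the mean diagonal of U^T U.

    Returns 0 for an exactly (semi-)orthogonal matrix; small values indicate
    approximate isotropy of the scaled update.
    """
    M = U.T @ U                                  # M[i, j] = <U_:i, U_:j>
    n = M.shape[0]
    off = M - torch.diag(torch.diag(M))          # zero out the diagonal
    return (off.abs().sum() / (n * (n - 1)) / M.diag().mean()).item()
```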
The hybrid approach combines a single Newton–Schulz iteration with nonlinear reshaping via hyperbolic-cosine RMS scaling, improving performance while requiring only one iteration. This achieves near semi-orthogonality and bounds the updates under a spectral-norm trust region with far less computation than the five Newton–Schulz iterations used in Muon, while maintaining competitive performance compared to AdamW and Muon.
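Putting the two pieces together, a hedged sketch of the hybrid update is shown below. It reuses `newton_schulz` and `cosh_rms_scale` from the earlier sketches, and the ordering of the two operations is an assumption based on the description above, not a statement of the released implementation.

```python
import torch

def hybrid_auon_update(momentum: torch.Tensor) -> torch.Tensor:
    """Single Newton-Schulz iteration followed by cosh-RMS scaling (ordering assumed)."""
    X = newton_schulz(momentum, steps=1)   # one iteration instead of Muon's five
    return cosh_rms_scale(X)               # then the nonlinear RMS reshaping from AuON
```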
We evaluate our approach on 4×L4 GPUs using the SmolLM-Corpus dataset (500k tokens). The underlying model is a nanoGPT with FlashAttention-2, rotary position embeddings (RoPE), RMSNorm, and SwiGLU activations. For the Small configuration, we use a hidden size of 512, 6 layers, 8 attention heads, and a feed-forward dimension of 1536. Training is conducted for 6000 steps with a global batch size of 128.
We compare AuON, AdamW, Hybrid-AuON, and Muon under comparable conditions, with learning rates tuned separately: \(\eta_{\text{AdamW}} = 0.003\), \(\eta_{\text{AuON}} = 0.055\), \(\eta_{\text{Muon}} = 0.01\).
Optimizer | Total Params | Optimizer Params | Time (s) | Loss | Accuracy | PPL
---|---|---|---|---|---|---
AuON | 40,901,120 | 15,728,640 | 1919.2 | 0.4305 | 0.8667 | 1.54
AdamW | 40,901,120 | 40,901,120 | 1918.9 | 0.0686 | 0.9846 | 1.07
Hybrid-AuON | 40,901,120 | 15,728,640 | 2285.4 | 0.0422 | 0.9908 | 1.04
Muon | 40,901,120 | 15,728,640 | 2303.6 | 0.0375 | 0.9919 | 1.04
We evaluated AdamW and the proposed AuON optimizer on the CIFAR-10 dataset under a reduced-scale training protocol. The dataset was split into 15,000 training, 1,500 validation, and 5,000 test samples. A batch size of 32 was used initially.
Training configuration: 100 epochs, learning rate = \(1 \times 10^{-3}\), Muon LR = 0.055, weight decay = \(1 \times 10^{-4}\), momentum \((\beta_{1}, \beta_{2}) = (0.9, 0.99)\). The network contained 19.90M parameters.
Results (Test Accuracy):
Increasing the batch size from 32 to 128 improves performance:
In this paper, we developed a linear-time optimizer that enforces a unit spectral norm and leverages semi-orthogonal properties to stabilize training without requiring full semi-orthogonalization. Our experiments suggest that AuON and its hybrid variant may suffer from exploding attention logits on large-parameter models. Techniques such as QK-clipping may help mitigate this issue.
Empirically, scaling model parameters increased AuON’s accuracy to 92%, with further improvements on downstream tasks. As future work, we plan to evaluate AuON on larger-scale setups (e.g., NanoGPT on H100 GPUs) to assess its performance in practical training scenarios.
Maity, Dipan (2025). AuON: A Survey For Linear-time Orthogonal Optimizer. Zenodo. DOI: 10.5281/zenodo.17176620
@misc{maity2025auon,
  author    = {Maity, Dipan},
  title     = {AuON: A Survey For Linear-time Orthogonal Optimizer},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17176620},
  url       = {https://doi.org/10.5281/zenodo.17176620}
}
Tip: paste the BibTeX into your `references.bib` (or equivalent) and cite with \cite{maity2025auon}.