Open Access

Gradient Descent for Unbounded Convex Functions on Hadamard Manifolds and Its Applications to Scaling Problems

Hiroshi Hirai (Corresponding Author)
https://orcid.org/0000-0002-4784-5110
Graduate School of Mathematics, Nagoya University, Nagoya 464-8602, Japan

Keiya Sakabe
https://orcid.org/0009-0003-8894-4400
Faculty of Physics, Ludwig-Maximilians-Universität München, 80539 Munich, Germany

Published Online: 24 Apr 2026

https://doi.org/10.1287/moor.2025.0939

Abstract

In this paper, we study the asymptotic behavior of continuous- and discrete-time gradient flows of a “lower-unbounded” convex function f on a Hadamard manifold M, particularly their convergence properties to the boundary $M^{\infty}$ at infinity of M. We establish a duality theorem that the infimum of the gradient-norm $‖ \nabla f (x) ‖$ of f over M is equal to the supremum of the negative of the recession function $f^{\infty}$ of f over the boundary $M^{\infty}$ , provided the infimum is positive. Further, the infimum and the supremum are obtained by the limit of the gradient flow of f. Our results feature convex optimization ingredients of the moment-weight inequality for reductive group actions, and are applied to noncommutative optimization. We show that gradient descent of the Kempf-Ness function for an unstable orbit converges to a destabilizing 1-parameter subgroup in the Hilbert-Mumford criterion, and the associated moment-map sequence converges to the minimum-norm point of the moment polytope. We show further refinements for operator scaling—the left-right action on a matrix tuple $A = (A_{1}, A_{2}, \dots, A_{N})$ . We characterize the gradient-flow limit of operator scaling by a vector-space generalization of the classical Dulmage-Mendelsohn decomposition of a bipartite graph. For a special case of $N = 2$ , we reveal that the limit determines the Kronecker canonical form of a matrix pencil $s A_{1} + A_{2}$ .

Funding: H. Hirai was supported by the Japan Society for the Promotion of Science (KAKENHI [Grants JP21K19759 and JP24K21315]). K. Sakabe was supported by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation [Grant 556164098]) and the European Research Council (ERC Starting Grant SYMOPTIC [Grant 101040907]).

1. Introduction

In convex optimization, it is typically assumed that the objective function f is bounded below. The performance of a minimization algorithm is evaluated by its convergence behavior to the minimum of f. This paper addresses the convergence behavior of minimization algorithms for a “lower-unbounded” convex function f, that is, $\inf f (x) = - \infty$ . This may look meaningless, because the trajectory $x_{i}$ of an algorithm diverges to infinity, and $f (x_{i})$ goes to $- \infty$ . The meta question of the paper is the following:

What can we gain from such a divergent sequence?

Let us formalize our setting and mention its background. Let M be a Hadamard manifold—a simply connected complete Riemannian manifold with nonpositive sectional curvature. Let $f : M \to R$ be a (twice differentiable) geodesically convex function, that is, f is convex along any geodesic. We consider the following unconstrained convex optimization problem on M:

\text{inf.} \quad f (x) \quad \text{s.t.} \quad x \in M, \quad \text{where } f \text{ can be lower-unbounded.}

(1.1)

Such a problem setting is significant in the recent progress on operator scaling (Gurvits [27]) and generalizations (see Allen-Zhu et al. [1], Bürgisser et al. [12], Bürgisser et al. [13], Garg and Oliveira [21], Garg et al. [22, 23], and Hirai et al. [34]). In the classical matrix scaling (Sinkhorn [58]), the scalability is equivalent to the boundedness of (1.1) for some convex function f in $R^{n}$ . Further, it is also equivalent to the perfect-matching condition of the associated bipartite graph. Hayashi et al. [29] studied asymptotic behavior of the Sinkhorn algorithm for the unscalable (unbounded) case, and revealed that a combinatorial certificate (Hall blocker) of unscalability can be identified from divergent behavior of the Sinkhorn algorithm. Although a Hall blocker is easily obtained by network-flow algorithms, finding the corresponding certificate (shrunk subspace) for the operator scaling setting is possible but quite difficult (see Hamada and Hirai [28], Ivanyos et al. [37, 38]). Just recently, Franks et al. [19] modified the operator Sinkhorn algorithm (an alternating minimization algorithm for some convex function on the Hadamard manifold of positive definite matrices) to obtain a shrunk subspace in polynomial time, although it is still rather complicated. The matrix and operator scaling problems are generalized to a class of convex optimization involving reductive group actions, called noncommutative optimization (Bürgisser et al. [13]), which asks to minimize the Kempf-Ness function associated with an orbit of the action. This is formulated as a convex optimization problem on a representative class of Hadamard manifolds, namely symmetric spaces of nonpositive curvature. It is lower-unbounded if and only if the orbit is unstable, where a 1-parameter subgroup (destabilizing 1-PSG) in the Hilbert-Mumford criterion is the unboundedness certificate that generalizes a Hall blocker and a shrunk subspace. As mentioned in Bürgisser et al. [13], it is a great challenge to design polynomial-time algorithms for several noncommutative optimization problems, such as (un)stability determination, moment polytope membership, and orbit-closure intersection, which will bring fruitful applications to broader areas of mathematical sciences. Many of them involve (un)bounded determination of Kempf-Ness functions, though our current knowledge on such problems is limited.

Motivated by these considerations, we study minimization of lower-unbounded convex functions on Hadamard manifolds. Even in the Euclidean setting $M = R^{n}$ , there are few works (see, e.g., Auslender [4] and Obuchowska [50]) on such study. We focus on asymptotic behavior of the simplest algorithm—gradient descent. Accompanying this, we also consider its continuous version—gradient flow, that is, a trajectory produced by the differential equation $\dot{x} (t) = - \nabla f (x (t))$ .

The contributions and organization of this paper are summarized as follows. We begin with a general study of the asymptotic behavior of the gradient flow/descent for an unbounded convex function f on a Hadamard manifold M. As in the Euclidean setting, the recession function (asymptotic slope) $f^{\infty}$ of f (see Hirai [30] and Kapovich et al. [40]) is a basic tool for analyzing unboundedness; it is a function defined on the boundary $M^{\infty}$ at infinity of M. Intuitively, the boundary $M^{\infty}$ is the set of all directions $ξ$ from an arbitrary fixed point $x_{0}$ , and $f^{\infty} (ξ)$ represents the slope of f along the direction $ξ$ at infinity. The Hadamard manifold M then admits a compactification $M \cup M^{\infty}$ , whose topology is called the cone topology. These notions and related manifold terminologies are summarized in Section 2.

We focus on convergence properties, with respect to the cone topology, of the gradient flow/descent for an unbounded convex function f. In Section 3, under the sufficient condition $\inf_{x \in M} ‖ \nabla f (x) ‖ > 0$ for unboundedness, we establish in Theorem 3.1 that the gradient flow $x (t)$ converges to a point of the boundary $M^{\infty}$ , together with the following min-max (inf-sup) relation:

\lim_{t \to \infty} ‖ \nabla f (x (t)) ‖ = \inf_{x \in M} ‖ \nabla f (x) ‖ = \sup_{ξ \in M^{\infty}} - f^{\infty} (ξ) = - f^{\infty} (\lim_{t \to \infty} x (t)) .

(1.2)

The limit $\lim_{t \to \infty} x (t)$ is the unique minimizer of $f^{\infty}$ over $M^{\infty}$ , and is a certificate of unboundedness. Further, we also show in Theorem 3.7 that the same result holds for the sequence $x_{i}$ produced by gradient descent applied to an L-smooth convex function f with step-size $1 / L$ . These are the core results of the paper that drive the subsequent arguments.

Even in the Euclidean setting $M = R^{n}$ , these convergence results on the gradient flow/descent seem new, and bring an interesting ramification (Theorem 3.15): both $\nabla f (x (t))$ and $\nabla f (x_{i})$ converge to the minimum-norm point $p^{*}$ of the gradient space $\bar{\nabla f (R^{n})}$ (which is convex). This means that gradient descent is interpreted as a minimum-norm point algorithm in the gradient space. Other interesting connections to and implications for Hessian Riemannian gradient flow (Alvarez et al. [2]), mirror descent (Nemirovsky and Yudin [49]), and geometric programming are also mentioned.
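As a small numerical illustration of this interpretation (a sketch under stated assumptions, not taken from the paper), the following Python code runs gradient descent with step-size $1 / L$ on the log-sum-exp function $f (x) = \log \sum_{i} e^{〈 a_{i}, x 〉}$ in $R^{2}$ . The vectors $a_{i}$ are hypothetical data chosen for illustration. For this f the gradients are convex combinations of the $a_{i}$ , so the closure of the gradient space is the triangle spanned by them, whose minimum-norm point is $(1, 0)$ . The constant $L = \max_{i} ‖ a_{i} ‖^{2}$ is a standard upper bound on the Hessian norm of such a function and is an assumption of the sketch.

```python
import math

# Hypothetical data: f(x) = log(sum_i exp(<a_i, x>)); 0 is not in conv{a_i},
# so f is unbounded below and inf ||grad f|| = 1 (min-norm point (1, 0)).
A = [(1.0, 1.0), (1.0, -1.0), (3.0, 0.0)]

def softmax_weights(x):
    """Weights p_i proportional to exp(<a_i, x>), computed stably."""
    z = [a[0] * x[0] + a[1] * x[1] for a in A]
    m = max(z)
    w = [math.exp(v - m) for v in z]
    s = sum(w)
    return [v / s for v in w]

def grad(x):
    """Gradient of f: the convex combination sum_i p_i a_i."""
    p = softmax_weights(x)
    return (sum(pi * a[0] for pi, a in zip(p, A)),
            sum(pi * a[1] for pi, a in zip(p, A)))

def gradient_descent(x0, steps):
    """Plain gradient descent with step-size 1/L, L = max_i ||a_i||^2 (assumed bound)."""
    L = max(a[0] ** 2 + a[1] ** 2 for a in A)
    x = x0
    for _ in range(steps):
        g = grad(x)
        x = (x[0] - g[0] / L, x[1] - g[1] / L)
    return x

x = gradient_descent((0.0, 0.5), 500)
g = grad(x)
grad_norm = math.hypot(g[0], g[1])
r = math.hypot(x[0], x[1])
direction = (x[0] / r, x[1] / r)  # normalized iterate, a proxy for the limit in the cone topology
print("gradient:", g, "norm:", grad_norm)
print("direction of x_k:", direction)
```

In this instance the iterates diverge while the gradient approaches $(1, 0)$ and the normalized iterate approaches $(- 1, 0)$ , which is consistent with the statement that gradient descent acts as a minimum-norm point algorithm in the gradient space.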

In Section 4, we present applications. In Section 4.1, we deal with the norm-minimization problem for a reductive group action on a complex vector/projective space. As mentioned, this is the problem of minimizing the Kempf-Ness function $f_{v}$ associated with an orbit of the action. Then, gradient descent is essentially the first-order algorithm in Bürgisser et al. [13]. Applying our results, we show that for the unstable case the trajectory of the first-order algorithm converges, in cone topology, to the unique minimizer of $f_{v}^{\infty}$ , which yields a destabilizing 1-PSG in the Hilbert-Mumford criterion. Further, the spectrum of the moment map (= transported gradient of $f_{v}$ ) along the trajectory converges to the minimum-norm point of the moment polytope $Δ_{v}$ . For the gradient-flow setting, we reveal the connection to the theory of the moment-weight inequality for reductive group actions, developed by Georgoulas et al. [24], building upon the earlier work by Kempf, Kirwan, Mumford, and Ness in geometric invariant theory (GIT) and the recent work by Chen and Sun [15, section 4] in K-stability. Specifically, the weak duality $‖ \nabla f (x) ‖ \geq - f^{\infty} (ξ)$ in (1.2) becomes the moment-weight inequality, and the strong duality via the gradient flow can explain important parts of their theory. It may be fair to say that our results in Section 3 extract and discretize convex optimization ingredients of their theory.

In Section 4.2, we focus on the left-right action $S L_{n} (C) \times S L_{m} (C) ∋ (g, h) \mapsto g A h^{†}$ on a matrix tuple $A = (A_{1}, A_{2}, \dots, A_{N})$ , which corresponds to the operator scaling problem. In this setting, the middle equality in (1.2) is interpreted as a duality theorem for the scalability limitation (Theorem 4.20), which sharpens Gurvits’ characterization in the inf-sup form. We then study the limit of the gradient flow/descent for the Kempf-Ness function $(g, h) \mapsto \log ‖ g A h^{†} ‖$ . Our focus is on the unscalable case, whereas the scalable case was studied in detail by Kwok et al. [45]. We show in Theorems 4.24 and 4.27 that the minimum-norm point of the moment polytope $Δ_{A}$ and the limit of the gradient flow/descent are characterized by a certain simultaneous block-triangularization of $A = (A_{1}, A_{2}, \dots, A_{N})$ , which is a vector-space generalization of the classical Dulmage-Mendelsohn decomposition (DM-decomposition) (Dulmage and Mendelsohn [17]) of a bipartite graph. More specifically, the sequence of (normalized) scaling tuples $g_{k} A h_{k}^{†} / ‖ g_{k} A h_{k}^{†} ‖$ along the gradient descent converges to a block-diagonal matrix modulo the left-right unitary group action, where the block structure is determined by our generalized DM-decomposition. This answers the gradient-descent variant of an open question by Garg and Oliveira [21, section 6] asking about the asymptotic behavior of the operator Sinkhorn algorithm for unscalable instances. Finding this block structure itself is significant. We partially eliminate the unitary indeterminacy from $g_{k} A h_{k}^{†}$ , and exploit a convergent sequence to a coarse block-triangular structure (Theorem 4.28). This leads to a new construction of a shrunk subspace (certificate of unscalability) by gradient descent combined with the rounding procedure in Franks et al. [19].

In Section 4.3, for a special case of $N = 2$ , we reveal that our DM-decomposition of $(A_{1}, A_{2})$ coarsens and determines the well-known Kronecker canonical form of a matrix pencil $s A_{1} + A_{2}$ . The Kronecker form plays important roles in systems analysis by a differential-algebraic equation (DAE) $A_{1} \dot{u} (t) + A_{2} u (t) = 0$ . Its computation has been studied for a long time in the literature of numerical computation (see, e.g., Demmel and Kågström [16] and Van Dooren [59]). Our convergence result (Theorem 4.33) suggests a new iterative method for determining the Kronecker structure, which is based on simple gradient descent and is conceptually different from the existing ones.

These results may be positioned as attempts at detecting, by algorithms in M, hidden structures in the boundary $M^{\infty}$ at infinity, which have been little studied so far. We hope that our attempts lead to more serious studies from a computational complexity perspective. Particularly, it is an important future direction to improve the present convergence-type results to the ones having explicit iteration complexity.

After the submission of this paper, there have been several subsequent developments (Hirai [31, 32], Ohta [51], Sakabe [54], Sakabe et al. [55]).

2. Preliminaries

Let $R$ and $R_{+}$ denote the sets of real and nonnegative real numbers, respectively. We often add to $R$ and $R_{+}$ the infinity elements $\pm \infty$ , where the topology and ordering $\leq$ are extended in the usual way. Let $C$ denote the set of complex numbers $z = x + i y$ , where $\bar{z}$ denotes the complex conjugate $x - i y$ and $Re z$ denotes the real part x. The same notation is used for a complex vector $ζ = u + i v \in C^{n}$ with $u, v \in R^{n}$ as $\bar{ζ} = u - i v$ . For a matrix A over $C$ , let $A^{†}$ denote the conjugate transpose. For sets I and J of row indices and column indices of A, let A[I, J] denote the submatrix of A with row indices in I and column indices in J. For two matrices A, B (of possibly different sizes), let $A \oplus B$ denote the block-diagonal matrix of block-diagonals A, B in order. For a vector $p \in R^{n}$ , let $diag p$ denote the $n \times n$ diagonal matrix with ${(diag p)}_{i i} = p_{i}$ .

The general linear group $G L (n, C)$ and the special linear group $S L (n, C)$ over $C$ are simply denoted by $G L_{n}$ and $S L_{n}$ , respectively. The unitary group $U (n)$ and the special unitary group $S U (n)$ are denoted by $U_{n}$ and $S U_{n}$ , respectively. For a finite-dimensional vector space V over $C$ , let $G L (V)$ denote the group of linear isomorphisms on V.

For a positive integer n, let $[n] ≔ {1, 2, \dots, n}$ . For $X \subseteq [n]$ , let $1_{X} \in R^{n}$ be defined by ${(1_{X})}_{i} = 1$ if $i \in X$ and 0 otherwise, where $1_{[n]}$ is simply written as $1$ .

A sequence ${(x_{i})}_{i = 0, 1, \dots,}$ and function ${(x (t))}_{t \in R_{+}}$ are simply denoted by $x_{i}$ and $x (t)$ , respectively. For a real-valued sequence $a_{i}$ and continuous function $h (t)$ , we will use several times the following:

\underset{i \to \infty}{lim inf} a_{i} \leq \underset{i \to \infty}{lim inf} \frac{1}{i} \sum_{j = 1}^{i} a_{j} \leq \underset{i \to \infty}{lim sup} \frac{1}{i} \sum_{j = 1}^{i} a_{j} \leq \underset{i \to \infty}{lim sup} a_{i},

(2.1)

\underset{t \to \infty}{lim inf} h (t) \leq \underset{t \to \infty}{lim inf} \frac{1}{t} \int_{0}^{t} h (s) d s \leq \underset{t \to \infty}{lim sup} \frac{1}{t} \int_{0}^{t} h (s) d s \leq \underset{t \to \infty}{lim sup} h (t) .

(2.2)

This is a little exercise in calculus. For example, the leftmost $\leq$ in (2.2) follows from this: Suppose that $α ≔ {lim inf}_{t \to \infty} h (t) \in R$ . Then $\forall ϵ > 0$ , $\exists N \geq 0$ , $\forall t \geq N$ , $h (t) \geq α - ϵ$ , and hence, $\forall t \geq N$ , $\frac{1}{t} \int_{0}^{t} h (s) d s \geq \frac{1}{t} \int_{0}^{N} h (s) d s + \frac{t - N}{t} (α - ϵ) \to_{t \to \infty} α - ϵ$ . Because $ϵ$ is arbitrary, we have ${lim inf}_{t \to \infty} \frac{1}{t} \int_{0}^{t} h (s) d s \geq α$ .
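The inequalities in (2.1) can be strict. A minimal sketch, using the hypothetical sequence $a_{i} = {(- 1)}^{i}$ and finite tails as stand-ins for the limits inferior and superior:

```python
def running_means(a):
    """Return the Cesaro means (1/i) * sum_{j<=i} a_j for i = 1, ..., len(a)."""
    means, total = [], 0.0
    for i, v in enumerate(a, start=1):
        total += v
        means.append(total / i)
    return means

# Hypothetical oscillating sequence a_i = (-1)^i for i = 1, ..., 10000.
a = [(-1.0) ** i for i in range(1, 10001)]
means = running_means(a)

# Finite tails approximate lim inf / lim sup (an approximation, not a proof).
tail_a = a[5000:]
tail_means = means[5000:]
print(min(tail_a), min(tail_means), max(tail_means), max(tail_a))
```

Here the sequence oscillates between $- 1$ and 1 while its averages tend to 0, so the outer two inequalities in (2.1) are strict for this sequence.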

2.1. Riemannian Geometry

We will utilize standard terminologies and notation on Riemannian geometry (see, e.g., Sakai [56]). See also a recent book (Boumal [8]) for the optimization perspective. We assume sufficient differentiability for manifolds, functions, maps, and vector/tensor fields on them. Let M be a Riemannian manifold. For $x \in M$ , let $T_{x} = T_{x} (M)$ denote the tangent space of M at x, where $〈 \cdot, \cdot 〉 = {〈 \cdot, \cdot 〉}_{x}$ denotes the Riemannian metric at x and $‖ \cdot ‖ ≔ \sqrt{〈 \cdot, \cdot 〉}$ denotes the associated norm. Let $S_{x} ≔ {u \in T_{x} ∣ ‖ u ‖ = 1}$ and $B_{x} ≔ {u \in T_{x} ∣ ‖ u ‖ \leq 1}$ denote the unit sphere and ball in $T_{x}$ , respectively. The angle $∠ (u, v)$ of two vectors $u, v \in T_{x}$ is defined as $\cos^{- 1} (〈 u, v 〉 / ‖ u ‖ ‖ v ‖)$ . The product $M \times M^{'}$ of two Riemannian manifolds $M, M^{'}$ is viewed as a Riemannian manifold by setting ${〈 (u, u^{'}), (v, v^{'}) 〉}_{(x, x^{'})} ≔ {〈 u, v 〉}_{x} + {〈 u^{'}, v^{'} 〉}_{x^{'}}$ .

For a path $γ : [a, b] \to M$ and $t \in [a, b]$ , let $\dot{γ} (t)$ denote the tangent vector of $γ$ at $T_{γ (t)}$ . The length of the path $γ$ is defined by $\int_{a}^{b} ‖ \dot{γ} (t) ‖ d t$ . The distance $d (x, y)$ between $x, y \in M$ is the infimum of the length of a path connecting x and y. We consider the Levi-Civita connection $\nabla$ associated with the Riemannian metric. The connection $\nabla$ determines the parallel transport $τ_{γ}^{t} : T_{γ (0)} \to T_{γ (t)}$ along any path $γ : [0, b] \to M$ with $t \in [0, b]$ , where $τ_{γ}^{- t} ≔ {(τ_{γ}^{t})}^{- 1}$ . By using the parallel transport, the covariant derivative $\nabla_{u} V$ of a vector field $V = {(V_{x})}_{x \in M}$ by $u \in T_{x}$ is given by $\nabla_{u} V ≔ (d / d t) τ_{γ}^{- t} V_{γ (t)} ∣_{t = 0}$ , where $γ$ is a path with $γ (0) = x$ and $\dot{γ} (0) = u$ .

In this paper, any manifold M is assumed to be complete. That is, the metric space $(M, d)$ is complete. Then, the distance $d (x, y)$ is always attained by a geodesic—a path $γ : [a, b] \to M$ satisfying $\nabla_{\dot{γ} (t)} \dot{γ} = 0$ for $t \in [a, b]$ . By a unit-speed geodesic ray, we mean a geodesic $γ : [0, \infty) \to M$ with $‖ \dot{γ} (0) ‖ = 1$ (and then $‖ \dot{γ} (t) ‖ = 1$ for all t). For $x \in M$ and $u \in T_{x}$ , there is a unique geodesic $γ (t)$ with $γ (0) = x$ and $\dot{γ} (0) = u$ , denoted by $\exp_{x} t u$ . By completeness of M, the map $t \mapsto \exp_{x} t u$ is defined on $R_{+}$ . This gives rise to a surjective map $\exp_{x} : T_{x} \to M$ , called the exponential map.

For a map $φ : M \to N$ , where N is another manifold, let $d φ : T_{x} (M) \to T_{φ (x)} (N)$ denote the differential of $φ$ at $x \in M$ . The differential $d f = d f_{x} : T_{x} \to R$ of a function $f : M \to R$ is given by $d f (u) = (d / d t) f (γ (t)) ∣_{t = 0}$ , where $γ$ is a path with $γ (0) = x$ and $\dot{γ} (0) = u \in T_{x}$ . The gradient $\nabla f (x) \in T_{x}$ of f is then defined via

〈 \nabla f (x), u 〉 ≔ d f (u) (u \in T_{x}) .

The covariant differentiation of the gradient vector field $\nabla f$ gives rise to the Hessian $\nabla^{2} f (x) : T_{x} \to T_{x}$ :

\nabla^{2} f (x) u ≔ \nabla_{u} \nabla f (x) (u \in T_{x}) .

(2.3)

The Hessian is a symmetric operator in the sense that $〈 \nabla^{2} f (x) u, v 〉 = 〈 \nabla^{2} f (x) v, u 〉$ .

2.1.1. Complex Projective Space.

We will consider the complex projective space as a Riemannian manifold. Let V be an n-dimensional vector space over $C$ . The complex projective space $P (V)$ over V is a quotient manifold $V ∖ {0} / \sim$ by the equivalence relation $v \sim v^{'} \Leftrightarrow v = α v^{'}$ $(\exists α \in C ∖ {0})$ . The image of v by $V ∖ {0} \to P (V)$ is denoted by [v]. A Riemannian structure on $P (V)$ is given by the Fubini-Study form as follows. Let $(\cdot, \cdot)$ be a Hermitian inner product on V. Regard V as a 2n-dimensional Euclidean space by the real inner product $R e (\cdot, \cdot)$ . This induces a Riemannian structure on the sphere $S^{2 n - 1} = {v \in V ∣ ‖ v ‖ = r}$ , where we set $r ≔ \sqrt{2}$ . Further, $U_{1} (= U (1))$ acts isometrically on $S^{2 n - 1}$ by scalar multiplication $U_{1} \times S^{2 n - 1} ∋ (e^{i θ}, v) \mapsto e^{i θ} v$ . Then $P (V)$ is viewed as the Riemannian quotient of $S^{2 n - 1}$ with respect to this action. The resulting metric on $P (V)$ is called the Fubini-Study metric. See, for example, Boumal [8, chapter 9] for Riemannian quotient manifolds.

2.2. Hadamard Manifold

A Hadamard manifold M is a simply connected complete Riemannian manifold having nonpositive sectional curvature everywhere (see Sakai [56, V.4]). For any two points in M, a geodesic connecting them is uniquely determined (up to affine rescaling). The exponential map $\exp_{x}$ is a diffeomorphism from $T_{x}$ to M. The parallel transport from $T_{x}$ to $T_{y}$ along the geodesic is simply denoted by $τ_{x \to y}$ .

In this paper, the boundary $M^{\infty}$ at infinity and the cone topology on $M \cup M^{\infty}$ play particularly important roles. See Sakai [56, V.4.2] for a quick introduction to these notions. Two unit-speed geodesic rays $γ, γ^{'} : R_{+} \to M$ are called asymptotic if $d (γ (t), γ^{'} (t)) < C$ $(t \in R_{+})$ for some constant $C > 0$ . The asymptotic relation is an equivalence relation on the set of all unit-speed geodesic rays. Let $M^{\infty}$ denote the set of all equivalence classes. Let us fix an arbitrary point $x \in M$ . Any unit vector $u \in S_{x}$ defines an asymptotic class of unit-speed geodesic ray $t \mapsto \exp_{x} t u$ . This correspondence is a bijection between $S_{x}$ and $M^{\infty}$ , and induces a topology on $M^{\infty}$ that is isomorphic to the sphere $S_{x}$ . In fact, this topology is independent of the choice of x. Further, the topologies on M and on $M^{\infty}$ are extended to $M \cup M^{\infty}$ as follows. Because $\exp_{x}$ is a diffeomorphism, it holds that $M ≃ (S_{x} \times R_{+}) / \sim_{0}$ , where $\sim_{0}$ is the equivalence relation defined by $(u, r) \sim_{0} (u^{'}, r^{'})$ $\Leftrightarrow$ $(u, r) = (u^{'}, r^{'})$ or $r = r^{'} = 0$ . With $M^{\infty} ≃ S_{x} \times {\infty}$ , we obtain the compact Hausdorff space $M \cup M^{\infty} ≃ (S_{x} \times (R_{+} \cup {\infty})) / \sim_{0}$ (isomorphic to $B_{x}$ ). This topology on $M \cup M^{\infty}$ is called the cone topology. In this topology, a sequence $x_{i}$ in M converges to $ξ \in M^{\infty}$ if and only if

$d (x, x_{i}) \to \infty$ , and
the sequence $u_{i}$ in $S_{x}$ determined by $x_{i} = \exp_{x} d (x, x_{i}) u_{i}$ converges to $u \in S_{x}$ , where the asymptotic class of geodesic $t \mapsto \exp_{x} t u$ is equal to $ξ$ .

The angle $∠^{\infty} (ξ, ξ^{'})$ of two points $ξ, ξ^{'} \in M^{\infty}$ is defined as $\sup_{x \in M} ∠ (u, u^{'})$ , where u and $u^{'}$ are the representatives of $ξ$ and $ξ^{'}$ , respectively, at $T_{x}$ . The angle defines a metric on $M^{\infty}$ , which induces a different topology. By using the angle metric on $M^{\infty}$ , we can define a metric $d^{\infty}$ on the Euclidean cone $C M^{\infty} ≔ (M^{\infty} \times R_{+}) / \sim_{0}$ of the boundary $M^{\infty}$ by $d^{\infty} {((ξ, r), (ξ^{'}, r^{'}))}^{2} = r^{2} + {(r^{'})}^{2} - 2 r r^{'} \cos ∠^{\infty} (ξ, ξ^{'})$ . This space $C M^{\infty}$ is viewed as the space of asymptotic classes of (not necessarily unit-speed) geodesic rays. It is identified with $T_{x}$ , though the metric space $(C M^{\infty}, d^{\infty})$ has a different topology from $T_{x}$ and is not necessarily a manifold. This metric space $(C M^{\infty}, d^{\infty})$ is a Hadamard space—a complete geodesic metric space satisfying the CAT(0)-inequality (Bridson and Haefliger [9]). It is uniquely geodesic, and its convexity is defined along geodesics. The unit ball $B^{\infty} = {p \in C M^{\infty} ∣ d^{\infty} (0, p) \leq 1}$ around the origin 0 is a convex set, where the origin 0 is the image of point $(ξ, 0)$ . Observe that $B^{\infty}$ can be identified with $B_{x}$ for any $x \in M$ .

2.2.1. Manifold of Positive Definite Matrices and Symmetric Space.

A representative example of a Hadamard manifold is the space $P_{n}$ of $n \times n$ positive definite Hermitian matrices (see Bridson and Haefliger [9, II.10]). The tangent space $T_{x}$ at $x \in P_{n}$ is identified with the real vector space $p_{n}$ of Hermitian matrices, and the Riemannian metric is given by ${〈 G, H 〉}_{x} ≔ tr x^{- 1} H x^{- 1} G$ . In this space, several manifold notions are explicitly written (see, e.g., Hirai et al. [34, section 5.2]). The exponential map $\exp_{x}$ at x is given by $H \mapsto x^{1 / 2} e^{x^{- 1 / 2} H x^{- 1 / 2}} x^{1 / 2}$ , where $e^{•}$ is the matrix exponential. Particularly, any geodesic issuing at x is of form $t \mapsto x^{1 / 2} e^{t x^{- 1 / 2} H x^{- 1 / 2}} x^{1 / 2}$ for some Hermitian matrix $H \in T_{x}$ with $‖ H ‖ = ‖ x^{- 1 / 2} H x^{- 1 / 2} ‖_{F} = 1$ , where $‖ \cdot ‖_{F}$ is the Frobenius norm. An explicit formula of the geodesic parallel transport $τ_{x \to y}$ is also known. We will use one special case: $τ_{x \to I} H = x^{- 1 / 2} H x^{- 1 / 2}$ .
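The formulas above can be checked numerically. The following sketch is restricted, as a simplifying assumption, to real symmetric $2 \times 2$ matrices (the text treats Hermitian $n \times n$ matrices), and all helper names are hypothetical. It also uses the standard distance formula $d (x, y) = ‖ \log (x^{- 1 / 2} y x^{- 1 / 2}) ‖_{F}$ for this metric, which is not stated in the text above. The check is that a geodesic $t \mapsto \exp_{x} t H$ travels distance $t ‖ H ‖$ .

```python
import math

def matmul(P, Q):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(tuple(sum(P[i][k] * Q[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def sym_fun(S, fn):
    """Apply fn to the eigenvalues of a real symmetric 2x2 matrix S (Jacobi rotation)."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn  # eigenvalue for (cs, sn)
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs  # eigenvalue for (-sn, cs)
    f1, f2 = fn(lam1), fn(lam2)
    off = (f1 - f2) * cs * sn
    return ((f1 * cs * cs + f2 * sn * sn, off),
            (off, f1 * sn * sn + f2 * cs * cs))

def congruence(x, S):
    """x^{-1/2} S x^{-1/2}; for a tangent vector S this is the transport to I."""
    ri = sym_fun(x, lambda v: 1.0 / math.sqrt(v))
    return matmul(matmul(ri, S), ri)

def frob(S):
    return math.sqrt(sum(S[i][j] ** 2 for i in range(2) for j in range(2)))

def norm_at(x, H):
    """Riemannian norm ||H||_x = ||x^{-1/2} H x^{-1/2}||_F."""
    return frob(congruence(x, H))

def exp_map(x, H):
    """exp_x(H) = x^{1/2} exp(x^{-1/2} H x^{-1/2}) x^{1/2}."""
    r = sym_fun(x, math.sqrt)
    return matmul(matmul(r, sym_fun(congruence(x, H), math.exp)), r)

def dist(x, y):
    """d(x, y) = ||log(x^{-1/2} y x^{-1/2})||_F (standard formula, assumed here)."""
    return frob(sym_fun(congruence(x, y), math.log))

x = ((2.0, 1.0), (1.0, 1.0))   # a positive definite base point
H = ((1.0, 0.5), (0.5, -2.0))  # a symmetric tangent vector at x
t = 0.7
tH = tuple(tuple(t * v for v in row) for row in H)
y = exp_map(x, tH)
gap = abs(dist(x, y) - t * norm_at(x, H))
print("d(x, exp_x(tH)) - t*||H||_x =", gap)
```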

Any totally geodesic subspace M of $P_{n}$ is also a Hadamard manifold. Here, a submanifold $M \subseteq P_{n}$ is said to be totally geodesic if every geodesic in M is also geodesic in $P_{n}$ . It is known (Bridson and Haefliger [9, II.10.58]) that for a connected Lie group $G \subseteq G L_{n}$ defined by polynomials and satisfying $G = G^{†}$ , the submanifold $P_{n} \cap G$ is a totally geodesic subspace. Such a group G is called self-adjoint (or symmetric), and is a reductive algebraic group (see Wallach [61, sections 2.2, 3.1.3, and 3.2]). Here $P_{n} \cap G$ is known as a symmetric space (of nonpositive curvature). A particular case we will face is $G = S L_{n}$ and $P_{n}^{1} ≔ P_{n} \cap S L_{n} = {x \in P_{n} ∣ \det x = 1}$ , where the tangent space $T_{I} (P_{n}^{1})$ at I is given by $p_{n}^{1} ≔ {H \in p_{n} ∣ tr H = 0}$ . It is known (Bridson and Haefliger [9, II.10.71]) that the boundary $M^{\infty}$ at infinity of $M = P_{n} \cap G$ becomes a spherical building, and the associated Euclidean cone $C M^{\infty}$ becomes a Euclidean building. We will consider convex functions on these spaces in Section 4.

2.3. Convex Function

In a Hadamard manifold M, by uniqueness of geodesics, convexity is naturally introduced. A function $f : M \to R$ is said to be convex if for every geodesic $γ : [a, b] \to M$ the one-dimensional function $f \circ γ : [a, b] \to R$ is convex. We will assume twice differentiability for the smoothness of f. Then the convexity condition is equivalent to ${(f \circ γ)}^{″} (t) \geq 0$ . From ${(f \circ γ)}^{″} (t) = (d / d t) 〈 \nabla f (γ (t)), \dot{γ} (t) 〉 = 〈 \nabla_{\dot{γ} (t)} \nabla f (γ (t)), \dot{γ} (t) 〉$ , convexity of f is equivalent to positive semidefiniteness of the Hessian $\nabla^{2} f (x)$ : $〈 \nabla^{2} f (x) u, u 〉 \geq 0$ for all $x \in M, u \in T_{x}$ . We also consider the Lipschitz condition for the gradient vector field $\nabla f$ . For $L \in R_{+}$ , a function $f : M \to R$ is said to be L-smooth if $〈 \nabla^{2} f (x) u, u 〉 \leq L 〈 u, u 〉$ for all $x \in M$ , $u \in T_{x}$ . That is, the operator norm $‖ \nabla^{2} f (x) ‖$ is bounded by L for all $x \in M$ .

We next introduce an important tool for studying the unboundedness of convex functions. Let us fix $x_{0} \in M$ . The recession function (asymptotic slope) $f^{\infty} = f_{x_{0}}^{\infty} : M^{\infty} \to R \cup {\infty}$ (Hirai [30], Kapovich et al. [40]) is defined by

f_{x_{0}}^{\infty} (u) ≔ \lim_{s \to \infty} \frac{f (\exp_{x_{0}} s u) - f (x_{0})}{s} = \lim_{s \to \infty} \frac{f (\exp_{x_{0}} s u)}{s}

= \lim_{s \to \infty} \frac{d}{d s} f (\exp_{x_{0}} s u) (u \in S_{x_{0}} ≃ M^{\infty}),

(2.4)

where the limits exist in $R \cup {\infty}$ because of convexity of f (monotonicity of $s \mapsto (f (\exp_{x_{0}} s u) - f (x_{0})) / s$ and of $s \mapsto (d / d s) f (\exp_{x_{0}} s u)$ ) and the last equality follows from (2.2) for $h (t) ≔ (d / d t) f (\exp_{x_{0}} t u)$ . It is shown by Kleiner and Leeb [44, lemma 2.10] that if $t \mapsto \exp_{x_{0}} t u$ and $t \mapsto \exp_{y_{0}} t v$ are asymptotic, then $f_{x_{0}}^{\infty} (u) = f_{y_{0}}^{\infty} (v)$ .¹ Hence, the recession function $f^{\infty}$ is regarded as $M^{\infty} \to R \cup {\infty}$ . Further, $f^{\infty}$ is naturally extended to $C M^{\infty} \to R \cup {\infty}$ by allowing u to be any vector in $T_{x_{0}} ≃ C M^{\infty}$ . If $M = R^{n}$ , then $C M^{\infty} = R^{n}$ and $f^{\infty}$ matches the recession function in Euclidean convex analysis (see Rockafellar [53, section 8] and Hiriart-Urruty and Lemaréchal [35, section 3.2]). As in the Euclidean case, the following properties hold:

\begin{array}{ll} \inf_{ξ \in M^{\infty}} f^{\infty} (ξ) < 0 & \Rightarrow \inf_{x \in M} f (x) = - \infty . \\ \inf_{ξ \in M^{\infty}} f^{\infty} (ξ) > 0 & \Rightarrow \exists x^{*} \in M : f (x^{*}) = \inf_{x \in M} f (x) . \end{array}

(2.5)

The second property is included in Kapovich et al. [40, lemma 3.2 (vi)]. Moreover, it is known (Hirai [30]) that $f^{\infty}$ is a positively homogeneous convex function on Hadamard space $C M^{\infty}$ .

In particular, both $\inf_{ξ \in M^{\infty}} f^{\infty} (ξ) < 0$ and $\inf_{x \in M} ‖ \nabla f (x) ‖ > 0$ are sufficient conditions for unboundedness of f. In fact, they are equivalent.

Proposition 2.1

(Kapovich et al. [40, Lemma 3.2 (iii), Lemma 3.4]; see also Hirai [30]).

(1) $\inf_{ξ \in M^{\infty}} f^{\infty} (ξ) < 0$ if and only if $\inf_{x \in M} ‖ \nabla f (x) ‖ > 0$ .
(2) If $\inf_{ξ \in M^{\infty}} f^{\infty} (ξ) < 0$ , then there uniquely exists $ξ^{*} \in M^{\infty}$ with $f^{\infty} (ξ^{*}) = \inf_{ξ \in M^{\infty}} f^{\infty} (ξ)$ .

The existence in Proposition 2.1 part (2) follows from the lower semicontinuity of $f^{\infty}$ on the compact space $M^{\infty}$ with respect to the cone topology. The uniqueness of $ξ^{*}$ in part (2) can be seen from the positively homogeneous convexity of $f^{\infty}$ on $C M^{\infty}$ , as in the Euclidean case.²

As a sharpening of the easier part (the only-if part) in part (1), we here mention the following weak duality relation between the gradient-norm and the recession function.

Lemma 2.2

(Weak Duality). $\inf_{x \in M} ‖ \nabla f (x) ‖ \geq \sup_{ξ \in B^{\infty}} - f^{\infty} (ξ) .$

Proof.

For $x \in M$ and $ξ \in B_{x} ≃ B^{\infty}$ , it holds that

f^{\infty} (ξ) = \lim_{t \to \infty} \frac{f (\exp_{x} t ξ) - f (x)}{t} \geq \lim_{t \to 0} \frac{f (\exp_{x} t ξ) - f (x)}{t} = 〈 \nabla f (x), ξ 〉 \geq - ‖ \nabla f (x) ‖,

where the first inequality follows from convexity of f (monotonicity of $t \mapsto (f (\exp_{x} t ξ) - f (x)) / t$ ) and the last inequality follows from Cauchy-Schwarz and $‖ ξ ‖ \leq 1$ . □
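Weak duality can be observed numerically in the Euclidean case $M = R^{2}$ . The sketch below uses a hypothetical log-sum-exp function $f (x) = \log \sum_{i} e^{〈 a_{i}, x 〉}$ , for which the recession function is $f^{\infty} (u) = \max_{i} 〈 a_{i}, u 〉$ (a standard fact for this function class, assumed here). It compares $‖ \nabla f (x) ‖$ with $- f^{\infty} (u)$ over a grid of points and unit directions, and also compares the difference quotient in (2.4) at a large finite s with the closed form.

```python
import math

# Hypothetical data for f(x) = log(sum_i exp(<a_i, x>)) on R^2.
A = [(1.0, 1.0), (1.0, -1.0), (3.0, 0.0)]

def f(x):
    z = [a[0] * x[0] + a[1] * x[1] for a in A]
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def grad_norm(x):
    """Norm of the gradient sum_i p_i a_i with softmax weights p."""
    z = [a[0] * x[0] + a[1] * x[1] for a in A]
    m = max(z)
    w = [math.exp(v - m) for v in z]
    s = sum(w)
    g0 = sum(wi * a[0] for wi, a in zip(w, A)) / s
    g1 = sum(wi * a[1] for wi, a in zip(w, A)) / s
    return math.hypot(g0, g1)

def recession(u):
    """Closed form f^inf(u) = max_i <a_i, u> (assumed standard fact)."""
    return max(a[0] * u[0] + a[1] * u[1] for a in A)

def slope(u, s=1e4):
    """Difference quotient (f(s u) - f(0)) / s from (2.4) at a large finite s."""
    return (f((s * u[0], s * u[1])) - f((0.0, 0.0))) / s

directions = [(math.cos(2 * math.pi * k / 36), math.sin(2 * math.pi * k / 36))
              for k in range(36)]
points = [(0.5 * i, 0.5 * j) for i in range(-6, 7) for j in range(-6, 7)]

# Lemma 2.2 on the sampled grid: ||grad f(x)|| >= -f^inf(u).
weak_duality_holds = all(grad_norm(x) >= -recession(u) - 1e-12
                         for x in points for u in directions)
max_slope_error = max(abs(slope(u) - recession(u)) for u in directions)
best_lower_bound = max(-recession(u) for u in directions)
print(weak_duality_holds, max_slope_error, best_lower_bound)
```

In this instance the best lower bound over the sampled directions is attained at $u = (- 1, 0)$ , and it agrees with the infimum of the gradient norm, which is 1 for this data.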

In Section 3, we show, via the gradient flow of f, that the equality (strong duality) always holds. This technique may be viewed as a refinement of the proof of the if-part in Kapovich et al. [40, proposition 2.1 (1)], in which the limit of the normalized gradient flow of f constructs $ξ$ with $f^{\infty} (ξ) < 0$ . A similar gradient-flow approach can be found in the setting of GIT (Chen and Sun [15], Georgoulas et al. [24], Woodward [62]) (see Section 4.1).

3. Asymptotic Behavior of Gradient Flow

3.1. Continuous-Time Gradient Flow

Throughout, M denotes a Hadamard manifold. Let $f : M \to R$ be a twice differentiable convex function. Consider the following differential equation—the gradient flow of f,

\frac{d x (t)}{d t} = - \nabla f (x (t)), x (0) = x_{0} .

(3.1)

The value of f is nonincreasing along the trajectory $x (t)$; see Lemma 3.2 part (2) below. In fact, if a minimizer of f exists, then $x (t)$ converges to a minimizer. This convergence is known for the general setting of Hadamard spaces (see, e.g., Bačák [5, theorem 5.1.16] and Mayer [47, theorem 2.41]). Our focus is on the case where f is unbounded below, particularly the case where the minimum gradient-norm is positive. We establish the following convergence of an unbounded gradient flow and strong duality between the gradient-norm and the recession function.

Theorem 3.1.

Suppose that $κ^{*} ≔ \inf_{x \in M} ‖ \nabla f (x) ‖ > 0$ . Let $x (t)$ be the solution of (3.1).

(1) $‖ \nabla f (x (t)) ‖$ converges to the minimum gradient-norm $κ^{*}$ , and
(2) $x (t)$ converges, in cone topology, to the unique minimizer $ξ^{*}$ of $f^{\infty}$ over $M^{\infty}$ ,

where the following equality holds:

\lim_{t \to \infty} ‖ \nabla f (x (t)) ‖ = \inf_{x \in M} ‖ \nabla f (x) ‖ = \sup_{ξ \in M^{\infty}} - f^{\infty} (ξ) = - f^{\infty} (\lim_{t \to \infty} x (t)) .

(3.2)
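To illustrate the statement numerically, the sketch below assumes $M = R^{2}$ and the illustrative function $f (x) = \log (e^{x_{1}} + e^{x_{2}})$ (neither is taken from the text), for which $κ^{*} = 1 / \sqrt{2}$ and $ξ^{*} = - (1, 1) / \sqrt{2}$. The flow (3.1) is approximated by explicit Euler steps, which is only a crude stand-in for the exact solution; `euler_flow` and the step size `h` are hypothetical choices.

```python
import math

def grad_f(x):
    """Gradient of f(x) = log(exp(x[0]) + exp(x[1])) (softmax weights)."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def euler_flow(x0, t_end, h=0.01):
    """Approximate x(t_end) for dx/dt = -grad f(x), x(0) = x0, by explicit Euler."""
    x = list(x0)
    for _ in range(int(round(t_end / h))):
        g = grad_f(x)
        x = [xi - h * gi for xi, gi in zip(x, g)]
    return x

x0 = [3.0, -1.0]
xT = euler_flow(x0, t_end=200.0)
grad_norm = math.hypot(*grad_f(xT))          # should approach kappa* = 1/sqrt(2)
dist = math.hypot(xT[0] - x0[0], xT[1] - x0[1])
direction = [(xT[0] - x0[0]) / dist, (xT[1] - x0[1]) / dist]  # approaches xi*
```

In this example the gradient norm settles quickly, whereas the direction seen from $x_{0}$ approaches $ξ^{*}$ only at rate $O (1 / t)$, because the initial transient is averaged out slowly.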

We should mention related results. In the general setting of Hadamard space X, Caprace and Lytchak [14, proposition 4.2] showed that the gradient-flow curve of a Lipschitz convex function with $κ^{*} > 0$ converges to a point in the boundary $X^{\infty}$ of X. Their proof relies on a very general result of Karlsson and Margulis [41, theorem 2.1] for semicontraction semigroups in uniformly convex spaces. Here it is well-known³ that the gradient-flow semigroup $ϕ_{t}$ satisfies the (semi)contraction property:

d (ϕ_{t} (x), ϕ_{t} (y)) \leq d (x, y) (t \in R_{+}, x, y \in M),

(3.3)

where $ϕ_{t} (x)$ is the solution of (3.1) with initial point $x (0) = x$. If the velocity of escape

κ^{*} (x) ≔ \underset{t \to \infty}{lim sup} \frac{d (ϕ_{t} (x), x)}{t}

(3.4)

is positive, then the result of Karlsson and Margulis [41, theorem 2.1] is applicable for convergence of $ϕ_{t} (x)$ to $M^{\infty}$; Caprace and Lytchak [14] actually showed that $κ^{*} > 0$ implies $κ^{*} (x) > 0$. Although one can deduce the entire statement of Theorem 3.1 from this with more effort, we take a different approach that relies neither on Karlsson and Margulis [41] nor on the contraction property (3.3). As mentioned after Lemma 2.2, our proof is partly inspired by an idea in Kapovich et al. [40], but it directly establishes the relation (3.2). An advantage of this approach is that it can adapt to the discrete setting in Section 3.2.

We start with the following well-known properties of gradient flows.

Lemma 3.2.

(1) The solution $x (t)$ of (3.1) is defined on $R_{+}$ .
(2) $t \mapsto f (x (t))$ is nonincreasing.
(3) $t \mapsto ‖ \nabla f (x (t)) ‖$ is nonincreasing.

We describe a proof because the intermediate equations will be used.

Proof.

Lemma 3.2 part (2) follows from $(d / d t) f (x (t)) = 〈 \nabla f (x (t)), \dot{x} (t) 〉 = - ‖ \nabla f (x (t)) ‖^{2} \leq 0 .$

Part (3) follows from $(d / d t) ‖ \nabla f (x (t)) ‖^{2} = - 2 〈 \nabla^{2} f (x (t)) \dot{x} (t), \dot{x} (t) 〉 \leq 0$ by convexity of f (positive semidefiniteness of $\nabla^{2} f (x (t))$ ).

Part (1). Suppose that $x (t)$ is defined on $[0, T)$ for finite $T > 0$ . For $0 \leq t \leq t^{'} < T$ , it holds that

d (x (t), x (t^{'})) \leq \int_{t}^{t^{'}} ‖ \dot{x} (s) ‖ d s \leq ‖ \nabla f (x_{0}) ‖ (t^{'} - t),

where the second inequality follows from part (3). Therefore, $x (t)$ is Cauchy for $t \to T$. Because M is complete, the limit $x^{*} ≔ \lim_{t \to T} x (t)$ exists in M. Then $x (t)$ is connected to the solution of $\dot{y} (t) = - \nabla f (y (t))$, $y (0) = x^{*}$, and is defined on $[0, T + ϵ)$ for some $ϵ > 0$. If we take maximal T, it must be $T = \infty$. □
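For a concrete instance of Lemma 3.2, consider the illustrative function $f (x) = \log (e^{x_{1}} + e^{x_{2}})$ on $M = R^{2}$ (an assumption made only for this example, with the auxiliary symbols $p_{k}$ and $δ$ introduced here). The flow (3.1) then reduces to a scalar equation for the difference $δ ≔ x_{1} - x_{2}$:

```latex
\begin{aligned}
\dot{x}_{k}(t) &= -p_{k}(t), \qquad p_{k} := \frac{e^{x_{k}}}{e^{x_{1}} + e^{x_{2}}} \quad (k = 1, 2), \\
\dot{\delta}(t) &= -(p_{1} - p_{2}) = -\tanh \frac{\delta(t)}{2}, \\
\| \nabla f(x(t)) \|^{2} &= p_{1}^{2} + p_{2}^{2} = \frac{1 + \tanh^{2}(\delta(t)/2)}{2}, \\
\frac{d}{dt} f(x(t)) &= -\| \nabla f(x(t)) \|^{2} \leq -\frac{1}{2}.
\end{aligned}
```

Because $| δ (t) |$ is nonincreasing and tends to 0, the gradient norm decreases monotonically to $1 / \sqrt{2}$ and $f (x (t))$ decreases without bound, in agreement with parts (2) and (3).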

Proof of Theorem 3.1.

Let $κ ≔ \lim_{t \to \infty} ‖ \nabla f (x (t)) ‖ \geq κ^{*} > 0$ . First, we note

f (x (t)) - f (x_{0}) = \int_{0}^{t} \frac{d}{d τ} f (x (τ)) d τ = - \int_{0}^{t} ‖ \nabla f (x (τ)) ‖^{2} d τ \leq - κ^{2} t,

(3.5)

d (x (t), x_{0}) \leq \int_{0}^{t} ‖ \dot{x} (τ) ‖ d τ = \int_{0}^{t} ‖ \nabla f (x (τ)) ‖ d τ,

(3.6)

where the last inequality in (3.5) follows from Lemma 3.2 part (3). Then it holds that $d (x (t), x_{0}) \to \infty$ for $t \to \infty$. Otherwise, $x (t)$ has an accumulation point $x^{*}$ in M and $f (x^{*}) = - \infty$ by (3.5), contradicting $f (x^{*}) \in R$.

Define $u (t) \in S_{x_{0}}$ via $x (t) = \exp_{x_{0}} d (x (t), x_{0}) u (t) .$ For $s \in (0, d (x (t), x_{0})]$ , by convexity of f along the geodesic from $x_{0}$ to $x (t)$ , it holds that

f (\exp_{x_{0}} s u (t)) - f (x_{0}) \leq \frac{s}{d (x (t), x_{0})} (f (x (t)) - f (x_{0})) .

From this, we have

\begin{array}{l} \frac{f (\exp_{x_{0}} s u (t)) - f (x_{0})}{s} \leq \frac{f (x (t)) - f (x_{0})}{d (x (t), x_{0})} \leq - \frac{\int_{0}^{t} ‖ \nabla f (x (τ)) ‖^{2} d τ}{\int_{0}^{t} ‖ \nabla f (x (τ)) ‖ d τ} \\ \leq - \frac{1}{t} \int_{0}^{t} ‖ \nabla f (x (τ)) ‖ d τ \leq - κ, \end{array}

where the second inequality follows from (3.5) and (3.6), the third from the Cauchy-Schwarz inequality ${(\int_{0}^{t} F (τ) G (τ) d τ)}^{2} \leq \int_{0}^{t} F {(τ)}^{2} d τ \int_{0}^{t} G {(τ)}^{2} d τ$ for $F (τ) ≔ ‖ \nabla f (x (τ)) ‖$ and $G (τ) ≔ 1$, and the fourth from Lemma 3.2 part (3).

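Choose any convergent subsequence $u (t_{i})$ with $t_{i} \to \infty$ $(d (x (t_{i}), x_{0}) \to \infty)$ and $u (t_{i}) \to u^{*}$. Then, for each fixed $s > 0$, letting $i \to \infty$ in the above inequality and using the continuity of f, it holds that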

\frac{f (\exp_{x_{0}} s u^{*}) - f (x_{0})}{s} \leq - κ .

For $s \to \infty$ , we have $f^{\infty} (u^{*}) \leq - κ .$ Then, we have

\inf_{ξ \in M^{\infty}} f^{\infty} (ξ) \leq f^{\infty} (u^{*}) \leq - κ \leq - κ^{*} = \sup_{x \in M} - ‖ \nabla f (x) ‖ \leq \inf_{ξ \in M^{\infty}} f^{\infty} (ξ),

where we use the weak duality (Lemma 2.2) for the last inequality. This shows $κ = κ^{*}$ and proves (3.2). Because the minimizer $ξ^{*}$ of $f^{\infty}$ over $M^{\infty}$ uniquely exists (Proposition 2.1 part (2)), it must hold that $ξ^{*} = u^{*}$. We showed that any convergent subsequence $u (t_{i})$ of $u (t)$ converges to $ξ^{*}$. Because $S_{x_{0}}$ is compact, $u (t)$ itself converges to $ξ^{*}$. □

Even if $κ^{*} = 0$ , the strong duality holds (because $f^{\infty} (0) = 0$ ).

Corollary 3.3.

$\inf_{x \in M} ‖ \nabla f (x) ‖ = \sup_{ξ \in B^{\infty}} - f^{\infty} (ξ) .$

The velocity of escape (3.4) coincides with the minimum gradient-norm.

Proposition 3.4.

Suppose that $κ^{*} ≔ \inf_{x \in M} ‖ \nabla f (x) ‖ > 0$ . Let $ξ^{*} \in S_{x_{0}}$ denote the representative of the unique minimizer of $f^{\infty}$ over $M^{\infty} ≃ S_{x_{0}}$ . Then the following hold:

(1) $\lim_{t \to \infty} \frac{d (x_{0}, x (t))}{t} = κ^{*}$ .
(2) $\lim_{t \to \infty} \frac{\exp_{x_{0}}^{- 1} x (t)}{t} = κ^{*} ξ^{*}$ .

Proof.

Part (1). For $t > s \geq 0$ , it holds that $d (x (s), x (t)) \leq \int_{s}^{t} ‖ \nabla f (x (τ)) ‖ d τ \leq ‖ \nabla f (x (s)) ‖ (t - s)$ (by Lemma 3.2 part (3)). Hence,

\underset{t \to \infty}{lim sup} \frac{d (x_{0}, x (t))}{t} = \underset{t \to \infty}{lim sup} \frac{d (x (s), x (t))}{t - s} \leq ‖ \nabla f (x (s)) ‖ \underset{s \to \infty}{\to} κ^{*},

(3.7)

where the convergence of $‖ \nabla f (x (s)) ‖$ to $κ^{*}$ follows from Theorem 3.1 part (1). On the other hand, by taking the unit-speed geodesic $γ$ from $x (s)$ to $x (t)$, we have

\begin{array}{l} - ‖ \nabla f (x (t)) ‖^{2} (t - s) \geq - \int_{s}^{t} ‖ \nabla f (x (τ)) ‖^{2} d τ = f (x (t)) - f (x (s)) \\ \geq 〈 \dot{γ} (0), \nabla f (x (s)) 〉 d (x (s), x (t)) \geq - ‖ \nabla f (x (s)) ‖ d (x (s), x (t)), \end{array}

where the first inequality follows from Lemma 3.2 part (3), the equality is obtained as in (3.5), the second inequality follows from convexity of f along $γ$, and the last from the Cauchy-Schwarz inequality. Thus, it holds that

\begin{array}{l} \underset{t \to \infty}{lim inf} \frac{d (x_{0}, x (t))}{t} \geq \underset{t \to \infty}{lim inf} \frac{d (x (t), x (s)) - d (x_{0}, x (s))}{t} = \underset{t \to \infty}{lim inf} \frac{d (x (t), x (s))}{t - s} \\ \geq \frac{\lim_{t \to \infty} ‖ \nabla f (x (t)) ‖^{2}}{‖ \nabla f (x (s)) ‖} = \frac{{(κ^{*})}^{2}}{‖ \nabla f (x (s)) ‖} \underset{s \to \infty}{\to} κ^{*} . \end{array}

(3.8)

By (3.7) and (3.8), we have

κ^{*} \leq \underset{t \to \infty}{lim inf} \frac{d (x_{0}, x (t))}{t} \leq \underset{t \to \infty}{lim sup} \frac{d (x_{0}, x (t))}{t} \leq κ^{*} .

Part (2). By Theorem 3.1, it holds that $\lim_{t \to \infty} \frac{\exp_{x_{0}}^{- 1} x (t)}{d (x_{0}, x (t))} = ξ^{*}$ . Therefore, by part (1), we have

\lim_{t \to \infty} \frac{\exp_{x_{0}}^{- 1} x (t)}{t} = \lim_{t \to \infty} \frac{\exp_{x_{0}}^{- 1} x (t)}{d (x_{0}, x (t))} \frac{d (x_{0}, x (t))}{t} = κ^{*} ξ^{*} . □
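The two limits of Proposition 3.4 can be observed numerically. The sketch below again assumes $M = R^{2}$ and the illustrative function $f (x) = \log (e^{x_{1}} + e^{x_{2}})$, so that $κ^{*} = 1 / \sqrt{2}$ and $κ^{*} ξ^{*} = - (1 / 2, 1 / 2)$, and it replaces the exact flow by explicit Euler steps; the function `escape_statistics` and its step size `h` are hypothetical.

```python
import math

def grad_f(x):
    """Gradient of f(x) = log(exp(x[0]) + exp(x[1])) (softmax weights)."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def escape_statistics(x0, t_end, h=0.05):
    """Return (d(x0, x(t))/t, (x(t) - x0)/t) at t = t_end for an Euler approximation."""
    x = list(x0)
    steps = int(round(t_end / h))
    for _ in range(steps):
        g = grad_f(x)
        x = [xi - h * gi for xi, gi in zip(x, g)]
    t = steps * h
    velocity = [(xi - x0i) / t for xi, x0i in zip(x, x0)]
    rate = math.hypot(*velocity)         # in R^2, d(x0, x) is the Euclidean norm
    return rate, velocity
```

For example, `escape_statistics([3.0, -1.0], 1000.0)` should return a rate close to $1 / \sqrt{2}$ and a velocity close to $- (1 / 2, 1 / 2)$; in $R^{n}$ the vector $\exp_{x_{0}}^{- 1} x (t)$ is simply $x (t) - x_{0}$.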

We next consider “convergence” of the gradient $\nabla f (x (t))$ . Because the space $T_{x (t)}$ varies, the convergence concept of $\nabla f (x (t))$ is less obvious. In our intuition, $\nabla f (x (t))$ and $ξ^{*}$ would have opposite directions in the limit. The following partially justifies this intuition.

Proposition 3.5.

\underset{t \to \infty}{lim inf} ‖ τ_{x (t) \to x_{0}} \nabla f (x (t)) + κ^{*} ξ^{*} ‖ = 0 .

Question 3.6.

Does $\lim_{t \to \infty} τ_{x (t) \to x_{0}} \nabla f (x (t)) = - κ^{*} ξ^{*}$ hold?

We will see in Section 4 that this property has important consequences.

Proof of Proposition 3.5.

Let $γ_{t}$ be the unit-speed geodesic from $x_{0}$ to $x (t)$ . Let $d (t) ≔ d (x_{0}, x (t))$ . Then, by Sakai [56, chapter III, proposition 4.8 (1)], it holds that $d {(t)}^{'} = 〈 {\dot{γ}}_{t} (d (t)), \dot{x} (t) 〉 .$ Therefore, we have

\underset{t \to \infty}{lim sup} d {(t)}^{'} = \underset{t \to \infty}{lim sup} 〈 {\dot{γ}}_{t} (d (t)), \dot{x} (t) 〉 \leq \lim_{t \to \infty} ‖ \dot{x} (t) ‖ = \lim_{t \to \infty} ‖ \nabla f (x (t)) ‖ = κ^{*} .

(3.9)

On the other hand, by Proposition 3.4, it holds that $κ^{*} = {lim sup}_{t \to \infty} d (t) / t \leq {lim sup}_{t \to \infty} d {(t)}^{'}$ , where the inequality follows from (2.2) with $h (t) ≔ d^{'} (t)$ . Thus, the equality holds in (3.9). Necessarily, we have

\underset{t \to \infty}{lim sup} ∠ ({\dot{γ}}_{t} (d (t)), \nabla f (x (t))) = π .

(3.10)

By $‖ \nabla f (x (t)) ‖ \to κ^{*}$ , we have ${lim inf}_{t \to \infty} ‖ \nabla f (x (t)) + κ^{*} {\dot{γ}}_{t} (d (t)) ‖ = 0$ . With parallel transport $τ_{x (t) \to x_{0}}$ and ${\dot{γ}}_{t} (0) \to ξ^{*}$ , we have the claim. □

3.2. Discrete-Time Gradient Flow (Gradient Descent)

Next, we consider the discrete version. Suppose that $f : M \to R$ is an L-smooth convex function. Consider the following sequence:

x_{i + 1} ≔ \exp_{x_{i}} (- \frac{1}{L} \nabla f (x_{i})) (i = 0, 1, \dots) .

(3.11)

This is nothing but the trajectory generated by gradient descent with initial point $x_{0}$ and step-size $1 / L$ ; we discuss in Remark 3.13 another type of discrete gradient flow. The convergence/accumulation of $x_{i}$ to a minimizer of f can be shown under several reasonable assumptions (see, e.g., Boumal [8, theorem 11.29]). For the unbounded case, as in the continuous setting, we establish the following.

Theorem 3.7.

Suppose that $κ^{*} ≔ \inf_{x \in M} ‖ \nabla f (x) ‖ > 0$ . Let $x_{i}$ be the sequence in (3.11).

(1) $‖ \nabla f (x_{i}) ‖$ converges to the minimum gradient-norm $κ^{*}$ , and
(2) $x_{i}$ converges, in cone topology, to the unique minimizer $ξ^{*} \in M^{\infty}$ of $f^{\infty}$ .

Hence, the following holds:

\lim_{i \to \infty} ‖ \nabla f (x_{i}) ‖ = \inf_{x \in M} ‖ \nabla f (x) ‖ = \sup_{ξ \in M^{\infty}} - f^{\infty} (ξ) = - f^{\infty} (\lim_{i \to \infty} x_{i}) .

(3.12)
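A small experiment makes (3.12) tangible. The sketch assumes $M = R^{2}$, where $\exp_{x} v = x + v$, and the illustrative function $f (x) = \log \sum_{j} e^{〈 a_{j}, x 〉}$ with hypothetical data $a_{1} = (1, 0)$, $a_{2} = (0, 1)$, $a_{3} = (2, 2)$. For this f, the gradient is a convex combination of the $a_{j}$, the recession function is $f^{\infty} (u) = \max_{j} 〈 a_{j}, u 〉$, and $\max_{j} ‖ a_{j} ‖^{2} = 8$ is a valid (not tight) smoothness constant; here $κ^{*} = 1 / \sqrt{2}$.

```python
import math

A_ROWS = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]   # hypothetical data a_1, a_2, a_3
L_SMOOTH = 8.0                                   # upper bound on the Hessian eigenvalues

def grad_f(x):
    """Gradient of f(x) = log sum_j exp(<a_j, x>): a convex combination of the a_j."""
    scores = [a[0] * x[0] + a[1] * x[1] for a in A_ROWS]
    m = max(scores)                              # shift for numerical stability
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wj * a[k] for wj, a in zip(w, A_ROWS)) / total for k in range(2)]

def recession(u):
    """Recession function of this f: f^inf(u) = max_j <a_j, u>."""
    return max(a[0] * u[0] + a[1] * u[1] for a in A_ROWS)

def gradient_descent(x0, num_iters):
    """Iterate (3.11) in R^2 with step size 1/L."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad_f(x)
        x = [xk - gk / L_SMOOTH for xk, gk in zip(x, g)]
    return x

x0 = [1.0, -2.0]
x = gradient_descent(x0, 5000)
grad_norm = math.hypot(*grad_f(x))               # approaches kappa*
dist = math.hypot(x[0] - x0[0], x[1] - x0[1])
direction = [(x[0] - x0[0]) / dist, (x[1] - x0[1]) / dist]
minus_recession = -recession(direction)          # approaches kappa* as well
```

Both `grad_norm` and `minus_recession` should be close to $1 / \sqrt{2}$, the latter only up to an $O (1 / i)$ error coming from the direction estimate.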

Our original attempt to prove this was to establish the contraction property

d (ϕ_{i} (x), ϕ_{i} (y)) \leq d (x, y) (x, y \in M, i = 1, 2, \dots),

(3.13)

for the semigroup $ϕ_{i}$ of (3.11), and to apply the approach of Caprace and Lytchak [14] and Karlsson and Margulis [41]. However, we were unable to do so, and we do not know whether (3.13) is true. Note that (3.13) is true in Euclidean space $M = R^{n}$ (see, e.g., Sanz Serna and Zygalakis [57, example 1]).

The proof goes in a way analogous to Theorem 3.1. Corresponding to Lemma 3.2, the following properties hold.

Lemma 3.8.

(1) $f (x_{i + 1}) \leq f (x_{i}) - \frac{1}{L} ‖ \nabla f (x_{i + 1}) ‖^{2}$ .
(2) $‖ \nabla f (x_{i + 1}) ‖ \leq ‖ \nabla f (x_{i}) ‖$ .

In contrast to the well-known inequality $f (x_{i + 1}) \leq f (x_{i}) - (1 / 2 L) ‖ \nabla f (x_{i}) ‖^{2}$ (see Boumal [8, (11.15)]), the inequality in Lemma 3.8 part (1) seems less well-known; see Remark 3.14 for further discussion.

Proof.

Part (2). Let $γ (t) ≔ \exp_{x_{i}} - t \nabla f (x_{i})$ . Then we have

\begin{array}{rl} τ_{γ}^{- 1 / L} \nabla f (x_{i + 1}) & = \nabla f (x_{i}) + \int_{0}^{1 / L} \frac{d}{d s} τ_{γ}^{- s} \nabla f (γ (s)) d s \\ & = \nabla f (x_{i}) + \int_{0}^{1 / L} τ_{γ}^{- s} \nabla_{\dot{γ} (s)} \nabla f (γ (s)) d s \\ & = \nabla f (x_{i}) + \int_{0}^{1 / L} τ_{γ}^{- s} \nabla^{2} f (γ (s)) \dot{γ} (s) d s \\ & = L \int_{0}^{1 / L} τ_{γ}^{- s} (I - \frac{1}{L} \nabla^{2} f (γ (s))) τ_{γ}^{s} \nabla f (x_{i}) d s, \end{array}

(3.14)

where we use the definition (2.3) of $\nabla^{2}$ and $\dot{γ} (s) = τ_{γ}^{s} \dot{γ} (0) = - τ_{γ}^{s} \nabla f (x_{i})$ because $γ$ is a geodesic. Because $〈, 〉$ is invariant under parallel transport, the operator norm of $τ_{γ}^{- s} (I - (1 / L) \nabla^{2} f (γ (s))) τ_{γ}^{s}$ is equal to that of $I - (1 / L) \nabla^{2} f (γ (s))$. By convexity and L-smoothness, all eigenvalues of $\nabla^{2} f (γ (s))$ belong to [0, L]. Hence, we have

\begin{array}{rl} ‖ \nabla f (x_{i + 1}) ‖ & = ‖ τ_{γ}^{- 1 / L} \nabla f (x_{i + 1}) ‖ \leq L \int_{0}^{1 / L} ‖ I - \frac{1}{L} \nabla^{2} f (γ (s)) ‖ ‖ \nabla f (x_{i}) ‖ d s \\ & \leq ‖ \nabla f (x_{i}) ‖, \end{array}

which proves part (2).

We now prove part (1). From (3.14), we have

\begin{array}{l} ‖ τ_{γ}^{- 1 / L} \nabla f (x_{i + 1}) - \frac{1}{2} \nabla f (x_{i}) ‖ = L ‖ \int_{0}^{1 / L} τ_{γ}^{- s} (\frac{1}{2} I - \frac{1}{L} \nabla^{2} f (γ (s))) τ_{γ}^{s} \nabla f (x_{i}) d s ‖ \\ \leq L \int_{0}^{1 / L} ‖ (\frac{1}{2} I - \frac{1}{L} \nabla^{2} f (γ (s))) ‖ ‖ \nabla f (x_{i}) ‖ d s \leq \frac{1}{2} ‖ \nabla f (x_{i}) ‖ . \end{array}

By squaring this and applying the rearrangement $‖ a - b ‖^{2} \leq ‖ b ‖^{2} \Rightarrow ‖ a ‖^{2} \leq 2 〈 a, b 〉$ , we have $‖ τ_{γ}^{- 1 / L} \nabla f (x_{i + 1}) ‖^{2} \leq 〈 τ_{γ}^{- 1 / L} \nabla f (x_{i + 1}), \nabla f (x_{i}) 〉$ , particularly,

‖ \nabla f (x_{i + 1}) ‖^{2} \leq 〈 \nabla f (x_{i + 1}), τ_{γ}^{1 / L} \nabla f (x_{i}) 〉 .

(3.15)

From convexity, it holds that

\begin{array}{l} f (x_{i}) \geq f (x_{i + 1}) + \frac{1}{L} \frac{d}{d t} f (γ (1 / L - t)) ∣_{t = 0} = f (x_{i + 1}) - \frac{1}{L} 〈 \nabla f (x_{i + 1}), \dot{γ} (1 / L) 〉 \\ = f (x_{i + 1}) + \frac{1}{L} 〈 \nabla f (x_{i + 1}), τ_{γ}^{1 / L} \nabla f (x_{i}) 〉 \geq f (x_{i + 1}) + \frac{1}{L} ‖ \nabla f (x_{i + 1}) ‖^{2}, \end{array}

where we use (3.15) for the last inequality. □
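Both inequalities of Lemma 3.8 are easy to test along a concrete run. The sketch below assumes $M = R^{2}$ and the illustrative function $f (x) = \log (e^{x_{1}} + e^{x_{2}})$, whose Hessian eigenvalues are at most $1 / 2$, so the hypothetical constant `L_SMOOTH = 1.0` is a valid smoothness bound. A small tolerance is used because, for this f, part (1) becomes an equality in the limit.

```python
import math

L_SMOOTH = 1.0   # any upper bound on the Hessian eigenvalues is admissible

def f(x):
    """f(x) = log(exp(x[0]) + exp(x[1])), evaluated stably."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def grad_f(x):
    """Gradient of f (softmax weights)."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def check_lemma_3_8(x0, num_iters=50, tol=1e-9):
    """Return (ok, last_decrement) for the iteration (3.11) started at x0.

    ok is True if f(x_{i+1}) <= f(x_i) - ||grad f(x_{i+1})||^2 / L and
    ||grad f(x_{i+1})|| <= ||grad f(x_i)|| hold (up to tol) at every step.
    """
    x, ok, decrement = list(x0), True, 0.0
    for _ in range(num_iters):
        g = grad_f(x)
        x_next = [xk - gk / L_SMOOTH for xk, gk in zip(x, g)]
        g_next = grad_f(x_next)
        n2 = sum(v * v for v in g)
        n2_next = sum(v * v for v in g_next)
        decrement = f(x_next) - f(x)
        if decrement > -n2_next / L_SMOOTH + tol or n2_next > n2 + tol:
            ok = False
        x = x_next
    return ok, decrement
```

For instance, `check_lemma_3_8([3.0, -1.0])` should report that both inequalities hold, with a final decrement close to $- {(κ^{*})}^{2} / L = - 1 / 2$.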

Proof of Theorem 3.7.

The proof is similar to that of Theorem 3.1. Let $κ ≔ \lim_{i \to \infty} ‖ \nabla f (x_{i}) ‖ \geq κ^{*}$ . For $i > 0$ , we have

f (x_{i}) - f (x_{0}) \leq - \frac{1}{L} \sum_{k = 1}^{i} ‖ \nabla f (x_{k}) ‖^{2} \leq - \frac{i}{L} κ^{2},

(3.16)

d (x_{i}, x_{0}) \leq \sum_{k = 0}^{i - 1} d (x_{k}, x_{k + 1}) = \frac{1}{L} \sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖,

(3.17)

where (3.16) follows from Lemma 3.8 and (3.17) follows from the triangle inequality and $d (x, \exp_{x} u) = ‖ u ‖$ with (3.11). Then $d (x_{i}, x_{0}) \to \infty$ is shown as in the proof of Theorem 3.1.

Let $u_{i} \in S_{x_{0}}$ be defined via $x_{i} = \exp_{x_{0}} d (x_{i}, x_{0}) u_{i} .$ For $s \in (0, d (x_{i}, x_{0})]$ , by convexity of f along geodesic $s \mapsto \exp_{x_{0}} s u_{i}$ , it holds that

f (\exp_{x_{0}} s u_{i}) - f (x_{0}) \leq \frac{s}{d (x_{i}, x_{0})} (f (x_{i}) - f (x_{0})) .

From this, we have

\begin{array}{l} \frac{f (\exp_{x_{0}} s u_{i}) - f (x_{0})}{s} \leq \frac{f (x_{i}) - f (x_{0})}{d (x_{i}, x_{0})} \leq \frac{- \frac{1}{L} \sum_{k = 1}^{i} ‖ \nabla f (x_{k}) ‖^{2}}{d (x_{i}, x_{0})} \leq \frac{- \sum_{k = 1}^{i} ‖ \nabla f (x_{k}) ‖^{2}}{\sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖} \\ = - \frac{\sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖^{2}}{\sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖} + \frac{‖ \nabla f (x_{0}) ‖^{2} - ‖ \nabla f (x_{i}) ‖^{2}}{\sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖} \\ \leq - \frac{1}{i} \sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖ + \frac{‖ \nabla f (x_{0}) ‖^{2}}{\sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖} \leq - κ + \frac{1}{i} \frac{‖ \nabla f (x_{0}) ‖^{2}}{κ}, \end{array}

(3.18)

where the second inequality follows from (3.16), the third from (3.17) and the negativity of the numerator, the fourth from the Cauchy-Schwarz inequality ${(\sum_{k} F_{k} G_{k})}^{2} \leq \sum_{k} F_{k}^{2} \sum_{k} G_{k}^{2}$, and the fifth from Lemma 3.8 part (2).

Choose any convergent subsequence $\{u_{i_{k}}\}$ of $\{u_{i}\}$, which converges to $u^{*} \in S_{x_{0}}$. The second term of (3.18) vanishes as $i_{k} \to \infty$. Then it holds that

\frac{f (\exp_{x_{0}} s u^{*}) - f (x_{0})}{s} \leq - κ .

By $s \to \infty$ , we have $f^{\infty} (u^{*}) \leq - κ .$ The rest is the same as the last part of the proof of Theorem 3.1. □

We note the limiting behavior of the decrement of $f (x_{i})$ and the change of $\nabla f (x_{i})$ .

Lemma 3.9.

(1) $\lim_{i \to \infty} f (x_{i + 1}) - f (x_{i}) = - \frac{{(κ^{*})}^{2}}{L}$ .
(2) $\lim_{i \to \infty} ‖ τ_{x_{i} \to x_{i + 1}} \nabla f (x_{i}) - \nabla f (x_{i + 1}) ‖ = 0$ .

Proof.

Part (1). By convexity and Lemma 3.8 part (1), we have

- \frac{1}{L} ‖ \nabla f (x_{i}) ‖^{2} \leq f (x_{i + 1}) - f (x_{i}) \leq - \frac{1}{L} ‖ \nabla f (x_{i + 1}) ‖^{2} .

By $i \to \infty$ with Theorem 3.7, we have the claim.

Part (2). The inequality (3.15) is also written as

‖ \nabla f (x_{i + 1}) ‖^{2} \leq ‖ \nabla f (x_{i}) ‖ ‖ \nabla f (x_{i + 1}) ‖ \cos ∠ (\nabla f (x_{i + 1}), τ_{x_{i} \to x_{i + 1}} \nabla f (x_{i})) .

By $‖ \nabla f (x_{i}) ‖ \to κ^{*}$ , we have $∠ (\nabla f (x_{i + 1}), τ_{x_{i} \to x_{i + 1}} \nabla f (x_{i})) \to 0$ , and the claim follows. □

The discrete version of Proposition 3.4 is the following.

Proposition 3.10.

(1) $\lim_{i \to \infty} \frac{d (x_{0}, x_{i})}{i} = \frac{κ^{*}}{L}$ .
(2) $\lim_{i \to \infty} \frac{\exp_{x_{0}}^{- 1} x_{i}}{i} = \frac{κ^{*} ξ^{*}}{L}$ .

Proof.

Part (1). As in (3.17), it holds that $d (x_{0}, x_{i}) \leq \sum_{k = 0}^{i - 1} d (x_{k}, x_{k + 1}) = \sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖ / L$ . Hence, with (2.1) for $a_{i} ≔ ‖ \nabla f (x_{i}) ‖$ , we have

\underset{i \to \infty}{lim sup} \frac{d (x_{0}, x_{i})}{i} \leq \frac{1}{L} \underset{i \to \infty}{lim sup} \frac{1}{i} \sum_{k = 0}^{i - 1} ‖ \nabla f (x_{k}) ‖ \leq \frac{1}{L} \underset{k \to \infty}{lim sup} ‖ \nabla f (x_{k}) ‖ = \frac{κ^{*}}{L} .

(3.19)

On the other hand, for arbitrary $0 \leq i < j$ , we have

\begin{array}{l} - \frac{j - i}{L} ‖ \nabla f (x_{j}) ‖^{2} \geq - \frac{1}{L} \sum_{k = i + 1}^{j} ‖ \nabla f (x_{k}) ‖^{2} \geq (f (x_{j}) - f (x_{i})) \\ \geq 〈 \dot{γ} (0), \nabla f (x_{i}) 〉 d (x_{i}, x_{j}) \geq - ‖ \nabla f (x_{i}) ‖ d (x_{i}, x_{j}), \end{array}

where the first inequality follows from Lemma 3.8 part (2), the second from Lemma 3.8 part (1), and the third from the convexity of f along the unit-speed geodesic $γ$ from $x_{i}$ to $x_{j}$. Thus, for arbitrary $i \geq 0$, it holds that

\begin{array}{l} \underset{j \to \infty}{lim inf} \frac{d (x_{0}, x_{j})}{j} \geq \underset{j \to \infty}{lim inf} \frac{d (x_{i}, x_{j}) - d (x_{0}, x_{i})}{j} = \underset{j \to \infty}{lim inf} \frac{d (x_{i}, x_{j})}{j - i} \\ \geq \frac{1}{L} \underset{j \to \infty}{lim inf} \frac{‖ \nabla f (x_{j}) ‖^{2}}{‖ \nabla f (x_{i}) ‖} = \frac{1}{L} \frac{{(κ^{*})}^{2}}{‖ \nabla f (x_{i}) ‖} \underset{i \to \infty}{\to} \frac{κ^{*}}{L} . \end{array}

(3.20)

By (3.19) and (3.20), we have

\frac{κ^{*}}{L} \leq \underset{i \to \infty}{lim inf} \frac{d (x_{0}, x_{i})}{i} \leq \underset{i \to \infty}{lim sup} \frac{d (x_{0}, x_{i})}{i} \leq \frac{κ^{*}}{L} .

Part (2). As in the proof of Proposition 3.4 part (2), by Theorem 3.7 and the above part (1), we have

\lim_{i \to \infty} \frac{\exp_{x_{0}}^{- 1} x_{i}}{i} = \lim_{i \to \infty} \frac{\exp_{x_{0}}^{- 1} x_{i}}{d (x_{0}, x_{i})} \frac{d (x_{0}, x_{i})}{i} = \frac{κ^{*} ξ^{*}}{L} . □

For convergence of $\nabla f (x_{i})$ , the same property as in Proposition 3.5 holds:

Proposition 3.11.

\underset{i \to \infty}{lim inf} ‖ τ_{x_{i} \to x_{0}} \nabla f (x_{i}) + κ^{*} ξ^{*} ‖ = 0 .

Question 3.12.

Does $\lim_{i \to \infty} τ_{x_{i} \to x_{0}} \nabla f (x_{i}) = - κ^{*} ξ^{*}$ hold?

Proof of Proposition 3.11.

Let $d_{i} ≔ d (x_{0}, x_{i})$ . We first show

\underset{i \to \infty}{lim sup} d_{i + 1} - d_{i} = κ^{*} / L .

(3.21)

Indeed, by the triangle inequality and Theorem 3.7 part (1), we have ${lim sup}_{i \to \infty} d_{i + 1} - d_{i} \leq {lim sup}_{i \to \infty} d (x_{i}, x_{i + 1}) = {lim sup}_{i \to \infty} ‖ \nabla f (x_{i}) ‖ / L = κ^{*} / L$ . On the other hand, by Proposition 3.10 part (1), it holds that $κ^{*} / L = {lim sup}_{i \to \infty} d_{i} / i \leq {lim sup}_{i \to \infty} d_{i + 1} - d_{i}$ , where the inequality follows from (2.1) for $a_{i} ≔ d_{i + 1} - d_{i}$ .

Consider the geodesic triangle of vertices $x_{0}, x_{i - 1}, x_{i}$ . Let $γ_{i}$ denote the unit-speed geodesic from $x_{0}$ to $x_{i}$ . Let $θ_{i}$ denote the angle at vertex $x_{i}$ of this triangle. Then

θ_{i} = ∠ ({\dot{γ}}_{i} (d_{i}), - τ_{x_{i - 1} \to x_{i}} \nabla f (x_{i - 1})) .

By the law of cosines in CAT(0) space M (see, e.g., Bridson and Haefliger [9, II.1.9 (2)]), we have

\cos θ_{i} \geq \frac{d_{i}^{2} + d {(x_{i - 1}, x_{i})}^{2} - d_{i - 1}^{2}}{2 d_{i} d (x_{i - 1}, x_{i})} = \frac{d (x_{i - 1}, x_{i})}{2 d_{i}} + \frac{1}{2} (1 + \frac{d_{i - 1}}{d_{i}}) \frac{d_{i} - d_{i - 1}}{d (x_{i - 1}, x_{i})} .

Take ${lim sup}_{i \to \infty}$ in this inequality. By $d_{i} = d (x_{0}, x_{i}) \to \infty$ , $d (x_{i - 1}, x_{i}) = ‖ \nabla f (x_{i - 1}) ‖ / L \to κ^{*} / L$ (from Theorem 3.7 part (1)), $d_{i - 1} / d_{i} \to 1$ (seen from Proposition 3.10 part (1)), and (3.21), we have ${lim sup}_{i \to \infty} \cos θ_{i} \geq 1$ , and ${lim inf}_{i \to \infty} θ_{i} = 0 .$ By Lemma 3.9 part (2), it holds that $∠ (\nabla f (x_{i}), τ_{x_{i - 1} \to x_{i}} \nabla f (x_{i - 1})) \to 0$ and

\underset{i \to \infty}{lim sup} ∠ ({\dot{γ}}_{i} (d_{i}), \nabla f (x_{i})) = π .

By taking parallel transport $τ_{x_{i} \to x_{0}}$ and ${\dot{γ}}_{i} (0) \to ξ^{*}$ , we have the claim. □

Remark 3.13.

Another type of discrete gradient flow, well-studied in the literature of nonpositively curved space (see Bačák [5], Mayer [47], and Ohta and Pálfia [52]), is defined via the resolvent map $J_{λ}^{f} : M \to M$ ,

J_{λ}^{f} (x) ≔ \underset{y \in M}{arg min} f (y) + \frac{1}{2 λ} d {(x, y)}^{2} (x \in M),

(3.22)

where $λ$ is a positive parameter. Let $λ_{i}$ be a sequence of positive reals (satisfying $λ_{i} \to 0$ and $\sum_{i} λ_{i} \to \infty$). Then a discrete analogue (proximal point method) of gradient flow is as follows:

x_{i + 1} = J_{λ_{i}}^{f} (x_{i}) (i = 0, 1, \dots) .

(3.23)

For our manifold case, it can be written as an implicit difference scheme:

x_{i} = \exp_{x_{i + 1}} λ_{i} \nabla f (x_{i + 1}) .

(3.24)

Several nice (convergence) properties are known for the sequence of (3.23). For example, the contraction property (3.13) holds for the semigroup of (3.23) (see Bačák [5, theorem 2.2.23]). On the other hand, solving (3.22) is a nontrivial task from an algorithmic point of view.
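To make the resolvent concrete, the sketch below evaluates a single step of (3.23) under assumptions that are not in the text: $M = R$ and the illustrative function $f (y) = y + \log (1 + e^{y})$, whose derivative lies in $(1, 2)$, so that $κ^{*} = 1$. In this case (3.24) reads $x = y + λ f^{'} (y)$, and the unique root y is bracketed and found by bisection. The names `resolvent` and `iters` are hypothetical, and a fixed λ is used only for illustration.

```python
import math

def sigmoid(y):
    """Numerically stable logistic function."""
    if y >= 0.0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)

def grad_f(y):
    """Derivative of f(y) = y + log(1 + exp(y)); it takes values in (1, 2)."""
    return 1.0 + sigmoid(y)

def resolvent(x, lam, iters=80):
    """J_lam^f(x): the unique root y of y + lam * f'(y) = x, found by bisection."""
    lo, hi = x - 2.0 * lam, x - lam      # f' in (1, 2) brackets the root
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + lam * grad_f(mid) < x:  # left-hand side is increasing in mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The returned point satisfies the implicit scheme (3.24) up to the bisection accuracy, and two resolvent images are never farther apart than their preimages, as the contraction property suggests. Even in this one-dimensional example each step requires an inner root-finding loop, which reflects the algorithmic cost mentioned above.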

Remark 3.14.

In the case of $M = R^{n}$ , Lemma 3.8 part (1) can be easily obtained from a known inequality. For an L-smooth convex function f in $R^{n}$ , the following inequality holds (e.g., Beck [6, theorem 5.8 (iii)]):

f (y) - f (x) \geq 〈 \nabla f (x), y - x 〉 + \frac{1}{2 L} ‖ \nabla f (x) - \nabla f (y) ‖^{2} (x, y \in R^{n}),

though we do not know whether a reasonable manifold version holds. By substituting $x = x_{i + 1}, y = x_{i}$, and using $x_{i} - x_{i + 1} = \nabla f (x_{i}) / L$ and $‖ \nabla f (x_{i}) ‖ \geq ‖ \nabla f (x_{i + 1}) ‖$ (Lemma 3.8 part (2)), we have Lemma 3.8 part (1):

f (x_{i + 1}) \leq f (x_{i}) - \frac{1}{2 L} (‖ \nabla f (x_{i}) ‖^{2} + ‖ \nabla f (x_{i + 1}) ‖^{2}) \leq f (x_{i}) - \frac{1}{L} ‖ \nabla f (x_{i + 1}) ‖^{2} .
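The intermediate step behind this substitution can be spelled out as follows, writing $g_{i} ≔ \nabla f (x_{i})$ for brevity (a shorthand introduced only for this display):

```latex
\begin{aligned}
f(x_{i}) - f(x_{i+1})
&\geq \langle g_{i+1}, x_{i} - x_{i+1} \rangle + \frac{1}{2L} \| g_{i+1} - g_{i} \|^{2} \\
&= \frac{1}{L} \langle g_{i+1}, g_{i} \rangle + \frac{1}{2L} \left( \| g_{i+1} \|^{2} - 2 \langle g_{i+1}, g_{i} \rangle + \| g_{i} \|^{2} \right) \\
&= \frac{1}{2L} \left( \| g_{i} \|^{2} + \| g_{i+1} \|^{2} \right).
\end{aligned}
```

The cross terms cancel exactly because the step size is $1 / L$; for a different step size the inner product $〈 g_{i + 1}, g_{i} 〉$ would remain.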

3.3. Euclidean Specialization

Here, we present refinements of the above results for the Euclidean setting $M = R^{n}$ . To the best of our knowledge, the above convergence results on the gradient flow/descent seem to be new even in this special case, and they are further sharpened as follows. In the Euclidean space $M = R^{n}$ , the tangent space $T_{x}$ is also identified with $R^{n}$ for every $x \in M$ , where the inner product is given by $〈 u, v 〉 ≔ u^{⊤} v$ . The parallel transport $τ_{γ}$ for any path $γ$ is the identity map. Let $f : R^{n} \to R$ be a (smooth) convex function. We assume L-smoothness of f when Gradient Descent (3.11) is considered. The gradient $\nabla f (x) \in R^{n}$ and Hessian $\nabla^{2} f (x) \in R^{n \times n}$ are obtained by ${(\nabla f (x))}_{i} = (\partial / \partial x_{i}) f (x)$ and ${(\nabla^{2} f (x))}_{i j} = (\partial^{2} / \partial x_{i} \partial x_{j}) f (x)$ , respectively.

In this setting, the strong duality (Corollary 3.3) is written as

\inf_{p \in \bar{\nabla f (R^{n})}} ‖ p ‖ = \sup_{u \in R^{n} : ‖ u ‖ \leq 1} - f^{\infty} (u),

(3.25)

where $\bar{\nabla f (R^{n})}$ is the closure of the gradient image $\nabla f (R^{n}) = \{\nabla f (x) ∣ x \in R^{n}\}$. This relation itself is deduced within Euclidean convex analysis as follows. Let $f^{*} : R^{n} \to R \cup \{\infty\}$ be the Legendre-Fenchel conjugate of f:

f^{*} (p) ≔ \sup {〈 p, x 〉 - f (x) ∣ x \in R^{n}} (p \in R^{n}) .

Then, the gradient space $\bar{\nabla f (R^{n})}$ is equal to the closure $\bar{dom f^{*}}$ of the domain $dom f^{*} ≔ {p \in R^{n} ∣ f^{*} (p) < \infty}$ of $f^{*}$ . Indeed, this is because $\nabla f (R^{n}) \subseteq dom f^{*} \subseteq \bar{\nabla f (R^{n})}$ , where the first inclusion follows from $p = \nabla f (x) \Leftrightarrow f^{*} (p) = 〈 p, x 〉 - f (x)$ and the second from $f^{*} (p) < \infty \Leftrightarrow \inf_{x \in R^{n}} f (x) - 〈 p, x 〉 > - \infty \Rightarrow \inf_{x \in R^{n}} ‖ \nabla f (x) - p ‖ = 0$ . Also, it is known in convex analysis (Rockafellar [53, theorems 13.1 and 13.3]) that $f^{\infty}$ is equal to the support function of $dom f^{*}$ . Summarizing, it holds that

\bar{\nabla f (R^{n})} = \bar{dom f^{*}} = {p \in R^{n} ∣ 〈 u, p 〉 \leq f^{\infty} (u) (u \in R^{n})} .

(3.26)

In particular, the gradient space $\bar{\nabla f (R^{n})}$ is (closed) convex. Now, the equality in (3.25) is attained by the (uniquely determined) minimum-norm point $p^{*}$ of $\bar{\nabla f (R^{n})}$ and its negative direction $- p^{*} / ‖ p^{*} ‖$ ; see the proof of the next theorem. By Theorems 3.1 and 3.7, both $\nabla f (x (t))$ and $\nabla f (x_{i})$ converge to $p^{*}$ , and both $x (t)$ and $x_{i}$ converge to $- p^{*} / ‖ p^{*} ‖$ in cone topology.
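As a worked instance of (3.25) and (3.26), take the illustrative function $f (x) = \log (e^{x_{1}} + e^{x_{2}})$ on $R^{2}$ (an example chosen here, not taken from the text). Its gradient is the softmax vector, so the gradient space is a line segment, and both sides of (3.25) can be computed by hand:

```latex
\begin{aligned}
\overline{\nabla f(R^{2})} &= \{ p \in R^{2} \mid p_{1} + p_{2} = 1,\ p_{1}, p_{2} \geq 0 \}, \qquad f^{\infty}(u) = \max(u_{1}, u_{2}), \\
p^{*} &= (1/2, 1/2), \qquad \| p^{*} \| = \frac{1}{\sqrt{2}}, \\
\sup_{\| u \| \leq 1} - f^{\infty}(u) &= - \max(u^{*}_{1}, u^{*}_{2}) = \frac{1}{\sqrt{2}}, \qquad u^{*} = - \frac{p^{*}}{\| p^{*} \|} = - \frac{(1, 1)}{\sqrt{2}} .
\end{aligned}
```

Here the infimum on the left-hand side of (3.25) is not attained by any gradient $\nabla f (x)$, because the point $p^{*}$ is reached only when $x_{1} = x_{2}$ and every such gradient already equals $p^{*}$; it is, in fact, attained on the whole diagonal. The supremum on the right-hand side is attained at the unique direction $u^{*}$.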

Theorem 3.15.

Let $p^{*}$ denote the minimum-norm point of $\bar{\nabla f (R^{n})}$ . Suppose that $κ^{*} ≔ \inf_{x \in R^{n}} ‖ \nabla f (x) ‖ > 0$ .

(1) $\nabla f (x (t))$ converges to $p^{*}$ , and $x (t) / t$ converges to $- p^{*}$ .
(2) $\nabla f (x_{i})$ converges to $p^{*}$ , and $x_{i} / i$ converges to $- p^{*} / L$ .

Proof.

It suffices to show the claims for $x (t) / t$ and $x_{i} / i$ . We first verify that the unique minimizer of $f^{\infty}$ over the unit sphere is written as $- p^{*} / ‖ p^{*} ‖ ≕ u^{*}$ . Observe from the KKT-condition that ${p \in R^{n} ∣ 〈 u^{*}, p 〉 = f^{\infty} (u^{*})}$ is a supporting hyperplane of $\bar{\nabla f (R^{n})}$ at $p^{*}$ . Then, for any unit vector v, it holds that $f^{\infty} (v) \geq 〈 v, p^{*} 〉 \geq - ‖ p^{*} ‖ = 〈 u^{*}, p^{*} 〉 = f^{\infty} (u^{*})$ . In particular, $p^{*}$ and $u^{*} = - p^{*} / ‖ p^{*} ‖$ attain the equality in (3.25).

Then, by Theorem 3.1, we have $\lim_{t \to \infty} x (t) = - p^{*} / ‖ p^{*} ‖$ “in cone topology.” This implies that

\frac{- p^{*}}{‖ p^{*} ‖} = \lim_{t \to \infty} \frac{x (t) - x_{0}}{‖ x (t) - x_{0} ‖} = \lim_{t \to \infty} \frac{x (t)}{t} \frac{t}{d (x (t), x_{0})} = \lim_{t \to \infty} \frac{x (t)}{t} \frac{1}{‖ p^{*} ‖},

(3.27)

where the last equality follows from Proposition 3.4 with $‖ p^{*} ‖ = \lim_{t \to \infty} ‖ \nabla f (x (t)) ‖ = κ^{*}$. Thus, we have the latter part of Theorem 3.15 part (1). The latter part of part (2) is analogously shown by using Theorem 3.7 and Proposition 3.10 (for the sequence version of (3.27)). □
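Theorem 3.15 part (2) can be observed on an example in which the minimum-norm point is not a gradient value. The sketch assumes the illustrative function $f (x) = \log (1 + e^{x_{1}}) + x_{2}$ on $R^{2}$ (not taken from the text), for which $\bar{\nabla f (R^{2})} = [0, 1] \times \{1\}$, $p^{*} = (0, 1)$, and $L = 1 / 4$ is a valid smoothness constant; the helper `run_gradient_descent` is hypothetical.

```python
import math

L_SMOOTH = 0.25   # supremum of the second derivative of log(1 + exp(t))

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def grad_f(x):
    """Gradient of f(x) = log(1 + exp(x[0])) + x[1]."""
    return [sigmoid(x[0]), 1.0]

def run_gradient_descent(x0, num_iters):
    """Iterate (3.11) with step 1/L and return the last iterate and its gradient."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad_f(x)
        x = [xk - gk / L_SMOOTH for xk, gk in zip(x, g)]
    return x, grad_f(x)
```

After `num_iters = 2000` steps from `[2.0, 0.0]`, the gradient should be close to $p^{*} = (0, 1)$ and $x_{i} / i$ close to $- p^{*} / L = (0, - 4)$. The first coordinate of the gradient decays only like $O (1 / i)$ in this example.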

Because $- p^{*} = κ^{*} ξ^{*}$ , the expected convergence in Questions 3.6 and 3.12 holds in this case. We end this section with other interesting aspects.

3.3.1. Hessian Riemannian Gradient Flow.

Here we point out that the convergence of $\nabla f (x (t))$ to the minimum-norm point $p^{*}$ can also be explained via the theory of Hessian Riemannian gradient flows by Alvarez et al. [2]. Suppose for simplicity that the Hessian $\nabla^{2} f (x)$ is nonsingular for every $x \in R^{n}$ . Then, by the inverse mapping theorem applied to $x \mapsto \nabla f (x)$ (with the inverse $p \mapsto \nabla f^{*} (p)$ ), we see that $\nabla f (R^{n})$ is an open (convex) set.

Consider the continuous gradient flow $x (t)$ , and let $p (t) ≔ \nabla f (x (t))$ . One more differentiation in (3.1) yields

\dot{p} (t) = - \nabla^{2} f (x (t)) p (t) .

From $\nabla^{2} f (x (t)) = {(\nabla^{2} f^{*} (p (t)))}^{- 1}$ , we have the following ordinary differential equation (ODE) obeyed by $p (t)$ :

\dot{p} (t) = - {(\nabla^{2} f^{*} (p (t)))}^{- 1} p (t), p (0) = \nabla f (x_{0}) .

(3.28)

This can be interpreted as a gradient-flow ODE on a Riemannian manifold. Define a Riemannian metric ${〈, 〉}^{f}$ on open convex set $\nabla f (R^{n})$ by

{〈 u, v 〉}^{f} ≔ 〈 u, \nabla^{2} f^{*} (p) v 〉 (u, v \in T_{p} = R^{n}, p \in \nabla f (R^{n})) .

(3.29)

In this metric, the gradient $\nabla^{f} g (p)$ of $g : \nabla f (R^{n}) \to R$ is given by ${(\nabla^{2} f^{*} (p))}^{- 1} \nabla g (p)$ . Then (3.28) is viewed as the gradient flow of the squared-norm function $p \mapsto ‖ p ‖^{2} / 2$ :

\dot{p} (t) = - \nabla^{f} \frac{‖ p (t) ‖^{2}}{2}, p (0) = \nabla f (x_{0}) .

(3.30)

This is a particular instance of Hessian Riemannian gradient flow in Alvarez et al. [2]. Then, by Alvarez et al. [2, proposition 4.4], the solution $p (t)$ of (3.30) minimizes $‖ p ‖^{2} / 2$ over $\bar{\nabla f (R^{n})}$ in limit $t \to \infty$ , which proves $\lim_{t \to \infty} \nabla f (x (t)) = \lim_{t \to \infty} p (t) = p^{*}$ , the first part of Theorem 3.15 part (1).

3.3.2. Mirror Descent.

On the other hand, the discrete version (Theorem 3.15 part (2)) can be explained from the framework of mirror descent (Nemirovsky and Yudin [49]); we refer to Bubeck [10, chapter 4] for this framework. Consider a general optimization problem

\text{Min.} \quad g (p) \quad \text{s.t.} \quad p \in D,

(3.31)

where g is a differentiable convex function on an open convex set $D \subseteq R^{n}$. A mirror map $Φ : D \to R$ is a differentiable strictly convex function such that $\nabla Φ : D \to R^{n}$ is bijective and $‖ \nabla Φ (p) ‖ \to \infty$ if p goes to the boundary of $D$. A basic form of mirror descent produces the sequence $p_{1}, p_{2}, \dots$ in $D$ according to the update

\nabla Φ (p_{i + 1}) ≔ \nabla Φ (p_{i}) - β_{i} \nabla g (p_{i}),

(3.32)

where $β_{i} > 0$ is a step-size. It is well-known (see, e.g., Vishnoi [60, section 7.4]) that this update coincides with the proximal gradient descent relative to the Bregman divergence $D_{Φ} (q, p) ≔ Φ (q) - Φ (p) - 〈 \nabla Φ (p), q - p 〉$:

p_{i + 1} \in \underset{p \in D}{arg min} \{g (p_{i}) + 〈 \nabla g (p_{i}), p - p_{i} 〉 + \frac{1}{β_{i}} D_{Φ} (p, p_{i})\} .

(3.33)

Under several assumptions on $g, Φ$ , the solution $p_{i}$ (or the average solution $(1 / i) \sum_{j = 1}^{i} p_{j}$ or the best solution ever) is shown to converge to a minimizer of g (see, e.g., Lu et al. [46], Vishnoi [60, chapter 7], Bubeck [10, theorem 4.2], and Beck [6, section 9.2]).

Now, consider the setting $g (p) ≔ ‖ p ‖^{2} / 2$ and $D ≔ \nabla f (R^{n})$ . That is, (3.31) is the minimum-norm point problem on $\bar{\nabla f (R^{n})}$ . As a mirror map, we can choose the Legendre-Fenchel conjugate $Φ ≔ f^{*} ∣_{D}$ . Then, the update (3.32) becomes

\nabla f^{*} (p_{i + 1}) ≔ \nabla f^{*} (p_{i}) - β_{i} p_{i} .

(3.34)

Define $x_{i} \in R^{n}$ by $x_{i} ≔ \nabla f^{*} (p_{i})$ . Because $p_{i} = \nabla f (x_{i})$ , (3.34) becomes

x_{i + 1} ≔ x_{i} - β_{i} \nabla f (x_{i}) .

(3.35)

This is nothing but gradient descent, where the above Hessian Riemannian gradient flow is viewed as the continuous limit $\nabla^{2} f^{*} (p (t)) \dot{p} (t) = - p (t)$ of (3.34). Then, the first part of Theorem 3.15 part (2) can be deduced from Lu et al. [46, theorem 3.1]. Furthermore, an $O (1 / i)$ convergence rate is obtained if $f^{*} (p^{*}) < \infty$ ( $\Leftrightarrow$ $D_{f^{*}} (p^{*}, p) < \infty$ ). See Sakabe [54] for details.
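The identification of (3.34) with (3.35) can be checked numerically on a one-dimensional toy instance. The sketch below assumes the hypothetical function $f (x) = \log (e^{x} + e^{2 x})$, which is not taken from the text; for it, $\nabla f (x) = 1 + 1 / (1 + e^{- x})$ maps $R$ onto $D = (1, 2)$, its inverse $\nabla f^{*}$ is a shifted logit, and the minimum-norm point of $\bar{D}$ is $p^{*} = 1$. The step-size $β = 1 / 4$ corresponds to $L = 4$, the bound given in Section 3.3.3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(u):
    return math.log(u / (1.0 - u))

# Hypothetical toy instance (not from the paper): f(x) = log(e^x + e^{2x}).
def grad_f(x):
    return 1.0 + sigmoid(x)      # takes values in D = (1, 2)

def grad_f_conj(p):
    return logit(p - 1.0)        # inverse of grad_f, i.e. the gradient of f* on D

beta = 0.25                      # step-size 1/L with L = max |omega|^2 = 4
x = 0.0                          # gradient-descent iterate x_i
p = grad_f(x)                    # mirror-descent iterate p_i, started at grad f(x_0)
max_gap = 0.0
for _ in range(40):
    x = x - beta * grad_f(x)                  # update (3.35)
    p = grad_f(grad_f_conj(p) - beta * p)     # update (3.34), solved for p_{i+1}
    max_gap = max(max_gap, abs(p - grad_f(x)))

print(max_gap, p)
```

In this sketch the two sequences agree up to rounding, and $p_{i}$ approaches the minimum-norm point 1 while $x_{i}$ diverges to $- \infty$.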

It may be interesting to develop a manifold analogue of these observations, which may use the space $\nabla^{\infty} f (M) \subseteq C M^{\infty}$ in Hirai [30]. Related to this issue, in Section 4.1, we will consider an analogous gradient flow (Kirwan’s flow) in the complex projective space $P (V)$ .

3.3.3. Matrix Scaling and Geometric Programming.

The matrix scaling problem (Sinkhorn [58]) is this: For a given nonnegative matrix $A = (a_{i j}) \in R_{+}^{n \times n}$ , find positive diagonal matrices (scaling matrices) X, Y such that XAY approximates a doubly stochastic matrix, that is, $‖ (XAY) 1 - 1 ‖ \approx 0$ and $‖ {(XAY)}^{⊤} 1 - 1 ‖ \approx 0$ . Define a convex function $f_{A} : R^{n} \times R^{n} \to R$ by

f_{A} (x, y) ≔ \log \sum_{i, j} a_{i j} e^{x_{i} + y_{j}} - 1^{⊤} x / n - 1^{⊤} y / n (x \in R^{n}, y \in R^{n}) .

(3.36)

From $\nabla f_{A} (x, y) = (XAY 1 - 1, {(XAY)}^{⊤} 1 - 1) / n$ for $(X, Y) ≔ (e^{diag x}, e^{diag y}) \sqrt{n / \sum_{i, j} a_{i j} e^{x_{i} + y_{j}}}$ , the required scaling matrices X, Y are obtained from $(x, y)$ having small gradient-norm $‖ \nabla f_{A} (x, y) ‖$ . Particularly, such a point $(x, y)$ is obtained by minimizing $f_{A}$ .
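The gradient formula for (3.36) can be sanity-checked against finite differences. The following is a minimal sketch on a hypothetical $2 \times 2$ matrix; the helper names `f_A` and `grad_f_A` and the test point are illustrative assumptions, not part of the original development.

```python
import math

def f_A(A, x, y):
    """The function (3.36) for a square nonnegative matrix A."""
    n = len(A)
    S = sum(A[i][j] * math.exp(x[i] + y[j]) for i in range(n) for j in range(n))
    return math.log(S) - sum(x) / n - sum(y) / n

def grad_f_A(A, x, y):
    """Gradient via the scaled matrix M = XAY, whose entries sum to n."""
    n = len(A)
    W = [[A[i][j] * math.exp(x[i] + y[j]) for j in range(n)] for i in range(n)]
    S = sum(map(sum, W))
    M = [[n * W[i][j] / S for j in range(n)] for i in range(n)]
    gx = [(sum(M[i]) - 1.0) / n for i in range(n)]                       # (XAY 1 - 1) / n
    gy = [(sum(M[i][j] for i in range(n)) - 1.0) / n for j in range(n)]  # ((XAY)^T 1 - 1) / n
    return gx + gy, M

A = [[1.0, 2.0], [3.0, 4.0]]          # hypothetical instance
x, y = [0.3, -0.1], [0.2, 0.5]
g, M = grad_f_A(A, x, y)

# Central finite differences in each of the 2n coordinates.
h = 1e-6
fd = []
for k in range(4):
    z_plus, z_minus = (x + y)[:], (x + y)[:]
    z_plus[k] += h
    z_minus[k] -= h
    fd.append((f_A(A, z_plus[:2], z_plus[2:]) - f_A(A, z_minus[:2], z_minus[2:])) / (2 * h))
fd_error = max(abs(a - b) for a, b in zip(g, fd))
print(fd_error)
```

A small `fd_error` indicates that the row and column sums of the normalized matrix XAY reproduce the partial derivatives of $f_{A}$.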

This matrix scaling optimization falls into a more general class of convex optimization, called geometric programming, to which our results are applicable. A geometric program asks to minimize a function $f : R^{n} \to R$ of the following form:

f (x) = \log \sum_{ℓ = 1}^{N} a_{ℓ} e^{ω_{ℓ}^{⊤} x} (x \in R^{n}),

(3.37)

where $a_{ℓ} > 0$ and $ω_{ℓ} \in R^{n}$ for $ℓ = 1, 2, \dots, N$. It is well-known (see, e.g., Bürgisser et al. [11]) that

f is L-smooth convex with $L ≔ \max_{ℓ} ‖ ω_{ℓ} ‖^{2}$ , and
$\bar{\nabla f (R^{n})} = Conv {ω_{ℓ}}_{ℓ \in [N]}$ .

Therefore, with $L = 2$ , by Gradient Descent (3.11) applied to (3.36), the gradient sequence $\nabla f_{A} (x_{i})$ converges to the minimum-norm point $p^{*}$ of $Conv {e_{i} + e_{j} ∣ i, j : a_{i j} > 0}$ .
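This convergence can be observed on a small non-scalable instance. The sketch below assumes a hypothetical $2 \times 2$ matrix whose second row is zero. Reading off (3.36), its exponent vectors are $(e_{i}, e_{j}) - (1 / n) 1$ for $a_{i j} > 0$, so the gradients lie on the segment between $(1 / 2, - 1 / 2, 1 / 2, - 1 / 2)$ and $(1 / 2, - 1 / 2, - 1 / 2, 1 / 2)$, whose minimum-norm point is the midpoint $(1 / 2, - 1 / 2, 0, 0)$.

```python
import math

def grad_f_A(A, x, y):
    """Gradient of (3.36): normalized row and column sums minus 1/n."""
    n = len(A)
    W = [[A[i][j] * math.exp(x[i] + y[j]) for j in range(n)] for i in range(n)]
    S = sum(map(sum, W))
    gx = [sum(W[i]) / S - 1.0 / n for i in range(n)]
    gy = [sum(W[i][j] for i in range(n)) / S - 1.0 / n for j in range(n)]
    return gx + gy

A = [[1.0, 2.0], [0.0, 0.0]]      # hypothetical instance with a zero row
n, L = 2, 2.0
x, y = [0.0] * n, [0.0] * n
for _ in range(200):
    g = grad_f_A(A, x, y)
    x = [x[i] - g[i] / L for i in range(n)]          # gradient step with step-size 1/L
    y = [y[j] - g[n + j] / L for j in range(n)]

g = grad_f_A(A, x, y)
p_star = [0.5, -0.5, 0.0, 0.0]    # midpoint of the segment described above
error = max(abs(a - b) for a, b in zip(g, p_star))
print(g, error)
```

Here the iterate $x$ diverges linearly while the gradient settles at the minimum-norm point, with norm $1 / \sqrt{2}$.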

We will show in Section 4.2 for the general setting of operator scaling that the point $p^{*}$ and the limit of XAY are characterized by a canonical block-triangular form of A, known as (an extended version of) the DM-decomposition (Dulmage and Mendelsohn [17]; see also Murota [48, section 2.2.3]). A similar convergence property was earlier shown by Hayashi et al. [29] for the Sinkhorn algorithm (Sinkhorn [58]), the standard alternating minimization algorithm for (3.36), in which the gradient $\nabla f_{A} (x, y)$ and the scaled matrix XAY oscillate between two limit points described by the DM-decomposition.

4. Application

4.1. Norm-Minimization in Reductive Group Action

We consider the formulation of noncommutative optimization in Bürgisser et al. [13]; see also Hirai et al. [34]. Let $G \subseteq G L_{n}$ be a connected reductive algebraic group over $C$ , where we assume that it is self-adjoint $G = G^{†}$ (via conjugation (Wallach [61, theorem 3.13])). Its Lie algebra $g$ is the complexification of the Lie algebra $k$ of a maximal compact subgroup $K = G \cap U_{n}$ as $g = k + i k$ , where $i k \subseteq p_{n}$ . The inner product $〈, 〉$ on $g$ is defined by $〈 X, Y 〉 ≔ R e tr X Y^{†}$ . Let V be a finite-dimensional vector space over $C$ . Let $π : G \to G L (V)$ be a rational representation, where $Π$ denotes its Lie algebra representation: $Π (X) ≔ (d / d t) π (e^{t X}) ∣_{t = 0}$ . Consider a K-invariant Hermitian inner product $(,)$ and the associated norm $‖ \cdot ‖ = \sqrt{(\cdot, \cdot)}$ on V. The norm-minimization problem over the orbit $π (G) v$ of $v \in V ∖ {0}$ is given by

inf . ‖ π (g) v ‖ s . t . g \in G .

(4.1)

It turned out (e.g., Bürgisser et al. [13]) that this class of optimization problems has numerous, sometimes unexpected, applications and connections in various fields of mathematical sciences. The most fundamental problem is to ask whether the infimum is zero, that is, whether the origin 0 is in the orbit closure $\bar{π (G) v}$ . This is the semistability problem in GIT. The representation $π$ gives rise to a Hamiltonian action $(g, [v]) \mapsto [π (g) v]$ on the complex projective space $P (V)$ . The corresponding (modified⁴) moment map $μ : V \to i k$ is given by

〈 μ (v), H 〉 ≔ \frac{(v, Π (H) v)}{(v, v)} (v \in V, H \in i k),

(4.2)

where $μ$ may be regarded as $P (V) \to i k$. The following theorem is fundamental:

Theorem 4.1

(Kempf-Ness Theorem, Hilbert-Mumford Criterion; see Georgoulas et al. [24, Theorem 8.5 (i), Theorem 12.4]). For $v \in V ∖ {0}$ , the following conditions are equivalent:

(i) $\inf_{g \in G} ‖ π (g) v ‖ = 0$ .
(ii) $\inf_{g \in G} ‖ μ (π (g) v) ‖ > 0$ .
(iii) There is a 1-parameter subgroup $t \mapsto e (t)$ of G such that $\lim_{t \to \infty} π (e (t)) v = 0$ .

The orbit $π (G) v$ in this situation is called unstable. Otherwise, it is called semistable. Accordingly, we call the 1-parameter subgroup $e (t)$ in (iii) a destabilizing 1-PSG.

The unstability corresponds to the lower-unboundedness of the Kempf-Ness function $F_{v}$ on the group G defined by

F_{v} (g) ≔ \frac{1}{2} \log ‖ π (g) v ‖^{2} (g \in G) .

(4.3)

Because $‖ \cdot ‖$ is K-invariant, the Kempf-Ness function is viewed as a function on the symmetric space $K \ G$ . By $‖ π (g) v ‖^{2} = (π (g^{†} g) v, v)$ and $K \ G ≃ P_{n} \cap G$ by $K g \mapsto g^{†} g$ , we may consider the following version of the Kempf-Ness function $f_{v}$ on $P_{n} \cap G$ :

f_{v} (x) ≔ \log (π (x) v, v) (x \in P_{n} \cap G) .

(4.4)

It is clear that $f_{v} (g^{†} g) = 2 F_{v} (g)$ . Then, $f_{v}$ is an L-smooth convex function such that the transported gradient of $f_{v}$ provides the moment map $μ$ :

Lemma 4.2

(Bürgisser et al. [13]).

(1) $f_{v}$ is $N_{π}^{2}$ -smooth convex, where $N_{π}$ is the maximum of the norm of a weight for $π$ .
(2) $τ_{x \to I} \nabla f_{v} (x) = μ (π (x^{1 / 2}) v)$ .

The second property (2) is implicit in Bürgisser et al. [13] and follows from $τ_{x \to I} H = x^{- 1 / 2} H x^{- 1 / 2}$ and ${〈 \nabla f_{v} (x), H 〉}_{x} = (d / d t) f_{v} (x^{1 / 2} e^{t x^{- 1 / 2} H x^{- 1 / 2}} x^{1 / 2}) ∣_{t = 0} = {〈 μ (π (x^{1 / 2}) v), x^{- 1 / 2} H x^{- 1 / 2} 〉}_{I}$ . In particular, for the Kempf-Ness function $f_{v}$ , the unboundedness is equivalent to the positivity of the minimum gradient-norm. Applying Corollary 3.3, we have:

Theorem 4.3.

$\inf_{g \in G} ‖ μ (π (g) v) ‖ = \sup_{ξ \in B_{I}} - f_{v}^{\infty} (ξ)$ . If $f_{v}^{\infty} (ξ) < 0$ , then $t \mapsto e^{t ξ}$ is a destabilizing 1-PSG.

Proof.

$\inf_{g \in G} ‖ μ (π (g) v) ‖ = \inf_{x \in P_{n} \cap G} ‖ μ (π (x^{1 / 2}) v) ‖$ follows from $μ (π (u g) v) = u μ (π (g) v) u^{†}$ for $u \in K$ , the polar decomposition $g = u x$ for $u \in K$ , $x \in P_{n} \cap G$ , and $x \in P_{n} \cap G \Rightarrow x^{a} \in P_{n} \cap G$ (because G is algebraic). The latter part can be seen from the definitions of the Kempf-Ness function (4.4) and the recession function (2.4). □

As seen below, this is a part of the theory of moment-weight inequality (Georgoulas et al. [24]), in which the recession function $f_{v}^{\infty}$ is essentially Mumford’s numerical invariant, called the $μ$ -weight; see Lemma 4.13.

Consider applying gradient descent to $f_{v}$ :

x_{k + 1} = \exp_{x_{k}} (- \frac{1}{L} \nabla f_{v} (x_{k})), x_{0} = I,

(4.5)

where $L ≔ N_{π}^{2}$. In this setting, updating group elements $g_{k}$ in G may be more suitable:

g_{k + 1} = e^{- \frac{1}{2 L} μ (π (g_{k}) v)} g_{k}, g_{0} = I .

(4.6)

This is the first-order algorithm in Bürgisser et al. [13]. Each of the two updates (4.5) and (4.6) has its own advantage. Their relation is given by:

Lemma 4.4.

$x_{k} = g_{k}^{†} g_{k} .$

Proof.

If $g_{+} = e^{- \frac{1}{2 L} μ (π (g) v)} g$ and $g = u x^{1 / 2}$ for $u \in K$ , $x \in P_{n} \cap G$ , then it holds that $g_{+}^{†} g_{+} = g^{†} e^{- \frac{1}{L} μ (π (g) v)} g = x^{1 / 2} u^{†} e^{- \frac{1}{L} μ (π (u x^{1 / 2}) v)} u x^{1 / 2} = x^{1 / 2} e^{- \frac{1}{L} μ (π (x^{1 / 2}) v)} x^{1 / 2} = \exp_{x} (- \frac{1}{L} \nabla f_{v} (x))$ , where the third equality follows from $μ (π (u) v^{'}) = u μ (v^{'}) u^{†}$ and the fourth from Lemma 4.2 part (2). □

For the semistable case, Bürgisser et al. [13] showed its iteration complexity to compute $\inf_{g \in G} ‖ π (g) v ‖$ and to find $g \in G$ with $‖ μ (π (g) v) ‖ \approx 0$ . For the unstable case, our result (Theorem 3.7) implies that Gradient Descent (4.5) constructs a destabilizing 1-PSG in the limit, which is maximally destabilizing in the sense that it is obtained from the unique minimizer of $f_{v}^{\infty}$ over $S_{I} (P_{n} \cap G)$ (recall that $S_{I}$ denotes the unit sphere in $T_{I}$ ). This special 1-PSG is the same as the one shown by Kempf [42].

Theorem 4.5.

Suppose that $\inf_{g \in G} ‖ π (g) v ‖ = 0$ . Let $x_{k}$ be the sequence of (4.5), and let $u_{k}$ be the sequence defined by $x_{k} = e^{d (x_{k}, I) u_{k}}$ . Then $u_{k}$ converges to the unique minimizer $ξ^{*}$ of $f_{v}^{\infty}$ over $S_{I}$ , where $t \mapsto e^{t ξ^{*}}$ is a maximally destabilizing 1-PSG.

Unfortunately, because $f_{v}^{\infty}$ is not necessarily (upper semi)continuous, this theorem does not imply the algorithmic statement that $t \mapsto e^{t u_{k}}$ is a destabilizing 1-PSG for some large k. Therefore, we need a certain rounding idea to obtain a destabilizing 1-PSG from $u_{k}$ . We see in Section 4.2 that such a rounding is possible for the left-right action.

We also consider convergence of the moment-map sequence $μ (π (g_{k}) v)$ . Let $C_{π} \subseteq i k = T_{I} (P_{n} \cap G)$ denote a positive Weyl chamber: it is a convex cone with the property that for any $H \in i k$ there is a unique point in $C_{π}$ , denoted by $spec H$ , satisfying $spec H = k H k^{†}$ for some $k \in K$ . The moment polytope $Δ_{v} \subseteq C_{π}$ is defined as the closure of the image of $g \mapsto spec μ (π (g) v)$ :

Δ_{v} ≔ \bar{{spec μ (π (g) v) ∣ g \in G}} .

The convexity theorem by Guillemin and Sternberg [25], Guillemin and Sternberg [26], and Kirwan [43] says that it is a convex polytope.

Theorem 4.6

(Convexity Theorem (Guillemin and Sternberg [25], Guillemin and Sternberg [26], Kirwan [43])). $Δ_{v}$ is a convex polytope.

By Lemma 4.2 part (2), the polar decomposition $g = u x^{1 / 2}$ for $g \in G$ , $u \in K$ , $x \in P_{n} \cap G$ , and $μ (π (u x^{1 / 2}) v) = u μ (π (x^{1 / 2}) v) u^{†}$ , it holds that

\inf_{x \in P_{n} \cap G} ‖ \nabla f_{v} (x) ‖ = \inf_{x \in P_{n} \cap G} ‖ μ (π (x^{1 / 2}) v) ‖ = \inf_{g \in G} ‖ μ (π (g) v) ‖ = \inf_{g \in G} ‖ spec μ (π (g) v) ‖ = \inf_{p \in Δ_{v}} ‖ p ‖

(4.7)

By Theorem 3.7, we have the convergence of $spec μ (π (g_{k}) v) (= spec μ (π (x_{k}^{1 / 2}) v))$ along the gradient-descent trajectory, which is an analogue of Theorem 3.15 part (2).

Theorem 4.7.

Let $p^{*}$ be the minimum-norm point of $Δ_{v}$ , and let $H_{k}$ be the sequence defined by $x_{k} = e^{k H_{k} / L}$ . Suppose that $\inf_{g \in G} ‖ π (g) v ‖ = 0$ . Then, both $spec μ (π (g_{k}) v)$ and $spec (- H_{k})$ converge to $p^{*}$ for $k \to \infty$ .

Proof.

It suffices to show the claim for $H_{k}$ . By Proposition 3.10 part (2), Proposition 3.11, and Lemma 4.2 part (2), it holds that

\underset{k \to \infty}{lim inf} ‖ μ (π (x_{k}^{1 / 2}) v) + H_{k} ‖ = 0 .

Because $spec μ (π (x_{k}^{1 / 2}) v)$ converges to $p^{*}$ and $H_{k}$ converges (to $κ^{*} ξ^{*}$ ), it must hold that $spec (- H_{k})$ converges to $p^{*}$ . □

Question 3.12, if it is true, would imply the stronger convergence $\lim_{k \to \infty} μ (π (x_{k}^{1 / 2}) v) = - \lim_{k \to \infty} H_{k}$ .

4.1.1. Moment-Weight Inequality and Gradient Flow of Moment-Map Squared.

Clearly, via Theorem 3.1, the above results (Theorems 4.5 and 4.7) hold for the gradient flow:

\dot{x} (t) = - \nabla f_{v} (x (t)), x (0) = I .

(4.8)

Our consideration of this case falls into the theory of moment-weight inequality by Georgoulas et al. [24], which builds upon the earlier work by Kempf, Kirwan, Mumford, and Ness in GIT, and the recent work by Chen and Sun [15] in K-stability. Here, we briefly summarize the relation by deducing an important part of the theory from our results in Section 3.1. We use notation $g \cdot [v] ≔ [π (g) v]$ for the action on $P (V)$ . According to Georgoulas et al. [24, chapter 3], consider the gradient flow (Kirwan’s flow) of the squared-norm of the moment map on $P (V)$ :

\dot{ζ} (t) = - \nabla \frac{‖ μ (ζ (t)) ‖^{2}}{2}, ζ (0) = [v] .

(4.9)

This is the gradient flow of a real analytic function $ζ \mapsto ‖ μ (ζ) ‖^{2} / 2$ on a compact Riemannian manifold $P (V)$ (with respect to the Fubini-Study metric). By the standard argument of the Łojasiewicz gradient inequality, the limit of $ζ (t)$ exists.

Theorem 4.8

(Convergence Theorem (Georgoulas et al. [24, Theorem 3.3])). The limit $ζ_{\infty} ≔ \lim_{t \to \infty} ζ (t)$ exists.

Further, the limit $ζ_{\infty}$ attains the infimum of the moment-map norm over the orbit $G \cdot [v]$ in $P (V)$ .

Theorem 4.9

(Moment-Limit Theorem (Georgoulas et al. [24, Theorem 6.4])). Let $ζ (t)$ be the solution of (4.9), and let $ζ_{\infty} ≔ \lim_{t \to \infty} ζ (t)$ . Then it holds that

‖ μ (ζ_{\infty}) ‖ = \inf_{g \in G} ‖ μ (g \cdot [v]) ‖ .

(4.10)

The equality (4.10) can be understood from Theorem 3.1 as follows. Regard G as a Riemannian manifold by the right-invariant Riemannian metric ${〈 X, Y 〉}_{g} ≔ R e tr X g^{- 1} {(Y g^{- 1})}^{†}$ for $X, Y \in T_{g}, g \in G$ , and consider the gradient flow of $F_{v}$ on G:

\dot{g} (t) = - \nabla F_{v} (g (t)), g (0) = I .

(4.11)

Then, the solution $ζ (t)$ is obtained from the action of $g (t)$ as follows:

Theorem 4.10

(Georgoulas et al. [24, Theorem 4.1 (ii)]). The solution $ζ (t)$ of (4.9) is represented as $ζ (t) = g (t) \cdot [v]$ for the solution $g (t)$ of (4.11).

Proof sketch.

Define $φ : G \to P (V)$ by $g \mapsto g \cdot [v]$ . Then, by adapting Georgoulas et al. [24, (4.3)] with our notation, it holds that $d φ_{g} \nabla F_{v} (g) = \nabla \frac{‖ μ (g \cdot [v]) ‖^{2}}{2}$ . Thus, $(d / d t) (g (t) \cdot [v]) = (d / d t) φ (g (t)) = d φ_{g (t)} \dot{g} (t) = - d φ_{g (t)} \nabla F_{v} (g (t)) = - \nabla \frac{‖ μ (g (t) \cdot [v]) ‖^{2}}{2}$ , implying that $g (t) \cdot [v]$ is the solution $ζ (t)$ of (4.9). □

We can see that $\nabla F_{v} (g) = μ (π (g) v) g$ and (4.6) is the discretization (gradient descent) of (4.11). Analogously to Lemma 4.4, the relation between $x (t)$ and $g (t)$ is given by:

Lemma 4.11.

$x (2 t) = g {(t)}^{†} g (t) .$

Proof.

For $H \in T_{g^{†} g} (P_{n} \cap G)$ , it holds that ${〈 \nabla f_{v} (g^{†} g), H 〉}_{g^{†} g} = \frac{d}{d t} ∣_{t = 0} f_{v} (g^{†} e^{t g^{- †} H g^{- 1}} g)$ $= \frac{d}{d t} ∣_{t = 0} 2 F_{v} (e^{t g^{- †} H g^{- 1} / 2} g) = {〈 \nabla F_{v} (g), g^{- †} H 〉}_{g} = {〈 g^{†} \nabla F_{v} (g) + \nabla F_{v} {(g)}^{†} g, H / 2 〉}_{g^{†} g}$ . Hence, it holds that $2 \nabla f_{v} (g^{†} g) = g^{†} \nabla F_{v} (g) + \nabla F_{v} {(g)}^{†} g$ , and

\frac{d}{d s} (g {(s)}^{†} g (s)) = \dot{g} {(s)}^{†} g (s) + g {(s)}^{†} \dot{g} (s) = - \nabla F_{v} {(g (s))}^{†} g (s) - g {(s)}^{†} \nabla F_{v} (g (s)) = - 2 \nabla f_{v} (g {(s)}^{†} g (s)) .

Thus, $x (t) ≔ g {(t / 2)}^{†} g (t / 2)$ satisfies (4.8). □

Therefore, the moment-limit theorem (Theorem 4.9) follows from $‖ μ (ζ_{\infty}) ‖ = \lim_{t \to \infty} ‖ μ (π (g (t)) v) ‖ = \lim_{t \to \infty} ‖ μ (π (x {(t)}^{1 / 2}) v) ‖ = \inf_{x \in P_{n} \cap G} ‖ \nabla f_{v} (x) ‖ = \inf_{g \in G} ‖ μ (g \cdot [v]) ‖$ . Accordingly, an analogue of Theorem 3.15 part (1) (or the continuous version of Theorem 4.7) is the following.

Theorem 4.12.

Let $p^{*}$ be the minimum-norm point of $Δ_{v}$ , and let $H (t)$ be the function defined by $x (t) = e^{t H (t)}$ . Suppose that $\inf_{g \in G} ‖ π (g) v ‖ = 0$ . Then, both $spec μ (π (g (t)) v)$ and $spec (- H (t))$ converge to $p^{*}$ for $t \to \infty$ .

Proof.

It suffices to show the claim for $H (t)$ . By Proposition 3.4 part (2), Proposition 3.5, and Lemma 4.2 part (2), it holds that

\underset{t \to \infty}{lim inf} ‖ μ (π (x {(t)}^{1 / 2}) v) + H (t) ‖ = 0 .

(4.12)

The rest is the same as in the proof of Theorem 4.7. □

Contrary to $g (t) \cdot [v]$ , we do not know whether $x {(t)}^{1 / 2} \cdot [v]$ converges.⁵ At least, if Question 3.6 is affirmative, then $μ (π (x {(t)}^{1 / 2}) v)$ converges. On the other hand, $μ (g (t) \cdot [v])$ converges to $μ (ζ_{\infty})$ , $- H (t)$ converges to $- H_{\infty} (= - κ^{*} ξ^{*})$ , and they have the same spectrum $p^{*}$ . Therefore, there is $u_{\infty} \in K$ such that $u_{\infty} μ (ζ_{\infty}) u_{\infty}^{†} = - H_{\infty}$ . This fact is a part of the generalized Kempf existence theorem (Georgoulas et al. [24, theorem 10.4, (10.9)]). In particular, Theorem 4.7 can be viewed as a discrete version of the moment-limit theorem, though we do not know whether $ζ_{k} ≔ g_{k} \cdot [v]$ converges.

We next explain the moment-weight inequality. The (restriction of) $μ$ -weight $w_{μ} : P (V) \times i k \to R \cup {\infty}$ is defined by

w_{μ} ([v], H) ≔ \lim_{t \to \infty} tr μ (π (e^{t H}) v) H ([v] \in P (V), H \in i k = T_{I} (P_{n} \cap G)),

(4.13)

where the existence of the limit is seen in the proof of the next lemma. The $μ$-weight is nothing but the recession function $f_{v}^{\infty}$ of $f_{v}$:

Lemma 4.13

(See Georgoulas et al. [24, Lemma 5.2]). $w_{μ} ([v], H) = f_{v}^{\infty} (H)$ .

Proof.

By recalling (2.4), it holds that $f_{v}^{\infty} (H) = \lim_{t \to \infty} (d / d t) f_{v} (e^{t H}) = \lim_{t \to \infty} tr μ (π (e^{t H}) v) H = w_{μ} ([v], H)$ , where the second equality follows from Lemma 4.2 part (2). □

We now state the main part of the theory of moment-weight inequality (for linear actions).

Theorem 4.14

(Moment-Weight Inequality (Georgoulas et al. [24, Theorems 6.7, 10.1, 10.2, 10.4])). It holds that

\inf_{g \in G} ‖ μ (g \cdot [v]) ‖ \geq \sup_{H \in i k ∖ {0}} \frac{- w_{μ} ([v], H)}{‖ H ‖} .

(4.14)

Suppose that $κ^{*} ≔ \inf_{g \in G} ‖ μ (g \cdot [v]) ‖ > 0$ . Then the equality in (4.14) holds, and the supremum is attained by unique $H^{*} \in i k$ with $‖ H^{*} ‖ = 1$ , obtained as follows: Let $H (t)$ be defined by $x (t) = e^{t H (t)}$ for solution $x (t)$ of (4.8). Then the limit $H_{\infty} ≔ \lim_{t \to \infty} H (t)$ exists, $‖ H_{\infty} ‖ = κ^{*}$ , and $H^{*} = H_{\infty} / ‖ H_{\infty} ‖$ .

From our convex optimization perspective, the moment-weight inequality (4.14) is explained by the weak duality (Lemma 2.2). The equality case is explained by the strong duality (Theorem 3.1), the gradient-flow construction of the unique minimizer of $f_{v}^{\infty}$ , and the formula of the velocity of escape (Proposition 3.4).
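In the commutative (Euclidean) case, this duality can be observed directly on a geometric program (3.37), for which the recession function is $f^{\infty} (ξ) = \max_{ℓ} ω_{ℓ}^{⊤} ξ$. The sketch below assumes a hypothetical two-term instance in $R^{2}$; the vectors `omegas` and coefficients `coeffs` are illustrative choices. It runs gradient descent with step-size $1 / L$, compares the limiting gradient norm with $- f^{\infty}$ at the normalized negative gradient, and checks the weak-duality inequality on a grid of unit directions.

```python
import math

omegas = [(2.0, 1.0), (1.0, 2.0)]     # hypothetical exponent vectors; 0 is not in their hull
coeffs = [1.0, 3.0]                   # hypothetical positive coefficients

def grad_f(x):
    """Gradient of f(x) = log sum_l a_l exp(omega_l . x), via a stable softmax."""
    z = [math.log(a) + w[0] * x[0] + w[1] * x[1] for a, w in zip(coeffs, omegas)]
    top = max(z)
    e = [math.exp(v - top) for v in z]
    s = sum(e)
    return [sum(ei / s * w[k] for ei, w in zip(e, omegas)) for k in range(2)]

def recession(xi):
    """f^infty(xi) = max_l omega_l . xi for this class of functions."""
    return max(w[0] * xi[0] + w[1] * xi[1] for w in omegas)

L = max(w[0] ** 2 + w[1] ** 2 for w in omegas)
x = [0.0, 0.0]
for _ in range(400):
    g = grad_f(x)
    x = [x[0] - g[0] / L, x[1] - g[1] / L]

g = grad_f(x)
primal = math.hypot(g[0], g[1])                  # limiting gradient norm
xi_star = [-g[0] / primal, -g[1] / primal]       # candidate maximizer on the unit sphere
dual = -recession(xi_star)
weak_ok = all(
    -recession((math.cos(t), math.sin(t))) <= primal + 1e-9
    for t in (2.0 * math.pi * k / 360 for k in range(360))
)
print(primal, dual, weak_ok)
```

In this instance both values agree with $3 / \sqrt{2}$, the norm of the minimum-norm point $(3 / 2, 3 / 2)$ of the segment spanned by the two exponent vectors.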

We finally state one well-known important uniqueness property of minimizers of the moment-map norm over $\bar{G \cdot [v]}$:

Theorem 4.15

(Second Ness Uniqueness Theorem (Georgoulas et al. [24, Theorem 6.5])). For $ζ, ζ^{'} \in \bar{G \cdot [v]}$ , if $‖ μ (ζ) ‖ = ‖ μ (ζ^{'}) ‖ = \inf_{g \in G} ‖ μ (g \cdot [v]) ‖$ , then $ζ^{'} \in K \cdot ζ$ .

In the next subsection, we characterize such minimizers for the left-right action.

4.2. Operator Scaling and Its Gradient-Flow Limit

Let $A = (A_{1}, A_{2}, \dots, A_{N}) \in C^{N (n \times m)}$ be an N-tuple of $n \times m$ matrices over $C$ . Let $p \in R_{+}^{n}$ , $q \in R_{+}^{m}$ be nonnegative vectors with the same sum $\sum_{i} p_{i} = \sum_{j} q_{j}$ , where p, q are arranged as

p_{1} \geq p_{2} \geq \dots \geq p_{n}, q_{1} \leq q_{2} \leq \dots \leq q_{m} .

(4.15)

The operator scaling problem, originally introduced by Gurvits [27] for $p = q = 1$ and extended by Franks [18] for general p, q, is to ask: For a given accuracy $ϵ \geq 0$ , find $g \in G L_{n}, h \in G L_{m}$ such that

{‖ \sum_{ℓ = 1}^{N} g A_{ℓ} h^{†} h A_{ℓ}^{†} g^{†} - diag p ‖}^{2} + {‖ \sum_{ℓ = 1}^{N} h A_{ℓ}^{†} g^{†} g A_{ℓ} h^{†} - diag q ‖}^{2} \leq ϵ^{2},

(4.16)

where the norm is the Frobenius norm. A matrix tuple A is said to be (approximately) $(p, q)$-scalable if for every positive $ϵ > 0$ there are $g \in G L_{n}, h \in G L_{m}$ satisfying (4.16). If some g, h satisfy (4.16) for $ϵ = 0$, then A is called exactly $(p, q)$-scalable, and $g A h^{†}$ is called a $(p, q)$-scaling of A. The operator scaling is a quantum generalization of the matrix scaling, and turns out to have rich applications (see Franks [18], Garg and Oliveira [21], and Garg et al. [22, 23]). For simplicity, we assume that the left and right common kernels of A are both trivial: $\cap_{ℓ} \ker A_{ℓ} = {0}$ and $\cap_{ℓ} \ker A_{ℓ}^{†} = {0}$.

In view of the previous section, the operator scaling is interpreted as the moment polytope membership of the left-right action $π : S L_{n} \times S L_{m} \to G L (C^{N (n \times m)})$ defined by

π (g, h) (B) ≔ g B h^{†} = (g B_{1} h^{†}, g B_{2} h^{†}, \dots, g B_{N} h^{†}),

(4.17)

where $B = (B_{1}, B_{2}, \dots, B_{N}) \in C^{N (n \times m)}$. A maximal compact subgroup K of $S L_{n} \times S L_{m}$ is given by $S U_{n} \times S U_{m}$, and a K-invariant Hermitian product $〈, 〉$ on $V = C^{N (n \times m)}$ is given by $〈 B, C 〉 ≔ \sum_{ℓ = 1}^{N} tr B_{ℓ} C_{ℓ}^{†}$. From $Π (X, Y) (B) = X B + B Y^{†}$, we see that the moment map $μ : C^{N (n \times m)} \to p_{n}^{1} \times p_{m}^{1}$ is given by

μ (B) = (μ_{1} (B), μ_{2} (B)) = \frac{1}{‖ B ‖^{2}} (\sum_{ℓ = 1}^{N} B_{ℓ} B_{ℓ}^{†}, \sum_{ℓ = 1}^{N} B_{ℓ}^{†} B_{ℓ}) - (\frac{1}{n} I, \frac{1}{m} I) .

(4.18)

A positive Weyl chamber is taken as the set of diagonal matrices $(diag p, diag q)$ with p, q satisfying (4.15) and $1^{⊤} p = 1^{⊤} q = 0$ . We regard it as a subset of $R^{n} \times R^{m}$ . Then the moment polytope $Δ_{A}$ consists of vectors of eigenvalues of $μ (B)$ over $B \in \bar{S L_{n} \cdot A \cdot S L_{m}}$ (the closure of the $S L_{n} \times S L_{m}$ -orbit of A). Comparing (4.18) with (4.16), we have:

Lemma 4.16.

A is $(p, q)$ -scalable if and only if $(p / c - 1 / n, q / c - 1 / m)$ belongs to $Δ_{A}$ , where $c ≔ \sum_{i} p_{i} = \sum_{j} q_{j}$ .
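To make the moment map (4.18) concrete, the following sketch evaluates it for a single real matrix ($N = 1$), where $B^{†}$ is simply the transpose. The helper names and the two test matrices are hypothetical; the first matrix is a multiple of a $(1 / n, 1 / m)$-scaling, so its moment map vanishes, whereas the second has a zero row and a nonzero moment map.

```python
def transpose(P):
    return [list(col) for col in zip(*P)]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def moment_map(B):
    """(4.18) for a single real n x m matrix B (N = 1, dagger = transpose)."""
    n, m = len(B), len(B[0])
    norm2 = sum(v * v for row in B for v in row)
    Bt = transpose(B)
    BBt, BtB = matmul(B, Bt), matmul(Bt, B)
    mu1 = [[BBt[i][k] / norm2 - (1.0 / n if i == k else 0.0) for k in range(n)]
           for i in range(n)]
    mu2 = [[BtB[j][k] / norm2 - (1.0 / m if j == k else 0.0) for k in range(m)]
           for j in range(m)]
    return mu1, mu2

def squared_norm(mu1, mu2):
    return sum(v * v for M in (mu1, mu2) for row in M for v in row)

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

mu_id = moment_map([[1.0, 0.0], [0.0, 1.0]])     # hypothetical balanced instance
mu_deg = moment_map([[1.0, 1.0], [0.0, 0.0]])    # hypothetical instance with a zero row
print(squared_norm(*mu_id), squared_norm(*mu_deg))
```

Both components are traceless by construction, in agreement with the target space $p_{n}^{1} \times p_{m}^{1}$.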

We consider the operator scaling problem for the most basic case: $(p, q) = (1 / n, 1 / m)$ . Then, it holds that

A is (1 / n, 1 / m) - scalable \Leftrightarrow (0, 0) \in Δ_{A} \Leftrightarrow \inf_{g, h} ‖ μ (g A h^{†}) ‖ = 0 .

Accordingly, the Kempf-Ness theorem (Theorem 4.1) links with the $(1 / n, 1 / m)$ -scaling problem, and is sharpened as follows. Let $S_{A}$ denote the family of pairs of vector subspaces $X \subseteq C^{n}$ , $Y \subseteq C^{m}$ such that $u^{⊤} A_{ℓ} \bar{v} = 0$ for all $u \in X$ , $v \in Y$ , $ℓ \in [N]$ . This is (essentially) the same as the family of independent subspace pairs in Franks [18] and Franks et al. [19]. Although $S_{A}$ is an infinite set, it turns out in Lemma 4.22 that a certain maximal subset $E_{A}$ of $S_{A}$ is a finite set.

Theorem 4.17

(Characterization of Scalability (Gurvits [27])). The following are equivalent:

(i) $\inf_{g \in S L_{n}, h \in S L_{m}} ‖ g A h^{†} ‖ > 0$ .
(ii) A is $(1 / n, 1 / m)$ -scalable.
(iii) For all $(X, Y) \in S_{A}$ , it holds that $(1 / n) \dim X + (1 / m) \dim Y \leq 1$ .

This theorem was originally stated for the case $n = m$ , in which the condition (iii) is simply written as $\dim Y \leq \dim \sum_{ℓ = 1}^{N} A_{ℓ} Y$ for every subspace Y. A subspace violating this condition is called a shrunk subspace in Franks et al. [19], Garg et al. [23], and Ivanyos et al. [37, 38]. The $n \neq m$ generalization is straightforward and is included in more general results for the operator scaling with marginals by Franks [18].

A vector-space pair $(X, Y) \in S_{A}$ violating part (iii) actually gives rise to a destabilizing 1-PSG as follows: Choose $σ \in S U_{n}$ and $τ \in S U_{m}$ such that the first r rows of $σ$ span X and the first s rows of $τ$ span Y, where $(r, s) ≔ (\dim X, \dim Y)$ . Then one can see that $t \mapsto (e^{t diag (1_{[r]} - (r / n) 1)} σ, e^{t diag (1_{[s]} - (s / m) 1)} τ)$ is a destabilizing 1-PSG.
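The construction can be illustrated in the simplest situation $σ = I$, $τ = I$, $N = 1$, where X and Y are coordinate subspaces and membership in $S_{A}$ means that the top-left $r \times s$ block of the matrix vanishes. The sketch below assumes a hypothetical real $3 \times 3$ matrix with $r = s = 2$, so that $r / n + s / m = 4 / 3 > 1$, and evaluates the Frobenius norm of $e^{t diag a} A e^{t diag b}$ along the 1-PSG.

```python
import math

# Hypothetical instance: the top-left 2 x 2 block is zero, so (X, Y) with
# X = span(e_1, e_2) and Y = span(e_1, e_2) violates condition (iii).
A = [[0.0, 0.0, 1.0],
     [0.0, 0.0, 2.0],
     [3.0, 4.0, 5.0]]
n = m = 3
r = s = 2

# Exponents of the 1-PSG: a = 1_[r] - (r/n) 1 and b = 1_[s] - (s/m) 1.
a = [(1.0 if i < r else 0.0) - r / n for i in range(n)]
b = [(1.0 if j < s else 0.0) - s / m for j in range(m)]

def scaled_norm(t):
    """Frobenius norm of e^{t diag a} A e^{t diag b}."""
    return math.sqrt(sum((math.exp(t * (a[i] + b[j])) * A[i][j]) ** 2
                         for i in range(n) for j in range(m)))

norms = [scaled_norm(t) for t in (0.0, 5.0, 10.0, 20.0)]
# Largest exponent among nonzero entries; negative means the norm decays to 0.
rate = max(a[i] + b[j] for i in range(n) for j in range(m) if A[i][j] != 0.0)
print(norms, rate)
```

In this instance every nonzero entry carries a negative exponent, the largest being $1 - r / n - s / m = - 1 / 3$, so the norm decays exponentially in t.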

Further, the strict inequality in part (iii) brings exact scalability.

Theorem 4.18

(Exact Scalability (Gurvits [27])). If $(1 / n) \dim X + (1 / m) \dim Y < 1$ for all $(X, Y) \in S_{A}$ other than $({0}, C^{m})$ and $(C^{n}, {0})$ , then A is exactly $(1 / n, 1 / m)$ -scalable.

The exact case corresponds to the existence of g, h with $μ (g A h^{†}) = 0$ . By Lemma 4.2 part (2), this is the case where the Kempf-Ness function $f_{A}$ has an optimum (= a point of zero gradient). Then, Theorem 4.18 can be deduced from General Property (2.5) of the recession function $f_{A}^{\infty}$ (given explicitly in (4.21) below). Here, the Kempf-Ness function $f_{A} : P_{n}^{1} \times P_{m}^{1} \to R$ is written as

f_{A} (x, y) ≔ \log tr \sum_{ℓ = 1}^{N} x A_{ℓ} y A_{ℓ}^{†} (x \in P_{n}^{1}, y \in P_{m}^{1}) .

(4.19)

Lemma 4.19

(Bürgisser et al. [13]). $f_{A}$ is 2-smooth convex.

Now Theorem 4.3 (Corollary 3.3, or the moment-weight inequality (Theorem 4.14)) sharpens (ii) $\Leftrightarrow$ (iii) of Theorem 4.17 in the following min-max (inf-sup) form:

Theorem 4.20

(Duality Theorem for the Scalability Limit of Operator Scaling).

\begin{array}{l} \inf_{g, h} ‖ (\sum_{ℓ = 1}^{N} g A_{ℓ} h^{†} h A_{ℓ}^{†} g^{†} - \frac{1}{n} I, \sum_{ℓ = 1}^{N} h A_{ℓ}^{†} g^{†} g A_{ℓ} h^{†} - \frac{1}{m} I) ‖ \\ = \sup_{a, b, σ, τ} - \max {a_{i} + b_{j} ∣ \exists ℓ, {(σ A_{ℓ} τ^{†})}_{i j} \neq 0}, \end{array}

(4.20)

where the infimum in the left-hand side (LHS) is taken over all $g \in G L_{n}$ , $h \in G L_{m}$ with $‖ g A h^{†} ‖ = 1$ and the supremum in the right-hand side (RHS) is taken over all $σ \in S U_{n}$ , $τ \in S U_{m}$ , $a \in R^{n}$ , $b \in R^{m}$ with $‖ (a, b) ‖ \leq 1$ and $1^{⊤} a = 1^{⊤} b = 0$ .

Inspired by this formula, Hirai [32] obtained a cleaner formula by using the trace norm instead of the Frobenius norm.

Proof.

It suffices to show that $- f_{A}^{\infty}$ is equal to the objective function of the RHS in (4.20). Here $(G, H) \in p_{n}^{1} \times p_{m}^{1}$ is written as $(G, H) = (σ^{†} diag a σ, τ^{†} diag b τ)$ for $σ \in S U_{n}, τ \in S U_{m}$ , $a \in R^{n}$ , $b \in R^{m}$ with $1^{⊤} a = 1^{⊤} b = 0$ . Then we have

\begin{array}{l} f_{A}^{\infty} (G, H) = \lim_{t \to \infty} \frac{1}{t} \log tr \sum_{ℓ} e^{t G} A_{ℓ} e^{t H} A_{ℓ}^{†} = \lim_{t \to \infty} \frac{1}{t} \log \sum_{ℓ, i, j} | {(σ A_{ℓ} τ^{†})}_{i j} |^{2} e^{t (a_{i} + b_{j})} \\ = \max {a_{i} + b_{j} ∣ \exists ℓ, {(σ A_{ℓ} τ^{†})}_{i j} \neq 0}, \end{array}

(4.21)

where we used $\lim_{t \to \infty} \frac{1}{t} \log \sum_{k} e^{c_{k} + t d_{k}} = \max_{k} d_{k}$ in the last equality. □
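The limit formula used in the last step can be observed numerically. The sketch below restricts to the diagonal case $σ = I$, $τ = I$ with a single hypothetical real matrix, so that $tr e^{t G} A e^{t H} A^{†} = \sum_{i, j} | A_{i j} |^{2} e^{t (a_{i} + b_{j})}$; the matrix entries and the vectors a, b are illustrative assumptions.

```python
import math

A = [[0.0, 5.0], [2.0, 3.0]]      # hypothetical matrix with a zero (1, 1) entry
a = [0.5, -0.5]                   # 1^T a = 0
b = [-0.3, 0.3]                   # 1^T b = 0
support = [(i, j) for i in range(2) for j in range(2) if A[i][j] != 0.0]

def slope(t):
    """(1/t) log sum_{ij} |A_ij|^2 e^{t (a_i + b_j)}, computed stably."""
    terms = [2.0 * math.log(abs(A[i][j])) + t * (a[i] + b[j]) for i, j in support]
    top = max(terms)
    return (top + math.log(sum(math.exp(v - top) for v in terms))) / t

limit = max(a[i] + b[j] for i, j in support)      # right-hand side of (4.21)
gaps = [slope(t) - limit for t in (10.0, 100.0, 1000.0)]
print(limit, gaps, slope(1e6))
```

Only entries in the support of A contribute: the pair $(1, 1)$ would give $a_{1} + b_{1} = 0.2$, but it is excluded because $A_{1 1} = 0$, and the maximum $0.8$ is attained at the pair $(1, 2)$.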

In the sequel, we assume that A is not $(1 / n, 1 / m)$ -scalable, and analyze the asymptotic behavior of gradient descent for $f_{A}$ :

(x_{k + 1}, y_{k + 1}) = \exp_{x_{k}, y_{k}} (- \frac{1}{L} \nabla f_{A} (x_{k}, y_{k})), (x_{0}, y_{0}) = (I, I),

(4.22)

where we let $L ≔ 2$ by Lemma 4.19. The corresponding group update (4.6) in $S L_{n} \times S L_{m}$ is given by

(g_{k + 1}, h_{k + 1}) = (e^{- \frac{1}{2 L} μ_{1} (g_{k} A h_{k}^{†})} g_{k}, e^{- \frac{1}{2 L} μ_{2} (g_{k} A h_{k}^{†})} h_{k}), (g_{0}, h_{0}) = (I, I) .

(4.23)

Then $(x_{k}, y_{k}) = (g_{k}^{†} g_{k}, h_{k}^{†} h_{k})$ by Lemma 4.4. We address the following problem.

Problem 4.21.

Characterize the following (A), (B), and (C):

(A) The limit of $spec μ (g_{k} A h_{k}^{†})$ (= the minimum-norm point of $Δ_{A}$ ).
(B) The limit of $(x_{k}, y_{k})$ in cone topology (= the unique minimizer of $f_{A}^{\infty}$ ).
(C) The limit of $[g_{k} A h_{k}^{†}]$ in $P (C^{N (n \times m)})$ (= the minimizer of the moment-map norm $‖ μ ‖$ over $\bar{[S L_{n} \cdot A \cdot S L_{m}]}$ ).

We show that these are characterized by a certain simultaneous block-triangular form of A. This block-triangular form is a vector-space generalization of the classical Dulmage-Mendelsohn decomposition (Dulmage and Mendelsohn [17]) for a bipartite graph and its associated matrix. We introduce our generalized DM-decomposition in a way analogous to Hayashi et al. [29, section 3] for the classical setting, where the essential idea of the construction can be partly found in Ito et al. [36]. Iwamasa et al. [39] pointed out that our DM-decomposition is a special case of the Harder-Narasimhan filtration for generalized Kronecker quivers.

Recall the family $S_{A}$ defined before Theorem 4.17. Define a map $ϕ : S_{A} \to R_{+}^{2}$ by

ϕ (X, Y) ≔ (\dim X, \dim Y) ((X, Y) \in S_{A}) .

Consider the convex hull $Conv ϕ (S_{A}) \subseteq R_{+}^{2}$ ; see the left of Figure 1. Let $E_{A}$ denote the subset of $(X, Y) \in S_{A}$ such that $ϕ (X, Y)$ is an extreme point of $Conv ϕ (S_{A})$ not equal to $(0, 0)$ .

**Figure 1. $Conv ϕ (S_{A})$ in $(y, x)$ -plane (left) and a DM-decomposition of A (right). The slope $n_{α} / m_{α}$ is increasing by the convexity of $Conv ϕ (S_{A})$ .**
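For intuition, the extreme points can be computed by brute force in the classical bipartite analogue, in which only coordinate subspaces are considered: a pair of index sets (rows, columns) is admissible when the corresponding submatrix is zero. This restriction is an assumption made for illustration only, since $S_{A}$ itself ranges over all subspace pairs. The matrix below is a hypothetical $3 \times 3$ zero pattern, and the function names are illustrative.

```python
from itertools import combinations

def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def extreme_points(points):
    """Convex-hull vertices by Andrew's monotone chain; collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        hull = []
        for p in seq:
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]

    return half(pts) + half(reversed(pts))

def dimension_pairs(A):
    """All (|rows|, |cols|) such that the submatrix A[rows, cols] is zero."""
    n, m = len(A), len(A[0])
    pairs = set()
    for r in range(n + 1):
        for rows in combinations(range(n), r):
            zero_cols = [j for j in range(m) if all(A[i][j] == 0 for i in rows)]
            pairs.update((r, s) for s in range(len(zero_cols) + 1))
    return pairs

A = [[1, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]                     # hypothetical zero pattern
vertices = [v for v in extreme_points(dimension_pairs(A)) if v != (0, 0)]
print(sorted(vertices))
```

For this pattern the nonzero extreme points are $(3, 0)$, $(2, 2)$, and $(0, 3)$. The middle point corresponds to the zero block on the last two rows and columns and satisfies $2 / 3 + 2 / 3 > 1$.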

Lemma 4.22.

For $(X, Y), (X^{'}, Y^{'}) \in E_{A}$ , if $\dim X \leq \dim X^{'}$ and $\dim Y \geq \dim Y^{'}$ , then $X \subseteq X^{'}$ and $Y \supseteq Y^{'}$ . In particular, $E_{A}$ is a finite set, and $ϕ$ is injective on $E_{A}$ .

Proof.

We may suppose that $ϕ (X, Y)$ and $ϕ (X^{'}, Y^{'})$ are equal or on an adjacent pair of extreme points. Observe $(X \cap X^{'}, Y + Y^{'}), (X + X^{'}, Y \cap Y^{'}) \in S_{A}$ . By the dimension identity of vector spaces, it holds that

ϕ (X \cap X^{'}, Y + Y^{'}) + ϕ (X + X^{'}, Y \cap Y^{'}) = ϕ (X, Y) + ϕ (X^{'}, Y^{'}) .

(4.24)

We claim that $X^{'} = X + X^{'}$ and $Y^{'} = Y \cap Y^{'}$ , which implies the statement. Otherwise, by (4.24), $ϕ (X \cap X^{'}, Y + Y^{'})$ or $ϕ (X + X^{'}, Y \cap Y^{'})$ goes beyond $Conv ϕ (S_{A})$ , which contradicts $(X \cap X^{'}, Y + Y^{'}), (X + X^{'}, Y \cap Y^{'}) \in S_{A}$ . □

Therefore, $E_{A} = {(X_{α}, Y_{α})}_{α = 0}^{θ}$ can be arranged as

\begin{array}{l} C^{n} = X_{0} \supset X_{1} \supset \dots \supset X_{θ} = {0}, \\ {0} = Y_{0} \subset Y_{1} \subset \dots \subset Y_{θ} = C^{m}, \end{array}

(4.25)

where $C^{n} \neq X_{1}$ and $Y_{θ - 1} \neq C^{m}$ follow from the assumption that the common left and right kernels of A are trivial. For each $α \in [θ]$, let $L_{A}^{α}$ denote the subset consisting of $(X, Y) \in S_{A}$ such that $ϕ (X, Y)$ belongs to the edge between $ϕ (X_{α - 1}, Y_{α - 1})$ and $ϕ (X_{α}, Y_{α})$. As in the proof of Lemma 4.22, we have:

Lemma 4.23.

If $(X, Y), (X^{'}, Y^{'}) \in L_{A}^{α}$ , then $(X + X^{'}, Y \cap Y^{'}), (X \cap X^{'}, Y + Y^{'}) \in L_{A}^{α}$ . In particular, $L_{A}^{α}$ is a modular lattice with respect to the partial order $(X, Y) ⪯ (X^{'}, Y^{'}) \Leftrightarrow X \supseteq X^{'}, Y \subseteq Y^{'}$ , where the minimum and maximum elements are given by $(X_{α - 1}, Y_{α - 1})$ and $(X_{α}, Y_{α})$ , respectively.

For each $α \in [θ]$ , consider a maximal chain (flag) of $L_{A}^{α}$ :

\begin{array}{l} X_{α - 1} = X_{α, 0} \supset X_{α, 1} \supset \dots \supset X_{α, θ_{α}} = X_{α}, \\ Y_{α - 1} = Y_{α, 0} \subset Y_{α, 1} \subset \dots \subset Y_{α, θ_{α}} = Y_{α}, \end{array}

where the length $θ_{α}$ of the chain is uniquely determined by the Jordan-Dedekind chain condition. The union $\cup_{α = 1}^{θ} \cup_{β = 0}^{θ_{α}} {(X_{α, β}, Y_{α, β})}$ is a maximal chain of the whole lattice $L_{A} ≔ \cup_{α = 1}^{θ} L_{A}^{α}$, and is called a DM-flag. Its subset $E_{A}$ is called the coarse DM-flag, which is uniquely determined by A. From a DM-flag, we obtain a simultaneous block-upper-triangular form of A as follows. Consider $g \in G L_{n}$ including, as row vectors, a basis of $X_{α, β}$ for each $α, β$. Similarly, consider $h \in G L_{m}$ including, as row vectors, a basis of $Y_{α, β}$ for each $α, β$. Suppose that they are positioned in the last rows for g and first rows for h. Then, the matrices $B_{ℓ} = g A_{ℓ} h^{†}$ are simultaneously block-triangularized, as in the right of Figure 1. We call $B = (B_{ℓ})$ a DM-decomposition⁶ of A. When g (resp. h) is restricted to span only $X_{α}$ (resp. $Y_{α}$), it is called a coarse DM-decomposition of A.

By abuse of notation, $X_{α}$ , $X_{α, β}$ , $Y_{α}$ , and $Y_{α, β}$ also denote the index sets of the corresponding rows and columns of B. Define ordered partitions $(I_{α})$ of [n], $(J_{α})$ of [m], and their refinements $(I_{α, β})$ , $(J_{α, β})$ by

I_{α} ≔ X_{α - 1} ∖ X_{α}, J_{α} ≔ Y_{α} ∖ Y_{α - 1} (α \in [θ]),

(4.26)

I_{α, β} ≔ X_{α, β - 1} ∖ X_{α, β}, J_{α, β} ≔ Y_{α, β} ∖ Y_{α, β - 1} (β \in [θ_{α}]) .

(4.27)

Let $\hat{B} = ({\hat{B}}_{ℓ})$ denote the matrix tuple of block-diagonal matrices obtained from $B_{ℓ}$ by replacing each (upper) off-diagonal block $B_{ℓ} [I_{α, β}, J_{α^{'}, β^{'}}]$ $((α, β) \neq (α^{'}, β^{'}))$ with the zero matrix. We call $\hat{B}$ a diagonalized DM-decomposition of A. A diagonalized version of a coarse DM-decomposition is defined analogously.

Let $n_{α} ≔ | I_{α} |$ and $m_{α} ≔ | J_{α} |$ . By convexity of $Conv ϕ (S_{A})$ , it holds that

\frac{n_{1}}{m_{1}} < \frac{n_{2}}{m_{2}} < \dots < \frac{n_{θ}}{m_{θ}} .

(4.28)

Define $(p^{*}, q^{*}) \in R^{n} \times R^{m}$ by

p^{*} ≔ - \frac{1}{n} 1 + \frac{1}{C_{A}} \sum_{α = 1}^{θ} \frac{m_{α}}{n_{α} + m_{α}} 1_{I_{α}}, q^{*} ≔ - \frac{1}{m} 1 + \frac{1}{C_{A}} \sum_{α = 1}^{θ} \frac{n_{α}}{n_{α} + m_{α}} 1_{J_{α}},

(4.29)

where the constant $C_{A}$ is defined by

C_{A} ≔ \sum_{α = 1}^{θ} \frac{n_{α} m_{α}}{n_{α} + m_{α}} \leq \frac{n m}{n + m},

(4.30)

where the inequality is seen from the concavity of the harmonic mean $(x, y) \mapsto 2 {(1 / x + 1 / y)}^{- 1}$. We see from (4.28)–(4.30) that $(p^{*}, q^{*})$ belongs to the positive Weyl chamber:

p_{1}^{*} \geq p_{2}^{*} \geq \dots \geq p_{n}^{*}, q_{1}^{*} \leq q_{2}^{*} \leq \dots \leq q_{m}^{*}, 1^{⊤} p^{*} = 1^{⊤} q^{*} = 0 .

(4.31)
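As a sanity check of (4.28)–(4.31), the following sketch evaluates $C_{A}$ , $p^{*}$ , and $q^{*}$ in exact rational arithmetic from a list of coarse block sizes $(n_{α}, m_{α})$ . The block sizes and the function name `min_norm_candidate` are hypothetical; they are chosen only so that (4.28) holds and are not taken from the paper.

```python
from fractions import Fraction


def min_norm_candidate(blocks):
    """Evaluate C_A, p*, q* of (4.29)-(4.30) from block sizes [(n_1, m_1), ...].

    The input is assumed to satisfy (4.28): n_a / m_a strictly increasing.
    """
    n = sum(na for na, _ in blocks)
    m = sum(ma for _, ma in blocks)
    c_a = sum(Fraction(na * ma, na + ma) for na, ma in blocks)
    p, q = [], []
    for na, ma in blocks:
        # Entries are constant on each index block I_a (rows) and J_a (columns).
        p += [Fraction(-1, n) + Fraction(ma, na + ma) / c_a] * na
        q += [Fraction(-1, m) + Fraction(na, na + ma) / c_a] * ma
    return c_a, p, q


# Hypothetical block sizes with ratios 1/2 < 1 < 2, so n = m = 5.
blocks = [(1, 2), (2, 2), (2, 1)]
c_a, p, q = min_norm_candidate(blocks)
n, m = len(p), len(q)

assert c_a <= Fraction(n * m, n + m)                 # inequality in (4.30)
assert all(p[i] >= p[i + 1] for i in range(n - 1))   # p* is nonincreasing
assert all(q[j] <= q[j + 1] for j in range(m - 1))   # q* is nondecreasing
assert sum(p) == 0 and sum(q) == 0                   # last equation of (4.31)
print(c_a, p, q)
```

Exact fractions are used so that the equalities in (4.31) can be tested without rounding tolerance.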

Recalling $P_{n}^{1} ≔ P_{n} \cap S L_{n}$ , define $(G^{*}, H^{*}) \in p_{n}^{1} \times p_{m}^{1} = T_{I, I} (P_{n}^{1} \times P_{m}^{1})$ by

G^{*} ≔ {(σ^{*})}^{†} diag (- p^{*}) σ^{*}, H^{*} ≔ {(τ^{*})}^{†} diag (- q^{*}) τ^{*},

(4.32)

where $σ^{*}$ is a unitary matrix having a basis of $X_{α}$ in the last $n_{α}$ rows and $τ^{*}$ is a unitary matrix having a basis of $Y_{α}$ in the first $m_{α}$ rows. By using these notions, we give a solution to Problem 4.21 parts (A) and (B):

Theorem 4.24.

(1) $(p^{*}, q^{*})$ is the minimum-norm point of $Δ_{A}$ , and
(2) $(G^{*}, H^{*}) / ‖ (G^{*}, H^{*}) ‖$ is the unique minimizer of $f_{A}^{\infty}$ over $S_{I, I} (P_{n}^{1} \times P_{m}^{1})$ , where it holds that

‖ (p^{*}, q^{*}) ‖^{2} = - f_{A}^{\infty} (G^{*}, H^{*}) = \frac{1}{C_{A}} - \frac{1}{n} - \frac{1}{m} .

(4.33)

Corollary 4.25.

Let $(g_{k}, h_{k})$ and $(x_{k}, y_{k})$ be the sequences in (4.23) and (4.22), respectively.

(1) $spec μ (g_{k} A h_{k}^{†})$ converges to $(p^{*}, q^{*})$ for $k \to \infty$ .
(2) $(x_{k}, y_{k})$ converges, in cone topology, to $(G^{*}, H^{*}) / ‖ (G^{*}, H^{*}) ‖$ . More precisely, the sequence $(G_{k}, H_{k})$ defined by $(x_{k}, y_{k}) = (e^{k G_{k} / L}, e^{k H_{k} / L})$ converges to $(G^{*}, H^{*})$ for $k \to \infty$ .

Proof of Theorem 4.24.

We first show (4.33). From the definitions of $(p^{*}, q^{*})$ and $C_{A}$ , we have

‖ (p^{*}, q^{*}) + (1 / n, 1 / m) ‖^{2} = \frac{1}{C_{A}^{2}} \sum_{α = 1}^{θ} \left( \frac{n_{α} m_{α}^{2}}{{(n_{α} + m_{α})}^{2}} + \frac{m_{α} n_{α}^{2}}{{(n_{α} + m_{α})}^{2}} \right) = \frac{1}{C_{A}^{2}} \sum_{α = 1}^{θ} \frac{n_{α} m_{α}}{n_{α} + m_{α}} = \frac{1}{C_{A}} .

By the last equation in (4.31), we have

‖ (p^{*}, q^{*}) ‖^{2} = ‖ (p^{*}, q^{*}) + (1 / n, 1 / m) ‖^{2} - ‖ (1 / n, 1 / m) ‖^{2} = 1 / C_{A} - 1 / n - 1 / m (> 0) .

On the other hand, $B = σ^{*} A {(τ^{*})}^{†}$ is a coarse DM-decomposition, that is, ${(σ^{*} A_{ℓ} {(τ^{*})}^{†})}_{i j} = 0$ for each $(i, j) \in I_{α} \times J_{α^{'}}$ with $α > α^{'}$ . By (4.21) in the proof of Theorem 4.20, the value of the recession function $f_{A}^{\infty} (G^{*}, H^{*})$ is given by

f_{A}^{\infty} (G^{*}, H^{*}) = \max {- p_{i}^{*} - q_{j}^{*} ∣ \exists ℓ, (i, j) \in I_{α} \times J_{α^{'}} : α \leq α^{'}, {(σ^{*} A_{ℓ} {(τ^{*})}^{†})}_{i j} \neq 0} .

(4.34)

Observe from (4.28)–(4.30) that

- p_{i}^{*} - q_{j}^{*} \begin{cases} = 1 / n + 1 / m - 1 / C_{A} & \text{if } (i, j) \in I_{α} \times J_{α}, \\ < 1 / n + 1 / m - 1 / C_{A} & \text{if } (i, j) \in I_{α} \times J_{α^{'}} : α < α^{'}, \\ > 1 / n + 1 / m - 1 / C_{A} & \text{if } (i, j) \in I_{α} \times J_{α^{'}} : α > α^{'} . \end{cases}

(4.35)
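The three cases of (4.35) can be read off from the definition (4.29). The following worked computation is a sketch of that step, using only (4.28) and (4.29); for $(i, j) \in I_{α} \times J_{α^{'}}$ :

```latex
\begin{aligned}
- p_{i}^{*} - q_{j}^{*}
  &= \frac{1}{n} + \frac{1}{m}
     - \frac{1}{C_{A}} \left( \frac{m_{α}}{n_{α} + m_{α}}
       + \frac{n_{α^{'}}}{n_{α^{'}} + m_{α^{'}}} \right), \\
\frac{m_{α}}{n_{α} + m_{α}} + \frac{n_{α^{'}}}{n_{α^{'}} + m_{α^{'}}}
  &= 1 + \left( \frac{n_{α^{'}}}{n_{α^{'}} + m_{α^{'}}}
       - \frac{n_{α}}{n_{α} + m_{α}} \right) .
\end{aligned}
```

By (4.28), the quantity $n_{α} / (n_{α} + m_{α}) = 1 / (1 + m_{α} / n_{α})$ is strictly increasing in $α$ , so the bracket in the second line is zero for $α = α^{'}$ , positive for $α < α^{'}$ , and negative for $α > α^{'}$ . Substituting into the first line gives the three cases.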

Hence, the maximum in (4.34) is attained by the index of any nonzero element of any diagonal block of $σ^{*} A_{ℓ} {(τ^{*})}^{†}$ , which implies $f_{A}^{\infty} (G^{*}, H^{*}) = 1 / n + 1 / m - 1 / C_{A}$ , and (4.33).

To complete the proof, it suffices to show $(p^{*}, q^{*}) \in Δ_{A}$ because $(p^{*}, q^{*})$ and $(G^{*}, H^{*}) / ‖ (G^{*}, H^{*}) ‖$ would attain $\inf_{(p, q) \in Δ_{A}} ‖ (p, q) ‖ = \sup_{(G, H) \in B_{I, I}} - f_{A}^{\infty} (G, H)$ . This is done in the next proposition. □

Proposition 4.26.

Let $\hat{B}$ be a diagonalized DM-decomposition of A.

(1) $\hat{B}$ is exactly $(p^{*} + 1 / n, q^{*} + 1 / m)$ -scalable.
(2) $[\hat{B}] \in \bar{[S L_{n} \cdot A \cdot S L_{m}]}$ .

In particular, it holds that $(p^{*}, q^{*}) \in Δ_{A}$ .

Proof.

Part (1). We first show:

Claim. $B [I_{α, β}, J_{α, β}]$ is exactly $(1 / | I_{α, β} |, 1 / | J_{α, β} |)$ -scalable.

Proof of Claim.

We can assume that A is already equal to a DM-decomposition B, where all $X_{α, β}$ , $Y_{α, β}$ are coordinate subspaces. Suppose, to the contrary, that $B [I_{α, β}, J_{α, β}]$ is not exactly $(1 / | I_{α, β} |, 1 / | J_{α, β} |)$ -scalable. Then, by Theorem 4.18, there is a nontrivial $(Z, W) \in S_{B [I_{α, β}, J_{α, β}]}$ such that $(1 / | I_{α, β} |) \dim Z + (1 / | J_{α, β} |) \dim W \geq 1$ . Then $(X_{α, β} + Z, Y_{α, β - 1} + W)$ belongs to $S_{A}$ . However, $ϕ (X_{α, β} + Z, Y_{α, β - 1} + W)$ goes beyond $Conv ϕ (S_{A})$ or lies on the interior of the segment between $ϕ (X_{α, β - 1}, Y_{α, β - 1})$ and $ϕ (X_{α, β}, Y_{α, β})$ . The former case is obviously impossible. The latter case is also impossible because of the maximality of the chain ${(X_{α, β}, Y_{α, β})}$ in $L_{A}$ . □

We observe from $n_{α} / m_{α} = | I_{α, β} | / | J_{α, β} |$ that $(m_{α}, n_{α})$ is a constant multiple of $(1 / | I_{α, β} |, 1 / | J_{α, β} |)$ . By the claim, for each $α, β$ , we can choose scaling matrices $g_{α, β}, h_{α, β}$ to make $B [I_{α, β}, J_{α, β}]$ an exact $(1 / {C_{A} (n_{α} + m_{α})}) (m_{α} 1, n_{α} 1)$ -scaling. Then, for $g ≔ \oplus_{α, β} g_{α, β}$ , $h ≔ \oplus_{α, β} h_{α, β}$ , the scaling $g \hat{B} h^{†}$ is a desired $(p^{*} + 1 / n, q^{*} + 1 / m)$ -scaling.

Part (2). Let B be a DM-decomposition of A, where $B \in S L_{n} \cdot A \cdot S L_{m}$ . For each $α, β$ and $t > 0$ , by $B [X_{α, β}, Y_{α, β}] = O$ , it holds that

{(e^{t diag 1_{X_{α, β}}} B e^{t diag 1_{Y_{α, β}} - 1})}_{i j} = \begin{cases} B_{i j} e^{- t} & \text{if } i \notin X_{α, β}, j \notin Y_{α, β}, \\ B_{i j} & \text{otherwise} . \end{cases}

(4.36)

Let $R ≔ \sum_{α, β} | X_{α, β} | / n$ and $S ≔ \sum_{α, β} (| Y_{α, β} | - m) / m$ . For $t > 0$ , define $a_{t} \in S L_{n}$ and $b_{t} \in S L_{m}$ by

a_{t} ≔ e^{- t R} e^{t diag \sum_{α, β} 1_{X_{α, β}}}, b_{t} ≔ e^{- t S} e^{t diag \sum_{α, β} 1_{Y_{α, β}} - 1} .

By (4.36), the scaling $a_{t} B b_{t}$ is written as

a_{t} B b_{t} = e^{- (R + S) t} (\hat{B} + E_{t})

for the diagonalized DM-decomposition $\hat{B}$ of B and a matrix $E_{t}$ converging to zero for $t \to \infty$. This implies that $\lim_{t \to \infty} [a_{t} B b_{t}] = \lim_{t \to \infty} [\hat{B} + E_{t}] = [\hat{B}] \in \bar{[S L_{n} \cdot A \cdot S L_{m}]}$. Because $\hat{B}$ admits an exact $(p^{*} + 1 / n, q^{*} + 1 / m)$-scaling, $B^{*} = g \hat{B} h^{†}$, by Lemma 4.16 and $1^{⊤} p^{*} = 1^{⊤} q^{*} = 0$, we conclude that $(p^{*}, q^{*}) \in Δ_{A}$. □
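To see the scaling of Part (2) at work, consider a hypothetical $2 \times 2$ upper-triangular B (a single matrix, $N = 1$ ) with nonzero diagonal entries. The example is illustrative and not taken from the paper. It assumes that the sums over $(α, β)$ range over the three distinct members of the DM-flag, given in coordinates by the row sets ${1, 2} \supset {2} \supset \emptyset$ and the column sets $\emptyset \subset {1} \subset {1, 2}$ :

```latex
\begin{aligned}
B &= \begin{pmatrix} b_{11} & b_{12} \\ 0 & b_{22} \end{pmatrix}, \qquad
R = \frac{2 + 1 + 0}{2} = \frac{3}{2}, \qquad
S = \frac{(0 - 2) + (1 - 2) + (2 - 2)}{2} = - \frac{3}{2}, \\
a_{t} &= e^{- 3 t / 2} \operatorname{diag} (e^{t}, e^{2 t})
       = \operatorname{diag} (e^{- t / 2}, e^{t / 2}), \qquad
b_{t} = e^{3 t / 2} \operatorname{diag} (e^{- t}, e^{- 2 t})
       = \operatorname{diag} (e^{t / 2}, e^{- t / 2}), \\
a_{t} B b_{t} &= \begin{pmatrix} b_{11} & e^{- t} b_{12} \\ 0 & b_{22} \end{pmatrix}
\xrightarrow[t \to \infty]{}
\begin{pmatrix} b_{11} & 0 \\ 0 & b_{22} \end{pmatrix} = \hat{B} .
\end{aligned}
```

Here $R + S = 0$ , both $a_{t}$ and $b_{t}$ have determinant one, and $E_{t}$ consists of the single entry $e^{- t} b_{12}$ , so the diagonal blocks are untouched while the upper off-diagonal block decays.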

Now the sequence of the scaled matrices along the gradient-descent trajectory accumulates to the $S U_{n} \times S U_{m}$ -orbit of a diagonalized DM-decomposition $\hat{B}$ , providing a (partial) solution of Problem 4.21 part (C):

Theorem 4.27.

Let $\hat{B}$ be a diagonalized DM-decomposition of A, and let $B^{*}$ be a $(p^{*} + 1 / n, q^{*} + 1 / m)$ -scaling of $\hat{B}$ . Then $[g_{k} A h_{k}^{†}]$ accumulates to points in $[S U_{n} \cdot B^{*} \cdot S U_{m}]$ for $k \to \infty$ .

Proof.

It holds that $μ (B^{*}) = (diag p^{*}, diag q^{*})$ . Thus, $B^{*}$ attains the infimum of $‖ μ (B) ‖$ over $[B] \in \bar{[S L_{n} \cdot A \cdot S L_{m}]}$ , which is also the limit of $‖ μ (g_{k} A h_{k}^{†}) ‖$ . By the second Ness uniqueness theorem (Theorem 4.15), we have the claim. □

For the gradient flow $(g (t), h (t))$ of the Kempf-Ness function $F_{A}$ on the group $S L_{n} \times S L_{m}$ , because of the convergence theorem (Theorem 4.8), $[g (t) A h {(t)}^{†}]$ converges to a point $σ B^{*} τ^{†}$ for some $σ \in S U_{n}$ , $τ \in S U_{m}$ .

Although $B^{*}$ is also a diagonalized DM-decomposition of A, it is not clear how to remove the unitary indeterminacy from $[g_{k} A h_{k}^{†}]$ and to extract the DM-structure of $B^{*}$ . This is possible for the coarse DM-structure as follows:

Theorem 4.28.

Let $(G_{k}, H_{k})$ be the sequence defined by $(x_{k}, y_{k}) = (e^{k G_{k} / L}, e^{k H_{k} / L})$ . Suppose that $G_{k} = σ_{k}^{†} diag a^{k} σ_{k}$ and $H_{k} = τ_{k}^{†} diag b^{k} τ_{k}$ for unitary matrices $σ_{k}$ , $τ_{k}$ and nondecreasing and nonincreasing vectors $a^{k}$ and $b^{k}$ , respectively. Then $σ_{k} A τ_{k}^{†}$ accumulates to coarse DM-decompositions. The convergence is linear in the following sense: there are $c > 0$ , $M > 0$ such that for all $k \geq M$ , $ℓ \in [N]$ it holds that

| {(σ_{k} A_{ℓ} τ_{k}^{†})}_{i j} | \leq e^{- c k} ((i, j) \in I_{α} \times J_{α^{'}} : α > α^{'}) .

Proof.

By Theorems 3.7 and 4.24 and Lemma 3.9 part (1), it holds that

\begin{array}{l} - \frac{1}{L} (\frac{1}{C_{A}} - \frac{1}{n} - \frac{1}{m}) = \lim_{k \to \infty} - \frac{‖ \nabla f_{A} (x_{k}, y_{k}) ‖^{2}}{L} \\ = \lim_{k \to \infty} f_{A} (x_{k + 1}, y_{k + 1}) - f_{A} (x_{k}, y_{k}) = \lim_{k \to \infty} \frac{f_{A} (x_{k}, y_{k})}{k}, \end{array}

(4.37)

where the final equality follows from (2.1) for $a_{k} ≔ f_{A} (x_{k + 1}, y_{k + 1}) - f_{A} (x_{k}, y_{k})$. Because $e^{f_{A} (x_{k}, y_{k})} = tr \sum_{ℓ} x_{k} A_{ℓ} y_{k} A_{ℓ}^{†} = \sum_{ℓ, i, j} | {(σ_{k} A_{ℓ} τ_{k}^{†})}_{i j} |^{2} e^{(a_{i}^{k} + b_{j}^{k}) k / L}$, we have

\sum_{ℓ, i, j} | {(σ_{k} A_{ℓ} τ_{k}^{†})}_{i j} |^{2} e^{(a_{i}^{k} + b_{j}^{k}) k / L - f_{A} (x_{k}, y_{k})} = 1 .

Suppose that the index $(i, j)$ is in a lower triangular block. By $(a^{k}, b^{k}) \to_{k \to \infty} - (p^{*}, q^{*})$ (Corollary 4.25 part (2)) and (4.37), it holds that

\frac{(a_{i}^{k} + b_{j}^{k}) k / L - f_{A} (x_{k}, y_{k})}{k} \underset{k \to \infty}{\to} \frac{1}{L} (- p_{i}^{*} - q_{j}^{*} - \frac{1}{n} - \frac{1}{m} + \frac{1}{C_{A}}) > 0,

where the inequality follows from (4.35). Therefore, for some $c^{'} > 0$ and $M^{'} > 0$, it holds that $(a_{i}^{k} + b_{j}^{k}) k / L - f_{A} (x_{k}, y_{k}) \geq c^{'} k$ for all $k > M^{'}$. Then $| {(σ_{k} A_{ℓ} τ_{k}^{†})}_{i j} |^{2} e^{c^{'} k} \leq 1$ for all $k \geq M^{'}$. □

Remark 4.29.

Suppose that $μ (x_{k}^{1 / 2} A y_{k}^{1 / 2})$ converges, or more strongly, the convergence of Question 3.12 is true. Then it holds that $\lim_{k \to \infty} ‖ μ (x_{k}^{1 / 2} A y_{k}^{1 / 2}) + (G_{k}, H_{k}) ‖ = 0 .$ This implies

\lim_{k \to \infty} ‖ μ (e^{diag a^{k} / 2} σ_{k} A τ_{k}^{†} e^{diag b^{k} / 2}) + (diag a^{k}, diag b^{k}) ‖ = 0 .

(4.38)

Because $(a^{k}, b^{k}) \to - (p^{*}, q^{*})$ , the scaling sequence $A^{(k)} ≔ (e^{diag a^{k} / 2} σ_{k} A τ_{k}^{†} e^{diag b^{k} / 2}) / ‖ g_{k} A h_{k} ‖$ accumulates to $(p^{*} + 1 / n, q^{*} + 1 / m)$ -scalings. From the coarse DM-structure of $σ_{k} A τ_{k}^{†}$ in the limit, one can see that $A^{(k)}$ accumulates to diagonalized coarse DM-decompositions. Although our numerical experiment supports such convergence, our results imply only $\liminf_{k \to \infty} = 0$ in (4.38).

We end this subsection with some implications of these results.

4.2.1. On Finding a Destabilizing 1-PSG.

Suppose that A is not $(1 / n, 1 / m)$ -scalable. Consider $(X^{*}, Y^{*}) \in E_{A}$ mapped to the extreme point $(r^{*}, s^{*})$ of $Conv ϕ (S_{A})$ with the property that it maximizes r among all extreme points $(r, s)$ maximizing $r + s$ . The subspace pair $(X^{*}, Y^{*})$ violates (iii) in Theorem 4.17 and is a special certificate of unscalability, called dominant in Franks et al. [19]. By Theorem 4.28, after a large number k of iterations, the last $r^{*}$ rows of $σ_{k}$ and the first $s^{*}$ rows of $τ_{k}$ become bases of an $ϵ$ -approximate dominant pair $(X_{ϵ}^{*}, Y_{ϵ}^{*})$ in the sense that $| u^{⊤} A_{ℓ} \bar{v} | \leq ϵ$ for all $ℓ$ and all unit vectors $u \in X_{ϵ}^{*}, v \in Y_{ϵ}^{*}$ . Franks et al. [19] devised a procedure to round such an $e^{- p (n, m, N, b)}$ -approximate dominant pair into the exact dominant pair $(X^{*}, Y^{*})$ , where p is a polynomial and b is the bit complexity of A. Hence, if global linear convergence were established in Theorem 4.28, a polynomial number of iterations of Gradient Descent (4.22) would suffice to recover the dominant pair and a destabilizing 1-PSG.

4.2.2. Matrix Scaling Case.

An $n \times m$ matrix $M = (a_{i j})$ is viewed as a matrix tuple $A = {(a_{i j} e_{i} e_{j}^{⊤})}_{i j : a_{i j} \neq 0}$ . Consider the left-right action on A, in which the group is restricted to the subgroup $S T_{n} \times S T_{m} \subseteq S L_{n} \times S L_{m}$ consisting of diagonal matrices. The corresponding scaling problem is nothing but the matrix scaling problem of the nonnegative matrix $(| a_{i j} |^{2})$ ; see Section 3.3. The above results are also applicable to this setting. Indeed, the gradient $\nabla f_{A}$ is a pair of diagonal matrices. Then, the gradient flow/descent belongs to the diagonal subspace in $P_{n}^{1} \times P_{m}^{1}$ , and is viewed as the gradient flow/descent for the geometric programming objective (3.36) in matrix scaling. Here, all subspaces $X_{α}, Y_{α}, X_{α, β}, Y_{α, β}$ are coordinate subspaces. Hence, a DM-decomposition B is obtained by row and column permutations, and is equivalent to the original (extended) DM-decomposition of M. In Remark 4.29, the unitary matrices $σ_{k}$ and $τ_{k}$ are permutation matrices, and all lower triangular blocks of $A^{(k)}$ become zero matrices after finitely many iterations. Also, all upper triangular blocks of $A^{(k)}$ converge to zero matrices. In particular, the expected convergence to the diagonalized DM-decomposition $\hat{B}$ is true. This convergence property is almost the same as the one for the Sinkhorn algorithm. Indeed, Hayashi et al. [29] showed that this limit (Sinkhorn limit) oscillates between the $(1, \sum_{α} (n_{α} / m_{α}) 1_{J_{α}})$ -scaling $B_{r}^{*}$ and $(\sum_{α} (m_{α} / n_{α}) 1_{I_{α}}, 1)$ -scaling $B_{c}^{*}$ of $\hat{B}$ .
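The oscillation of the Sinkhorn limit described above can be observed on a toy example. The sketch below runs the classical alternating row and column normalization (row sums 1, then column sums 1) on a hypothetical $3 \times 3$ nonnegative matrix without a perfect matching; the matrix, the function name `sinkhorn`, and the iteration count are assumptions made for illustration only.

```python
def sinkhorn(matrix, iterations):
    """Alternate row and column normalization of a nonnegative matrix.

    Returns the pair (row_normalized, column_normalized) from the last sweep.
    Assumes iterations >= 1 and no zero row or column.
    """
    a = [row[:] for row in matrix]
    for _ in range(iterations):
        # Row step: scale every row to sum 1.
        a = [[v / sum(row) for v in row] for row in a]
        row_normalized = [row[:] for row in a]
        # Column step: scale every column to sum 1.
        col_sums = [sum(col) for col in zip(*a)]
        a = [[v / s for v, s in zip(row, col_sums)] for row in a]
    return row_normalized, a


# Hypothetical example: rows 2 and 3 meet only column 1, so no perfect
# matching exists. The coarse DM blocks have sizes (1, 2) and (2, 1).
M = [[1.0, 1.0, 1.0],
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
rows_done, cols_done = sinkhorn(M, 40)

print([sum(col) for col in zip(*rows_done)])  # column sums after a row step
print([sum(row) for row in cols_done])        # row sums after a column step
```

In this example the column sums after a row step approach $(2, 1 / 2, 1 / 2)$ , that is, $n_{α} / m_{α}$ on each $J_{α}$ , and the row sums after a column step approach $(2, 1 / 2, 1 / 2)$ , that is, $m_{α} / n_{α}$ on each $I_{α}$ . The off-diagonal entry in position $(1, 1)$ decays geometrically.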

4.2.3. On the Limit of the Operator Sinkhorn Algorithm.

This suggests an expectation of the limiting behavior of the operator Sinkhorn algorithm (Gurvits’ algorithm), the standard algorithm for the operator scaling problem. The operator Sinkhorn algorithm is viewed as alternating minimization of $f_{A} (x, y)$ , where each step scales $A \to g A$ with $μ (A) = (O, *)$ and $A \to A h^{†}$ with $μ (A) = (*, O)$ alternately. When it is applied to the $(p^{*} + 1 / n, q^{*} + 1 / m)$ -scaling $B^{*}$ of a diagonalized DM-decomposition $\hat{B}$ , the resulting scaling sequence oscillates between the $(1, \sum_{α} (n_{α} / m_{α}) 1_{J_{α}})$ -scaling and $(\sum_{α} (m_{α} / n_{α}) 1_{I_{α}}, 1)$ -scaling of $B^{*}$ . With the view of Theorem 4.27 and the matrix scaling case above, it is reasonable to conjecture that it oscillates between orbits $U_{n} \cdot B_{r}^{*} \cdot U_{m}$ and $U_{n} \cdot B_{c}^{*} \cdot U_{m}$ , where $B_{r}^{*}$ (resp. $B_{c}^{*}$ ) is a $(1, \sum_{α} (n_{α} / m_{α}) 1_{J_{α}})$ -scaling (resp. $(\sum_{α} (m_{α} / n_{α}) 1_{I_{α}}, 1)$ -scaling) of $\hat{B}$ .

4.3. Kronecker Form of a Matrix Pencil

Finally, we discuss the special case of $N = 2$ , that is, $A = (A_{1}, A_{2})$ . In this case, A is naturally identified with a matrix pencil $s A_{1} + A_{2} \in C {(s)}^{n \times m}$ , where s is an indeterminate. Here we reveal a connection to the Kronecker canonical form of $s A_{1} + A_{2}$ , and suggest a new numerical method for finding the Kronecker structure based on gradient descent.

A pencil $s A_{1} + A_{2}$ is called regular if $n = m$ and $\det (s A_{1} + A_{2}) \neq 0$ for some $s \in C$ . Otherwise, $s A_{1} + A_{2}$ is called singular. For simplicity, we assume (again) that $\ker A_{1} \cap \ker A_{2} = {0}$ and $\ker A_{1}^{†} \cap \ker A_{2}^{†} = {0}$ . The Kronecker form is a canonical form of a (singular) pencil under the transformation $(s A_{1} + A_{2}) \to g (s A_{1} + A_{2}) h^{†}$ by $g \in G L_{n}$ , $h \in G L_{m}$ . The standard reference for the Kronecker form is Gantmacher [20, chapter XII]; see also Murota [48, section 5.1.3] for its importance in systems analysis. For a positive integer $ϵ$ , define the $ϵ \times (ϵ + 1)$ matrix $L_{ϵ}$ by

{(L_{ϵ})}_{i j} ≔ \begin{cases} 1 & \text{if } j = i, \\ s & \text{if } j = i + 1, \\ 0 & \text{otherwise} . \end{cases}
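As a small worked instance of this definition (an illustration, not part of the original text), take $ϵ = 2$ . The block $L_{2}$ and a polynomial vector in its kernel are:

```latex
L_{2} = \begin{pmatrix} 1 & s & 0 \\ 0 & 1 & s \end{pmatrix}, \qquad
L_{2} \begin{pmatrix} s^{2} \\ - s \\ 1 \end{pmatrix}
= \begin{pmatrix} s^{2} - s^{2} \\ - s + s \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix} .
```

The kernel of $L_{2}$ over $C (s)$ is one-dimensional, and this generator has degree 2, which equals $ϵ$ . This matches the way the minimal indices are described in Theorem 4.30.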

Theorem 4.30

(Kronecker Form; Gantmacher [20, Chapter XII]). There are $g \in G L_{n}, h \in G L_{m}$ such that

g (s A_{1} + A_{2}) h^{†} = L_{ϵ_{1}} \oplus L_{ϵ_{2}} \oplus \dots \oplus L_{ϵ_{c}} \oplus (s C + D) \oplus L_{η_{d}}^{†} \oplus L_{η_{d - 1}}^{†} \oplus \dots \oplus L_{η_{1}}^{†},

(4.39)

where $s C + D$ is a regular pencil, and $ϵ_{1}, ϵ_{2}, \dots, ϵ_{c}$, $η_{1}, η_{2}, \dots, η_{d}$ are positive integers determined as follows:

$ϵ_{j}$ is the minimum degree of a polynomial vector $x_{j} (s)$ in $\ker s A_{1} + A_{2}$ that is linearly independent from $x_{1} (s), x_{2} (s), \dots, x_{j - 1} (s)$ over $C (s)$ .
$η_{j}$ is the minimum degree of a polynomial vector $y_{j} (s)$ in $\ker {(s A_{1} + A_{2})}^{†}$ that is linearly independent from $y_{1} (s), y_{2} (s), \dots, y_{j - 1} (s)$ over $C (s)$ .

The indices $ϵ_{1} \leq \dots \leq ϵ_{c}, η_{1} \leq \dots \leq η_{d}$ , called the minimal indices, are uniquely determined. If $n = m$ and $s A_{1} + A_{2}$ is singular, then the Kronecker form has a zero block with the sum of row and column numbers greater than n. Therefore, by Theorem 4.17, we have:

Corollary 4.31.

A pencil $s A_{1} + A_{2}$ is regular if and only if $n = m$ and $(A_{1}, A_{2})$ is $(1 / n, 1 / n)$ -scalable.
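The definition of regularity behind Corollary 4.31 can be tested directly for small pencils with rational entries. The sketch below is not the scaling-based test of the corollary: it uses the elementary fact that $\det (s A_{1} + A_{2})$ is a polynomial of degree at most n in s, so it vanishes identically exactly when it vanishes at $n + 1$ distinct points. The function names `det` and `is_regular` and the example matrices are hypothetical.

```python
from fractions import Fraction


def det(matrix):
    """Determinant of a square matrix of Fractions by Gaussian elimination."""
    a = [row[:] for row in matrix]
    n, result = len(a), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return result


def is_regular(a1, a2):
    """Test regularity of the pencil s*a1 + a2 (integer or Fraction entries)."""
    n = len(a1)
    if any(len(row) != n for row in a1):
        return False  # a non-square pencil is singular by definition
    # det(s*a1 + a2) has degree <= n in s, so it is the zero polynomial
    # exactly when it vanishes at n + 1 distinct sample points.
    for s in range(n + 1):
        pencil = [[Fraction(s * x + y) for x, y in zip(r1, r2)]
                  for r1, r2 in zip(a1, a2)]
        if det(pencil) != 0:
            return True
    return False


# Hypothetical singular example: the 3x3 pencil L_1 (+) L_1^dagger.
A1 = [[0, 1, 0], [0, 0, 0], [0, 0, 1]]
A2 = [[1, 0, 0], [0, 0, 1], [0, 0, 0]]
print(is_regular(A1, A2))
print(is_regular([[1, 0], [0, 1]], [[0, 0], [0, 0]]))
```

The first pencil has trivial common left and right kernels yet is singular, so by Corollary 4.31 the tuple $(A_{1}, A_{2})$ is not $(1 / 3, 1 / 3)$ -scalable; the second pencil is $s I$ , which is regular.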

We point out a further connection: the Kronecker form (4.39) can be viewed as almost a DM-decomposition. Let b denote the number of diagonal blocks of $g A h^{†}$ in (4.39). For $γ \in [b]$ , let $I_{γ}$ and $J_{γ}$ denote the row and column index sets, respectively, of the $γ$ -th diagonal block of $g A h^{†}$ . Define $X_{γ}$ as the vector subspace spanned by the rows of g having indices in $I_{γ + 1} \cup I_{γ + 2} \cup \dots \cup I_{b}$ . Similarly, define $Y_{γ}$ as the vector subspace spanned by the rows of h having indices in $J_{1} \cup \dots \cup J_{γ}$ . We let $(X_{0}, Y_{0}) ≔ (C^{n}, {0})$ (and $(X_{b}, Y_{b}) = ({0}, C^{m})$ ). Suppose that $s C + D$ $(= g (s A_{1} + A_{2}) h^{†} [I_{c + 1}, J_{c + 1}])$ exists and is an $n_{0} \times n_{0}$ upper triangular matrix. Let $Z_{β}$ denote the vector space spanned by the rows of g having the last $n_{0} - β$ indices in $I_{c + 1}$ , and let $W_{β}$ denote the vector space spanned by the rows of h having the first $β$ indices in $J_{c + 1}$ . Let $X_{c, β} ≔ X_{c + 1} + Z_{β}$ and $Y_{c, β} ≔ Y_{c} + W_{β}$ , where + is the direct sum. Consider all indices $γ$ with $(| I_{γ} |, | J_{γ} |) \neq (| I_{γ + 1} |, | J_{γ + 1} |)$ , and suppose that they are ordered as $0 ≕ γ_{0} < γ_{1} < \dots < γ_{θ} ≔ b$ .

Proposition 4.32.

(1) ${(X_{γ_{α}}, Y_{γ_{α}})}_{α = 0}^{θ}$ is the coarse DM-flag of $(A_{1}, A_{2})$ .
(2) Suppose that $s C + D$ is an $n_{0} \times n_{0}$ upper triangular pencil. Then the union of ${(X_{γ}, Y_{γ})}_{γ = 0}^{b}$ and ${(X_{c, β}, Y_{c, β})}_{β = 1}^{n_{0} - 1}$ is a DM-flag of $(A_{1}, A_{2})$ .

Proof.

Part (1). Suppose that $E_{A}$ consists of $(X_{α}^{'}, Y_{α}^{'})$ for $α = 0, 1, 2 \dots, θ^{'}$ , arranged as in (4.25). We show $(X_{α}^{'}, Y_{α}^{'}) = (X_{γ_{α}}, Y_{γ_{α}})$ for $α = 0, 1, 2 \dots, θ^{'} = θ$ . Consider the convex hull $K_{A}$ of $(0, 0)$ and $ϕ (X_{γ}, Y_{γ})$ for all $γ$ . Then $K_{A}$ belongs to $Conv ϕ (S_{A})$ , and the maximal faces of $K_{A}$ are composed of the line segments connecting points $ϕ (X_{γ}, Y_{γ})$ from $γ = 0$ to b with bending points $ϕ (X_{γ_{α}}, Y_{γ_{α}})$ .

We show $K_{A} = Conv ϕ (S_{A})$ by induction on the number b of diagonal blocks. Consider the base case $b = 1$ where the Kronecker form consists of a single block. It suffices to show that $E_{A} = {(C^{n}, 0), (0, C^{m})}$ . Suppose that $s A_{1} + A_{2}$ is an $n_{0} \times n_{0}$ regular pencil $s C + D$ . By regularity, there is no $(X, Y) \in S_{A}$ with $\dim X + \dim Y > n_{0}$ (otherwise $s A_{1} + A_{2}$ is singular over $C (s)$ ). This means that no point of $ϕ (S_{A})$ lies beyond the line segment between $(n_{0}, 0)$ and $(0, n_{0})$ . Therefore, we have $E_{A} = {(C^{n}, 0), (0, C^{m})}$ . Suppose that $s A_{1} + A_{2} = L_{n}$ . Suppose to the contrary that there is $(X, Y) \in S_{A}$ with $\dim X / n + \dim Y / (n + 1) > 1$ . By basis change, we may assume that

s A_{1} + A_{2} = (\begin{matrix} B & C \\ O & D \end{matrix}),

where O is the $r \times s$ zero matrix for $(r, s) ≔ (\dim X, \dim Y)$. By $r \geq 1$ and $r + s \geq n + 1$, B is a pencil of $n - r$ rows and s columns with $s > n - r$. Then $\ker B$ contains a polynomial vector with degree at most $n - r < n$; use Cramer’s formula to see this. Necessarily, $\ker s A_{1} + A_{2}$ also has such a polynomial vector. This is a contradiction to Theorem 4.30 ($ϵ_{1} = ϵ_{c} = n$). The case $s A_{1} + A_{2} = L_{n}^{†}$ is similar.

Consider a general case of $b \geq 2$ . We can choose $γ^{*}, α^{*}$ such that $0 < γ^{*} < b$ , $0 < α^{*} < θ^{'}$ , and the line segment between $ϕ (X_{γ^{*}}, Y_{γ^{*}})$ and $ϕ (X_{α^{*}}^{'}, Y_{α^{*}}^{'})$ meets with $K_{A}$ only at $ϕ (X_{γ^{*}}, Y_{γ^{*}})$ . Consider $(U, V) ≔ (X_{γ^{*}} + X_{α^{*}}^{'}, Y_{γ^{*}} \cap Y_{α^{*}}^{'})$ and $(U^{'}, V^{'}) ≔ (X_{γ^{*}} \cap X_{α^{*}}^{'}, Y_{γ^{*}} + Y_{α^{*}}^{'})$ . By the construction and (4.24), one of $ϕ (U, V)$ and $ϕ (U^{'}, V^{'})$ is outside of $K_{A}$ . Suppose that $ϕ (U, V) \notin K_{A}$ . Consider the submatrix $A^{'} ≔ (s A_{1} + A_{2}) [\cup_{γ = 1}^{γ^{*}} I_{γ}, \cup_{γ = 1}^{γ^{*}} J_{γ}]$ , which is also a Kronecker form with a smaller number of blocks. From $U \supseteq X_{γ^{*}}$ , $V \subseteq Y_{γ^{*}}$ , and $ϕ (U, V) \notin K_{A}$ , it necessarily holds that $K_{A^{'}} \neq Conv ϕ (S_{A^{'}})$ . However, this is a contradiction to the inductive assumption. The case $ϕ (U^{'}, V^{'}) \notin K_{A}$ is similar; consider the sub-Kronecker form $(s A_{1} + A_{2}) [\cup_{γ = γ^{*} + 1}^{b} I_{γ}, \cup_{γ = γ^{*} + 1}^{b} J_{γ}]$ .

Part (2). Observe that all integer points in the maximal faces of $Conv ϕ (S_{A})$ are obtained by the images of $(X_{γ}, Y_{γ})$ and $(X_{c, β}, Y_{c, β})$ . This implies that ${(X_{γ}, Y_{γ})}_{γ} \cup {(X_{c, β}, Y_{c, β})}_{β}$ is a maximal chain of $L_{A}$ . □

The matrix pencil $g (s A_{1} + A_{2}) h^{†}$ corresponding to a coarse DM-decomposition $g (A_{1}, A_{2}) h^{†}$ , which we call a coarse Kronecker triangular form, is a refinement of a quasi-Kronecker triangular form in Berger and Trenn [7] and generalized Schur form in Demmel and Kågström [16] and Van Dooren [59] if g, h are unitary matrices and $s C + D$ is triangular.

Then, the convergence (Theorem 4.28) of Gradient Descent (4.22) can be applied as follows:

Theorem 4.33

(Convergence to a Coarse Kronecker Triangular Form). Let $(x_{k}, y_{k})$ be a solution of (4.22). Decompose $x_{k} = σ_{k}^{†} e^{diag a^{k}} σ_{k}$ and $y_{k} = τ_{k}^{†} e^{diag b^{k}} τ_{k}$ , where $σ_{k}$ and $τ_{k}$ are unitary matrices, and $a^{k}$ and $b^{k}$ are nondecreasing and nonincreasing vectors, respectively. Then, $σ_{k} (s A_{1} + A_{2}) τ_{k}^{†}$ accumulates to coarse Kronecker triangular forms, where the convergence is linear in the same sense as in Theorem 4.28.

A coarse Kronecker triangular form is enough for determining the structure of the Kronecker form. Indeed, each (nonsquare) rectangular diagonal block is a $k ν \times k (ν + 1)$ or $k (ν + 1) \times k ν$ matrix for some integers $k, ν$ , from which all minimal indices $ϵ_{1}, ϵ_{2}, \dots, ϵ_{c}$ , $η_{1}, η_{2}, \dots, η_{d}$ can be identified.
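The bookkeeping in the previous paragraph can be sketched as follows. Given the shapes of the diagonal blocks of a coarse Kronecker triangular form, a block of shape $k ν \times k (ν + 1)$ contributes k copies of the index $ν$ to the $ϵ$ -list, a block of shape $k (ν + 1) \times k ν$ contributes k copies to the $η$ -list, and a square block is attributed to the regular part $s C + D$ . The function name `minimal_indices` and the example shapes are hypothetical; detecting the block shapes numerically is outside the scope of this sketch.

```python
def minimal_indices(block_shapes):
    """Read minimal indices off the diagonal block shapes (rows, cols).

    Returns (epsilons, etas, regular_size), with both lists sorted increasingly.
    """
    epsilons, etas, regular_size = [], [], 0
    for rows, cols in block_shapes:
        if rows == cols:
            regular_size += rows          # square block: regular part sC + D
            continue
        k = abs(cols - rows)              # number of L-blocks merged here
        nu, remainder = divmod(min(rows, cols), k)
        if remainder != 0:
            raise ValueError(f"block {rows}x{cols} is not k*nu by k*(nu+1)")
        if cols > rows:
            epsilons += [nu] * k          # k copies of L_nu
        else:
            etas += [nu] * k              # k copies of L_nu^dagger
    return sorted(epsilons), sorted(etas), regular_size


# Hypothetical shapes with row/column ratios 1/2 < 2/3 < 1 < 3/2.
print(minimal_indices([(2, 4), (2, 3), (3, 3), (3, 2)]))
```

For these shapes the sketch yields the column indices $ϵ = (1, 1, 2)$ , the row index $η = (2)$ , and a regular part of size 3.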

The above theorem suggests an iterative method for determining the minimal indices of a singular pencil, which is based on simple gradient descent and is conceptually different from the existing algorithms, for example, Demmel and Kågström [16] and Van Dooren [59]. It is an interesting future direction to develop a numerically stable algorithm based on this approach.

Acknowledgments

The authors thank Shin-ichi Ohta, Harold Nieuwboer, and Michael Walter for discussion; Yuni Iwamasa, Taihei Oki, and Tasuku Soma for comments; and Shun Sato for suggesting Sanz Serna and Zygalakis [57]. The authors also thank the referees for numerous helpful comments. The conference version appeared as Hirai and Sakabe [33].

Endnotes

¹ Proof sketch: Let $α (t) ≔ \exp_{x_{0}} t u$ and $β (t) ≔ \exp_{y_{0}} t v$ , and define $u_{t} \in S_{x_{0}}$ by $\exp_{x_{0}} d (x_{0}, β (t)) u_{t} = β (t)$ . By convexity of f along the geodesic between $x_{0}$ and $β (t)$ , it holds that $f (\exp_{x_{0}} s u_{t}) - f (x_{0}) \leq (s / d (x_{0}, β (t))) (f (β (t)) - f (x_{0}))$ for $s \in [0, d (x_{0}, β (t))]$ . By the triangle inequality, we have $(f (β (t)) - f (x_{0})) / d (x_{0}, β (t)) \leq \max_{σ \in {- 1, 1}} (f (β (t)) - f (x_{0})) / (t + σ d (x_{0}, y_{0})) \to f_{y_{0}}^{\infty} (v)$ for $t \to \infty$ . By the CAT(0)-inequality on the geodesic triangle of vertices $x_{0}, α (t), β (t)$ and by $d (α (t), β (t))$ being bounded, it holds that $\exp_{x_{0}} s u_{t} \to α (s)$ for $t \to \infty$ . Thus, we have $f_{y_{0}}^{\infty} (v) \geq (f (α (s)) - f (x_{0})) / s \to_{s \to \infty} f_{x_{0}}^{\infty} (u)$ . By symmetry, it holds that $f_{x_{0}}^{\infty} (u) \geq f_{y_{0}}^{\infty} (v)$ , and hence, $f_{x_{0}}^{\infty} (u) = f_{y_{0}}^{\infty} (v)$ .

² If $f^{\infty} (ξ) = f^{\infty} (ξ^{'}) = c < 0$ , then by convexity, it holds that $f^{\infty} (m) \leq (f^{\infty} (ξ) + f^{\infty} (ξ^{'})) / 2 = c$ for the midpoint m of $ξ$ and $ξ^{'}$ in $C M^{\infty}$ , and by $‖ m ‖ < 1$ it holds that $f^{\infty} (m / ‖ m ‖) = f^{\infty} (m) / ‖ m ‖ < c$ .

³ It is found in Ambrosio et al. [3, theorem 4.0.4] for the general setting of gradient flows in metric spaces. For our manifold case, it is an easy consequence of the first variation formula (Sakai [56, proposition 2.2]) as follows: $(d / d t) d {(ϕ_{t} (x), ϕ_{t} (y))}^{2} / 2 = 〈 - \nabla f (ϕ_{t} (y)), \dot{γ} (1) 〉 -$ $〈 - \nabla f (ϕ_{t} (x)), \dot{γ} (0) 〉 = - \int_{0}^{1} (d / d t) 〈 \nabla f (γ (s)), \dot{γ} (s) 〉 d s = - \int_{0}^{1} 〈 \nabla^{2} f (γ (s)) \dot{γ} (s), \dot{γ} (s) 〉 d s \leq 0$ , where $γ : [0, 1] \to M$ is a geodesic from $ϕ_{t} (x)$ to $ϕ_{t} (y)$ .

⁴ The formal definition of the moment map is given by $[v] \mapsto - i μ ([v]) \in k$ (Georgoulas et al. [24, lemma 8.2]).

⁵ In the earlier versions of this paper, the convergence of $x {(t)}^{1 / 2} \cdot [v]$ was stated but the proof was false.

⁶ The classical DM-decomposition restricts $S_{A}$ to coordinate subspaces and $L_{A}$ to the sublattice of the coordinate subspaces X, Y maximizing $\dim X + \dim Y$ , where g, h are chosen as permutation matrices. In this setting, a block-triangular form obtained by using the maximal chain of the entire $L_{A}$ was considered by N. Tomizawa (unpublished) in the development of principal partitions in the 1970’s; see Hayashi et al. [29, section 3]. For this reason, our decomposition may be more precisely called a DMT-decomposition.

References

[1] Allen-Zhu Z, Garg A, Li Y, Oliveira R, Wigderson A (2018) Operator scaling via geodesically convex optimization, invariant theory and polynomial identity testing. Diakonikolas I, Kempe D, eds. Proc. 50th Annual ACM SIGACT Sympos. Theory Comput. STOC 2018 (Association for Computing Machinery, New York), 172–181.Google Scholar
[2] Alvarez F, Bolte J, Brahic O (2004) Hessian Riemannian gradient flows in convex programming. SIAM J. Control Optim. 43(2):477–501.Crossref, Google Scholar
[3] Ambrosio L, Gigli N, Savaré G (2008) Gradient Flows: In Metric Spaces and in the Space of Probability Measures, 2nd ed. (Birkhäuser, Basel, Switzerland).Google Scholar
[4] Auslender A (1997) How to deal with the unbounded in optimization: Theory and algorithms. Math. Programming 79(1–3):3–18.Crossref, Google Scholar
[5] Bačák M (2014) Convex Analysis and Optimization in Hadamard Spaces (De Gruyter, Berlin).Crossref, Google Scholar
[6] Beck A (2017) First-Order Methods in Optimization (Society for Industrial and Applied Mathematics, Philadelphia).Crossref, Google Scholar
[7] Berger T, Trenn S (2012) The quasi-Kronecker form for matrix pencils. SIAM J. Matrix Anal. Appl. 33(2):336–368.Crossref, Google Scholar
[8] Boumal N (2023) An Introduction to Optimization on Smooth Manifolds (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar
[9] Bridson MR, Haefliger A (1999) Metric Spaces of Non-Positive Curvature (Springer-Verlag, Berlin).Crossref, Google Scholar
[10] Bubeck S (2015) Convex optimization: Algorithms and complexity. Foundations Trends Machine Learn. 8(3–4):231–357.Crossref, Google Scholar
[11] Bürgisser P, Li Y, Nieuwboer H, Walter M (2020) Interior-point methods for unconstrained geometric programming and scaling problems. Preprint, submitted August 27, https://arxiv.org/abs/2008.12110.Google Scholar
[12] Bürgisser P, Franks C, Garg A, Oliveira R, Walter M, Wigderson A (2018) Efficient algorithms for tensor scaling, quantum marginals, and moment polytopes. Thorup M, ed. Proc. 59th IEEE Annual Sympos. Foundations Comput. Sci. FOCS 2018 (IEEE, New York), 883–897.Google Scholar
[13] Bürgisser P, Franks C, Garg A, Oliveira R, Walter M, Wigderson A (2019) Towards a theory of non-commutative optimization: Geodesic 1st and 2nd order methods for moment maps and polytopes. Proc. 60th IEEE Annual Sympos. Foundations Comput. Sci. FOCS 2019 (IEEE, New York), 845–861.Google Scholar
[14] Caprace P-E, Lytchak A (2010) At infinity of finite-dimensional CAT(0) spaces. Math. Ann. 346(1):1–21.Crossref, Google Scholar
[15] Chen X, Sun S (2014) Calabi flow, geodesic rays, and uniqueness of constant scalar curvature Kähler metrics. Ann. Math. 180(2):407–454.Crossref, Google Scholar
[16] Demmel J, Kågström B (1993) The generalized Schur decomposition of an arbitrary pencil $A - \lambda B$: Robust software with error bounds and applications. I. Theory and algorithms. ACM Trans. Math. Software 19(2):160–174.
[17] Dulmage AL, Mendelsohn NS (1958) Coverings of bipartite graphs. Canadian J. Math. 10:517–534.
[18] Franks C (2018) Operator scaling with specified marginals. Diakonikolas I, Kempe D, eds. Proc. 50th Annual ACM SIGACT Sympos. Theory Comput. (STOC 2018) (Association for Computing Machinery, New York), 190–203.
[19] Franks C, Soma T, Goemans MX (2023) Shrunk subspaces via operator Sinkhorn iteration. Bansal N, Nagarajan V, eds. Proc. 2023 Annual ACM-SIAM Sympos. Discrete Algorithms (SODA) (SIAM, Philadelphia), 1655–1668.
[20] Gantmacher FR (1959) The Theory of Matrices, vol. 1–2 (Chelsea Publishing Co., New York).
[21] Garg A, Oliveira R (2018) Recent progress on scaling algorithms and applications. Bull. Eur. Assoc. Theoret. Comput. Sci. (125):14–49.
[22] Garg A, Gurvits L, Oliveira R, Wigderson A (2018) Algorithmic and optimization aspects of Brascamp-Lieb inequalities, via operator scaling. Geometric Funct. Anal. 28(1):100–145.
[23] Garg A, Gurvits L, Oliveira R, Wigderson A (2020) Operator scaling: Theory and applications. Foundations Comput. Math. 20(2):223–290.
[24] Georgoulas V, Robbin JW, Salamon DA (2021) The Moment-Weight Inequality and the Hilbert-Mumford Criterion: GIT from the Differential Geometric Viewpoint, Lecture Notes in Mathematics, vol. 2297 (Springer, Cham, Switzerland).
[25] Guillemin V, Sternberg S (1982) Convexity properties of the moment mapping. Inventiones Math. 67(3):491–513.
[26] Guillemin V, Sternberg S (1984) Convexity properties of the moment mapping. II. Inventiones Math. 77(3):533–546.
[27] Gurvits L (2004) Classical complexity and quantum entanglement. J. Comput. System Sci. 69(3):448–484.
[28] Hamada M, Hirai H (2021) Computing the nc-rank via discrete convex optimization on CAT(0) spaces. SIAM J. Appl. Algebra Geometry 5(3):455–478.
[29] Hayashi K, Hirai H, Sakabe K (2023) Finding Hall blockers by matrix scaling. Math. Oper. Res. 49(4):2166–2179.
[30] Hirai H (2024) Convex analysis on Hadamard spaces and scaling problems. Foundations Comput. Math. 24(6):1979–2016.
[31] Hirai H (2025) Generalized gradient flows in Hadamard manifolds and convex optimization on entanglement polytopes. Preprint, submitted November 15, https://arxiv.org/abs/2511.12064v1.
[32] Hirai H (2026) A scaling characterization of nc-rank via unbounded gradient flow. Linear Algebra Appl. 730:525–545.
[33] Hirai H, Sakabe K (2024) Gradient descent for unbounded convex functions on Hadamard manifolds and its applications to scaling problems. Proc. 65th IEEE Sympos. Foundations Comput. Sci. (FOCS 2024) (IEEE, New York), 2387–2402.
[34] Hirai H, Nieuwboer H, Walter M (2023) Interior-point methods on manifolds: Theory and applications. Proc. 64th IEEE Sympos. Foundations Comput. Sci. (FOCS 2023) (IEEE, New York), 2021–2030.
[35] Hiriart-Urruty J-B, Lemaréchal C (2001) Fundamentals of Convex Analysis (Springer-Verlag, Berlin).
[36] Ito H, Iwata S, Murota K (1994) Block-triangularizations of partitioned matrices under similarity/equivalence transformations. SIAM J. Matrix Anal. Appl. 15(4):1226–1255.
[37] Ivanyos G, Qiao Y, Subrahmanyam KV (2017) Non-commutative Edmonds’ problem and matrix semi-invariants. Comput. Complexity 26(3):717–763.
[38] Ivanyos G, Qiao Y, Subrahmanyam KV (2018) Constructive non-commutative rank computation is in deterministic polynomial time. Comput. Complexity 27(4):561–593.
[39] Iwamasa Y, Oki T, Soma T (2025) Algorithmic aspects of semistability of quiver representations. Censor-Hillel K, Grandoni F, Ouaknine J, Puppis G, eds. 52nd Internat. Colloquium Automata Languages Programming (ICALP 2025) (Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Wadern, Germany), 99:1–99:18.
[40] Kapovich M, Leeb B, Millson J (2009) Convex functions on symmetric spaces, side lengths of polygons and the stability inequalities for weighted configurations at infinity. J. Differential Geometry 81(2):297–354.
[41] Karlsson A, Margulis GA (1999) A multiplicative ergodic theorem and nonpositively curved spaces. Comm. Math. Phys. 208(1):107–123.
[42] Kempf GR (1978) Instability in invariant theory. Ann. Math. 108(2):299–316.
[43] Kirwan F (1984) Convexity properties of the moment mapping. III. Inventiones Math. 77(3):547–552.
[44] Kleiner B, Leeb B (2006) Rigidity of invariant convex sets in symmetric spaces. Inventiones Math. 163(3):657–676.
[45] Kwok TC, Lau LC, Ramachandran A (2021) Spectral analysis of matrix scaling and operator scaling. SIAM J. Comput. 50(3):1034–1102.
[46] Lu H, Freund RM, Nesterov Y (2018) Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1):333–354.
[47] Mayer UF (1998) Gradient flows on nonpositively curved metric spaces and harmonic maps. Comm. Anal. Geometry 6(2):199–253.
[48] Murota K (2000) Matrices and Matroids for Systems Analysis (Springer-Verlag, Berlin).
[49] Nemirovsky AS, Yudin DB (1983) Problem Complexity and Method Efficiency in Optimization (John Wiley & Sons, Inc., New York).
[50] Obuchowska WT (2004) On the minimizing trajectory of convex functions with unbounded level sets. Comput. Optim. Appl. 27(1):37–52.
[51] Ohta S (2025) Discrete-time gradient flows for unbounded convex functions on Gromov hyperbolic spaces. Comm. Contemporary Math., ePub ahead of print December 11, https://doi.org/10.1142/S0219199726500033.
[52] Ohta S, Pálfia M (2015) Discrete-time gradient flows and law of large numbers in Alexandrov spaces. Calculus Variations Partial Differential Equations 54(2):1591–1610.
[53] Rockafellar RT (1970) Convex Analysis (Princeton University Press, Princeton, NJ).
[54] Sakabe K (2026) Nesterov’s accelerated gradient for unbounded convex functions finds the minimum-norm point in the dual space. Preprint, submitted February 9, https://arxiv.org/abs/2602.08618.
[55] Sakabe K, Doğan ML, Walter M (2026) Strassen’s support functionals coincide with the quantum functionals. Preprint, submitted January 29, https://arxiv.org/abs/2601.21553.
[56] Sakai T (1996) Riemannian Geometry (American Mathematical Society, Providence, RI).
[57] Sanz Serna JM, Zygalakis KC (2020) Contractivity of Runge-Kutta methods for convex gradient systems. SIAM J. Numer. Anal. 58(4):2079–2092.
[58] Sinkhorn R (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist. 35(2):876–879.
[59] Van Dooren P (1979) The computation of Kronecker’s canonical form of a singular pencil. Linear Algebra Appl. 27:103–140.
[60] Vishnoi NK (2021) Algorithms for Convex Optimization (Cambridge University Press, Cambridge, UK).
[61] Wallach NR (2017) Geometric Invariant Theory (Springer, Cham, Switzerland).
[62] Woodward CT (2011) Moment maps and geometric invariant theory. Preprint, submitted June 29, https://arxiv.org/abs/0912.1132.

Articles In Advance

Information

Received: March 02, 2025
Accepted: March 02, 2026
Published Online: April 24, 2026

Cite as

Hiroshi Hirai, Keiya Sakabe (2026) Gradient Descent for Unbounded Convex Functions on Hadamard Manifolds and Its Applications to Scaling Problems. Mathematics of Operations Research 0(0).

https://doi.org/10.1287/moor.2025.0939
