Gradient Descent for Unbounded Convex Functions on Hadamard Manifolds and Its Applications to Scaling Problems

Published Online:https://doi.org/10.1287/moor.2025.0939

Abstract

In this paper, we study the asymptotic behavior of continuous- and discrete-time gradient flows of a “lower-unbounded” convex function f on a Hadamard manifold M, particularly their convergence properties to the boundary M at infinity of M. We establish a duality theorem that the infimum of the gradient-norm f(x) of f over M is equal to the supremum of the negative of the recession function f of f over the boundary M, provided the infimum is positive. Further, the infimum and the supremum are obtained by the limit of the gradient flow of f. Our results feature convex optimization ingredients of the moment-weight inequality for reductive group actions, and are applied to noncommutative optimization. We show that gradient descent of the Kempf-Ness function for an unstable orbit converges to a destabilizing 1-parameter subgroup in the Hilbert-Mumford criterion, and the associated moment-map sequence converges to the minimum-norm point of the moment polytope. We show further refinements for operator scaling—the left-right action on a matrix tuple A=(A1,A2,,AN). We characterize the gradient-flow limit of operator scaling by a vector-space generalization of the classical Dulmage-Mendelsohn decomposition of a bipartite graph. For a special case of N=2, we reveal that the limit determines the Kronecker canonical form of a matrix pencil sA1+A2.

Funding: H. Hirai was supported by the Japan Society for the Promotion of Science (KAKENHI [Grants JP21K19759 and JP24K21315]). K. Sakabe was supported by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation [Grant 556164098]) and the European Research Council (ERC Starting Grant SYMOPTIC [Grant 101040907]).

1. Introduction

In convex optimization, it is typically assumed that the objective function f is bounded below. The performance of a minimization algorithm is evaluated by its convergence behavior to the minimum of f. This paper addresses the convergence behavior of minimization algorithms for a “lower-unbounded” convex function f, that is, inff(x)=. This may look meaningless, because the trajectory xi of an algorithm diverges to infinity, and f(xi) goes to . The meta question of the paper is the following:

What can we gain from such a divergent sequence?

Let us formalize our setting and mention its background. Let M be a Hadamard manifold—a simply connected complete Riemannian manifold with nonpositive sectional curvature. Let f:MR be a (twice differentiable) geodesically convex function, that is, f is convex along any geodesic. We consider the following unconstrained convex optimization problem on M:

inf.f(x) s.t.xM, where f can be lower-unbounded.(1.1)

Such a problem setting is significant in the recent progress on operator scaling (Gurvits [27]) and generalizations (see Allen-Zhu et al. [1], Bürgisser et al. [12], Bürgisser et al. [13], Garg and Oliveira [21], Garg et al. [22, 23], and Hirai et al. [34]). In the classical matrix scaling (Sinkhorn [58]), the scalability is equivalent to the boundedness of (1.1) for some convex function f in Rn. Further, it is also equivalent to the perfect-matching condition of the associated bipartite graph. Hayashi et al. [29] studied asymptotic behavior of the Sinkhorn algorithm for the unscalable (unbounded) case, and revealed that a combinatorial certificate (Hall blocker) of unscalability can be identified from divergent behavior of the Sinkhorn algorithm. Although a Hall blocker is easily obtained by network-flow algorithms, finding the corresponding certificate (shrunk subspace) for the operator scaling setting is possible but quite difficult (see Hamada and Hirai [28], Ivanyos et al. [37, 38]). Just recently, Franks et al. [19] modified the operator Sinkhorn algorithm—an alternating minimization algorithm for some convex function on the Hadamard manifold of positive definite matrices—to obtain a shrunk subspace in polynomial time, although it is still rather complicated. The matrix and operator scaling problems are generalized to a class of convex optimization involving reductive group actions, called noncommutative optimization (Bürgisser et al. [13]), which asks to minimize the Kempf-Ness function associated with an orbit of the action. This is formulated as a convex optimization problem on a representative class of Hadamard manifolds—symmetric spaces of nonpositive curvature. It is lower-unbounded if and only if the orbit is unstable, where a 1-parameter subgroup (destabilizing 1-PSG) in the Hilbert-Mumford criterion is the unboundedness certificate that generalizes a Hall blocker and a shrunk subspace. As mentioned in Bürgisser et al. [13], it is a great challenge to design polynomial-time algorithms for several noncommutative optimization problems, such as (un)stability determination, moment polytope membership, and orbit-closure intersection, which will bring fruitful applications to broader areas of mathematical sciences. Many of them involve (un)bounded determination of Kempf-Ness functions, though our current knowledge on such problems is limited.

Motivated by these considerations, we study minimization of lower-unbounded convex functions on Hadamard manifolds. Even in the Euclidean setting M=Rn, there are few works (see, e.g., Auslender [4] and Obuchowska [50]) on such study. We focus on asymptotic behavior of the simplest algorithm—gradient descent. Accompanying this, we also consider its continuous version—gradient flow, that is, a trajectory produced by the differential equation x˙(t)=f(x(t)).

The contributions and organization of this paper are summarized as follows. We begin with a general study of the asymptotic behavior of the gradient flow/descent for an unbounded convex function f on a Hadamard manifold M. As in the Euclidean setting, the recession function (asymptotic slope) f of f (see Hirai [30] and Kapovich et al. [40]) is a basic tool of analyzing unboundedness, which is a function defined on the boundary M at infinity of M . Intuitively, the boundary M is the set of all directions ξ from an arbitrary fixed point x0, and f(ξ) represents the slope of f along the direction ξ at infinity. Then, Hadamard manifold M admits compactification MM, where the resulting topology is called the cone topology. These notions and related manifold terminologies are summarized in Section 2.

We focus on convergence properties, with respect to the cone topology, of the gradient flow/descent for an unbounded convex function f. In Section 3, under a sufficient condition infxMf(x)>0 of unboundedness, we establish in Theorem 3.1 that the gradient flow x(t) converges to a point of boundary M with provision of the following min-max (inf-sup) relation:

limtf(x(t))=infxMf(x)=supξMf(ξ)=f(limtx(t)).(1.2)

The limit limtx(t) is the unique minimizer of f over M, and is a certificate of unboundedness. Further, we also show in Theorem 3.7 that the same result holds for the sequence xi produced by gradient descent applied to an L-smooth convex function f with step-size 1/L. These are the core results of the paper that drive the subsequent arguments.

Even in the Euclidean setting M=Rn, these convergence results on the gradient flow/descent seem new, and bring an interesting ramification (Theorem 3.15): both f(x(t)) and f(xi) converge to the minimum-norm point p* of the gradient space f(Rn)¯ (which is convex). This means that gradient descent is interpreted as a minimum-norm point algorithm in the gradient space. Other interesting connections to and implications for Hessian Riemannian gradient flow (Alvarez et al. [2]), mirror descent (Nemirovsky and Yudin [49]), and geometric programming are also mentioned.

In Section 4, we present applications. In Section 4.1, we deal with the norm-minimization problem for a reductive group action on a complex vector/projective space. As mentioned, this is the problem of minimizing the Kempf-Ness function fv associated with an orbit of the action. Then, gradient descent is essentially the first-order algorithm in Bürgisser et al. [13]. Applying our results, we show that for the unstable case the trajectory of the first-order algorithm converges, in cone topology, to the unique minimizer of fv, which yields a destabilizing 1-PSG in the Hilbert-Mumford criterion. Further, the spectrum of the moment map (= transported gradient of fv) along the trajectory converges to the minimum-norm point of the moment polytope Δv. For the gradient-flow setting, we reveal the connection to the theory of the moment-weight inequality for reductive group actions, developed by Georgoulas et al. [24], building upon the earlier work by Kempf, Kirwan, Mumford, and Ness in geometric invariant theory (GIT) and the recent work by Chen and Sun [15, section 4] in K-stability. Specifically, the weak duality f(x)f(ξ) in (1.2) becomes the moment-weight inequality, and the strong duality via the gradient flow can explain important parts of their theory. It may be fair to say that our results in Section 3 extract and discretize convex optimization ingredients of their theory.

In Section 4.2, we focus on the left-right action SLn(C)×SLm(C)(g,h)gAh on a matrix tuple A=(A1,A2,,AN), which corresponds to the operator scaling problem. In this setting, the middle equality in (1.2) is interpreted as a duality theorem for the scalability limitation (Theorem 4.20), which sharpens Gurvits’ characterization in the inf-sup form. We then study the limit of the gradient flow/descent for the Kempf-Ness function (g,h)loggAh. Our focus is on the unscalable case, whereas the scalable case was studied in detail by Kwok et al. [45]. We show in Theorems 4.24 and 4.27 that the minimum-norm point of the moment polytope ΔA and the limit of the gradient flow/descent are characterized by a certain simultaneous block-triangularization of A=(A1,A2,,AN), which is a vector-space generalization of the classical Dulmage-Mendelsohn decomposition (DM-decomposition) (Dulmage and Mendelsohn [17]) of a bipartite graph. More specifically, the sequence of (normalized) scaling tuples gkAhk/gkAhk along the gradient descent converges to a block-diagonal matrix modulo the left-right unitary group action, where the block structure is determined by our generalized DM-decomposition. This answers the gradient-descent variant of an open question by Garg and Oliveira [21, section 6] for asking asymptotic behavior of the operator Sinkhorn algorithm for unscalable instances. Finding this block structure itself is significant. We partially eliminate the unitary indeterminacy from gkAhk, and exploit a convergent sequence to a coarse block-triangular structure (Theorem 4.28). This leads to a new construction of a shrunk subspace (certificate of unscalability) by gradient descent combined with the rounding procedure in Franks et al. [19].

In Section 4.3, for a special case of N=2, we reveal that our DM-decomposition of (A1,A2) coarsens and determines the well-known Kronecker canonical form of a matrix pencil sA1+A2. The Kronecker form plays important roles in systems analysis by a differential-algebraic equation (DAE) A1u˙(t)+A2u(t)=0. Its computation has been studied for a long time in the literature of numerical computation (see, e.g., Demmel and Kåragström [16] and Van Dooren [59]). Our convergence result (Theorem 4.33) suggests a new iterative method for determining the Kronecker structure, which is based on simple gradient descent and is conceptually different from the existing ones.

These results may be positioned as attempts at detecting, by algorithms in M, hidden structures in the boundary M at infinity, which have been little studied so far. We hope that our attempts lead to more serious studies from a computational complexity perspective. Particularly, it is an important future direction to improve the present convergence-type results to the ones having explicit iteration complexity.

After the submission of this paper, there have been several subsequent developments (Hirai [31, 32], Ohta [51], Sakabe [54], Sakabe et al. [55]).

2. Preliminaries

Let R and R+ denote the sets of real and nonnegative real numbers, respectively. We often add to R and R+ the infinity elements ±, where the topology and ordering are extended in the usual way. Let C denote the set of complex numbers z=x+iy, where z¯ denotes the complex conjugate xiy and Rez denotes the real part x. The same notation is used for a complex vector ζ=u+ivCn with u,vRn as ζ¯=uiv. For a matrix A over C, let A denote the transpose conjugate. For sets I and J of row indices and column indices of A, let A[I, J] denote the submatrix of A with row indices in I and column indices in J. For two matrices A, B (of possibly different sizes), let AB denote the block-diagonal matrix of block-diagonals A, B in order. For a vector pRn, let diagp denote the n×n diagonal matrix with (diagp)ii=pi.

The general linear group GL(n,C) and the special linear group SL(n,C) over C are simply denoted by GLn and SLn, respectively. The unitary group U(n) and the special unitary group SU(n) are denoted by Un and SUn, respectively. For a finite-dimensional vector space V over C, let GL(V) denote the group of linear isomorphisms on V.

For a positive integer n, let [n]{1,2,,n}. For X[n], let 1XRn be defined by (1X)i=1 if iX and 0 otherwise, where 1[n] is simply written as 1.

A sequence (xi)i=0,1,, and function (x(t))tR+ are simply denoted by xi and x(t), respectively. For a real-valued sequence ai and continuous function h(t), we will use several times the following:

lim infiailim infi1ij=1iajlim supi1ij=1iajlim supiai,(2.1)
lim infth(t)lim inft1t0th(s)dslim supt1t0th(s)dslim supth(t).(2.2)

This is a little exercise in calculus. For example, the leftmost in (2.2) follows from this: Suppose that αlim infth(t)R. Then ϵ>0, N0, tN, h(t)αϵ, and hence, tN, 1t0th(s)ds1t0Nh(s)ds+tNt(αϵ)tαϵ. Because ϵ is arbitrary, we have lim inft1t0th(s)dsα.

2.1. Riemannian Geometry

We will utilize standard terminologies and notation on Riemannian geometry (see, e.g., Sakai [56]). See also a recent book (Boumal [8]) for the optimization perspective. We assume sufficient differentiability for manifolds, functions, maps, and vector/tensor fields on them. Let M be a Riemannian manifold. For xM, let Tx=Tx(M) denote the tangent space of M at x, where ·,·=·,·x denotes the Riemannian metric at x and ··,· denotes the associated norm. Let Sx{uTxu=1} and Bx{uTxu1} denote the unit sphere and ball in Tx, respectively. The angle (u,v) of two vectors u,vTx is defined as cos1(u,v/uv). The product M×M of two Riemannian manifolds M,M is viewed as a Riemannian manifold by setting (u,u),(v,v)(x,x)u,vx+u,vx.

For a path γ:[a,b]M and t[a,b], let γ˙(t) denote the tangent vector of γ at Tγ(t). The length of the path γ is defined by abγ˙(t)dt. The distance d(x,y) between x,yM is the infimum of the length of a path connecting x and y. We consider the Levi-Civita connection associated with the Riemannian metric. The connection determines the parallel transport τγt:Tγ(0)Tγ(t) along any path γ:[0,b]M with t[0,b], where τγt(τγt)1. By using the parallel transport, the covariant derivative uV of a vector field V=(Vx)xM by uTx is given by uV(d/dt)τγtVγ(t)t=0, where γ is a path with γ(0)=x and γ˙(0)=u.

In this paper, any manifold M is assumed to be complete. That is, the metric space (M,d) is complete. Then, the distance d(x,y) is always attained by a geodesic—a path γ:[a,b]M satisfying γ˙(t)γ˙=0 for t[a,b]. By a unit-speed geodesic ray, we mean a geodesic γ:[0,)M with γ˙(0)=1 (and then γ˙(t)=1 for all t). For xM and uTx, there is a unique geodesic γ(t) with γ(0)=x and γ˙(0)=u, denoted by expxtu. By completeness of M, the map texpxtu is defined on R+. This gives rise to a surjective map expx:TxM, called the exponential map.

For a map φ:MN, where N is another manifold, let dφ:Tx(M)Tφ(x)(N) denote the differential of φ at xM. The differential df=dfx:TxR of a function f:MR is given by df(u)=(d/dt)f(γ(t))t=0, where γ is a path with γ(0)=x and γ˙(0)=uTx. The gradient f(x)Tx of f is then defined via

f(x),udf(u)(uTx).

The covariant differentiation of the gradient vector field f gives rise to the Hessian 2f(x):TxTx:

2f(x)uuf(x)(uTx).(2.3)

The Hessian is a symmetric operator in the sense that 2f(x)u,v=2f(x)v,u.

2.1.1. Complex Projective Space.

We will consider the complex projective space as a Riemannian manifold. Let V be an n-dimensional vector space over C. The complex projective space P(V) over V is a quotient manifold V{0}/ by the equivalent relation vvv=αv (αC{0}). The image of v by V{0}P(V) is denoted by [v]. A Riemannian structure on P(V) is given by the Fubini-Study form as follows. Let (·,·) be a Hermitian inner product on V. Regard V as a 2n-dimensional Euclidean space by the real inner product Re(·,·). This induces a Riemannian structure on the sphere S2n1={vVv=r}, where we set r2. Further, U1(=U(1)) acts isometrically on S2n1 by scalar multiplication U1×S2n1(eiθ,v)eiθv. Then P(V) is viewed as the Riemannian quotient of S2n1 with respect to this action. The resulting metric on P(V) is called the Fubini-Study metric. See, for example, Boumal [8, chapter 9] for Riemannian quotient manifolds.

2.2. Hadamard Manifold

A Hadamard manifold M is a simply connected complete Riemannian manifold having nonpositive sectional curvature everywhere (see Sakai [56, V.4]). For any two points in M, a geodesic connecting them is uniquely determined (up to affine rescaling). The exponential map expx is a diffeomorphism from Tx to M. The parallel transport from Tx to Ty along the geodesic is simply denoted by τxy.

In this paper, the boundary M at infinity and the cone topology on MM play particularly important roles. See Sakai [56, V.4.2]) for a quick introduction to these notions. Two unit-speed geodesic rays γ,γ:R+M are called asymptotic if d(γ(t),γ(t))<C (tR+) for some constant C>0. The asymptotic relation is an equivalence relation on the set of all unit-speed geodesic rays. Let M denote the set of all equivalence classes. Let us fix an arbitrary point xM. Any unit vector uSx defines an asymptotic class of unit-speed geodesic ray texpxtu. This correspondence is a bijection between Sx and M, and induces a topology on M that is isomorphic to the sphere Sx. In fact, this topology is independent of the choice of x. Further, the topologies on M and on M are extended to MM as follows. Because expx is a diffeomorphism, it holds that M(Sx×R+)/0, where 0 is the equivalence relation defined by (u,r)0(u,r)  (u,r)=(u,r) or r=r=0. With MSx×{}, we obtain the compact Hausdorff space MM(Sx×(R+{}))/0 (isomorphic to Bx). This topology on MM is called the cone topology. In this topology, a sequence xi in M converges to ξM if and only if

  • d(x,xi), and

  • the sequence ui in Sx determined by xi=expxd(x,xi)ui converges to uSx, where the asymptotic class of geodesic texpxtu is equal to ξ.

The angle (ξ,ξ) of two points ξ,ξM is defined as supxM(u,u), where u and u are the representatives of ξ and ξ, respectively, at Tx. The angle defines a metric on M, which induces a different topology. By using the angle metric on M, we can define a metric d on the Euclidean cone CM(M×R+)/0 of the boundary M by d((ξ,r),(ξ,r))2=r2+(r)22rrcos(ξ,ξ). This space CM is viewed as the space of asymptotic classes of (not necessarily unit-speed) geodesic rays. It is identified with Tx, though the metric space (CM,d) has a different topology from Tx and is not necessarily a manifold. This metric space (CM,d) is a Hadamard space—a complete geodesic metric space satisfying the CAT(0)-inequality (Bridson and Haefliger [9]). It is uniquely geodesic, and its convexity is defined along geodesics. The unit ball B={pCMd(0,p)1} around the origin 0 is a convex set, where the origin 0 is the image of point (ξ,0). Observe that B can be identified with Bx for any xM.

2.2.1. Manifold of Positive Definite Matrices and Symmetric Space.

A representative example of a Hadamard manifold is the space Pn of n×n positive definite Hermitian matrices (see Bridson and Haefliger [9, II.10]). The tangent space Tx at xPn is identified with the real vector space pn of Hermitian matrices, and the Riemannian metric is given by G,Hxtrx1Hx1G. In this space, several manifold notions are explicitly written (see, e.g., Hirai et al. [34, section 5.2]). The exponential map expx at x is given by Hx1/2ex1/2Hx1/2x1/2, where e is the matrix exponential. Particularly, any geodesic issuing at x is of form tx1/2etx1/2Hx1/2x1/2 for some Hermitian matrix HTx with H=x1/2Hx1/2F=1, where ·F is the Frobenius norm. An explicit formula of the geodesic parallel transport τxy is also known. We will use one special case: τxIH=x1/2Hx1/2.

Any totally geodesic subspace M of Pn is also a Hadamard manifold. Here, a submanifold MPn is said to be totally geodesic if every geodesic in M is also geodesic in Pn. It is known (Bridson and Haefliger [9, II.10.58]) that for a connected Lie group GGLn defined by polynomials and satisfying G=G, the submanifold PnG is a totally geodesic subspace. Such a group G is called self-adjoint (or symmetric), and is a reductive algebraic group (see Wallach [61, sections 2.2, 3.1.3, and 3.2]). Here PnG is known as a symmetric space (of nonpositive curvature). A particular case we will face is G=SLn and Pn1PnSLn={xPndetx=1}, where the tangent space TI(Pn1) at I is given by pn1{HpntrH=0}. It is known (Bridson and Haefliger [9, II.10.71]) that the boundary M at infinity of M=PnG becomes a spherical building, and the associated Euclidean cone CM becomes a Euclidean building. We will consider convex functions on these spaces in Section 4.

2.3. Convex Function

In a Hadamard manifold M, by uniqueness of geodesics, convexity is naturally introduced. A function f:MR is said to be convex if for every geodesic γ:[a,b]M one-dimensional function fγ:[a,b]R is convex. We will assume twice differentiability for the smoothness of f. Then the convexity condition is equivalent to (fγ)(t)0. From (fγ)(t)=(d/dt)f(γ(t)),γ˙(t)=γ˙(t)f(γ(t)),γ˙(t), convexity of f is equivalent to positive semidefiniteness of Hessian 2f(x):

2f(x)u,u0
for all xM,uTx. We also consider the Lipschitz condition for the gradient vector field f. For LR+, a function f:MR is said to be L-smooth if
2f(x)u,uLu,u
for all xM, uTx. That is, the operator norm 2f(x) is bounded by L for all xM.

We next introduce an important tool for studying the unboundedness of convex functions. Let us fix x0M. The recession function (asymptotic slope) f=fx0:MR{} (Hirai [30], Kapovich et al. [40]) is defined by

fx0(u)limsf(expx0su)f(x0)s=limsf(expx0su)s
=limsddsf(expx0su)(uSx0M),(2.4)
where the limits exist in R{} because of convexity of f (monotonicity of s(f(expx0su)f(x0))/s and of s(d/ds)f(expx0su)) and the last equality follows from (2.2) for h(t)(d/dt)f(expx0tu). It is shown by Kleiner and Leeb [44, lemma 2.10] that if texpx0tu and texpy0tv are asymptotic, then fx0(u)=fy0(v).1 Hence, the recession function f is regarded as MR{}. Further, f is naturally extended to CMR{} by allowing u to any vector in Tx0CM. If M=Rn, then CM=Rn and f matches the recession function in Euclidean convex analysis (see Rockafellar [53, section 8] and Hiriart-Urruty and Lemaréchal [35, section 3.2]). As in the Euclidean case, the following properties hold:
infξMf(ξ)<0infxMf(x)=.infξMf(ξ)>0x*M:f(x*)=infxMf(x).(2.5)

The second property is included in Kapovich et al. [40, lemma 3.2 (vi)]. Moreover, it is known (Hirai [30]) that f is a positively homogeneous convex function on Hadamard space CM.

In particular, both infξMf(ξ)<0 and infxMf(x)>0 are sufficient conditions for unboundedness of f. In fact, they are equivalent.

Proposition 2.1

(Kapovich et al. [40, Lemma 3.2 (iii), Lemma 3.4]; See Also Hirai [30]).

  • (1) infξMf(ξ)<0 if and only if infxMf(x)>0.

  • (2) If infξMf(ξ)<0, then there uniquely exists ξ*M with f(ξ*)=infξMf(ξ).

The existence in Proposition 2.1 part (2) follows from the lower semicontinuity of f on the compact space M with respect to the cone topology. The uniqueness of ξ* in part (2) can be seen from the positively homogeneous convexity of f on CM, as in the Euclidean case.2

As a sharpening of the easier part (the only-if part) in part (1), we here mention the following weak duality relation between the gradient-norm and the recession function.

Lemma 2.2

(Weak Duality). infxMf(x)supξBf(ξ).

Proof.

For xM and ξBxB, it holds that

f(ξ)=limtf(expxtξ)f(x)tlimt0f(expxtξ)f(x)t=f(x),ξf(x),
where the first inequality follows from convexity of f (monotonicity of t(f(expxtξ)f(x))/t) and the last inequality follows from Cauchy-Schwarz and ξ1. □

In Section 3, we show, via the gradient flow of f, that the equality (strong duality) always holds. This technique may be viewed as a refinement of the proof of the if-part in Kapovich et al. [40, proposition 2.1 (1)], in which the limit of the normalized gradient flow of f constructs ξ with f(ξ)<0. A similar gradient-flow approach can be found in the setting of GIT (Chen and Sun [15], Georgoulas et al. [24], Woodward [62]) (see Section 4.1).

3. Asymptotic Behavior of Gradient Flow

3.1. Continuous-Time Gradient Flow

Throughout, M denotes a Hadamard manifold. Let f:MR be a twice differentiable convex function. Consider the following differential equation—the gradient flow of f,

dx(t)dt=f(x(t)),x(0)=x0.(3.1)

It is clear that the trajectory x(t) is going to minimize f; see Lemma 3.2 part (2) below. In fact, if a minimizer of f exists, then x(t) converges to a minimizer. This convergence is known for the general setting of Hadamard spaces (see, e.g., Bačák [5, theorem 5.1.16] and Mayer [47, theorem 2.41]). Our focus is on the case where f is unbounded below, particularly the case where the minimum gradient-norm is positive. We establish the following convergence of an unbounded gradient flow and strong duality between the gradient-norm and the recession function.

Theorem 3.1.

Suppose that κ*infxMf(x)>0. Let x(t) be the solution of (3.1).

  • (1) f(x(t)) converges to the minimum gradient-norm κ*, and

  • (2) x(t) converges, in cone topology, to the unique minimizer ξ* of f over M,

where the following equality holds:
limtf(x(t))=infxMf(x)=supξMf(ξ)=f(limtx(t)).(3.2)

We should mention related results. In the general setting of Hadamard space X, Caprace and Lytchak [14, proposition 4.2] showed that the gradient-flow curve of a Lipschitz convex function with κ*>0 converges to a point in the boundary X of X. Their proof relies on a very general result of Karlsson and Margulis [41, theorem 2.1] for semicontraction semigroups in uniformly convex spaces. Here it is well-known3 that the gradient-flow semigroup ϕt satisfies the (semi)contraction property:

d(ϕt(x),ϕt(y))d(x,y)(tR+,x,yM),(3.3)
where ϕt(x) is the solution of (3.1) with initial point x(0)=x. If the velocity of escape
κ*(x)lim suptd(ϕt(x),x)t(3.4)
is positive, then the result of Karlsson and Margulis [41, theorem 2.1] is applicable for convergence of ϕt(x) in M; Caprace and Lytchak [14] actually showed that κ*>0 implies κ*(x)>0. Although one can deduce the entire statement of Theorem 3.1 from this with more effort, we take a different approach that relies neither on Karlsson and Margulis [41] nor on the contraction property (3.3). As mentioned after Lemma 2.2, our proof is partly inspired by an idea in Kapovich et al. [40], but it directly establishes the relation (3.2). An advantage of this approach is that it can adapt to the discrete setting in Section 3.2.

We start with the following well-known properties of gradient flows.

Lemma 3.2.

  • (1) The solution x(t) of (3.1) is defined on R+.

  • (2) tf(x(t)) is nonincreasing.

  • (3) tf(x(t)) is nonincreasing.

We describe a proof because the intermediate equations will be used.

Proof.

Lemma 3.2 part (2) follows from (d/dt)f(x(t))=f(x(t)),x˙(t)=f(x(t))20.

Part (3) follows from (d/dt)f(x(t))2=22f(x(t))x˙(t),x˙(t)0 by convexity of f (positive semidefiniteness of 2f(x(t))).

Part (1). Suppose that x(t) is defined on [0,T) for finite T>0. For 0tt<T, it holds that

d(x(t),x(t))ttx˙(s)dsf(x0)(tt),
where the second inequality follows from part (3). Therefore, x(t) is Cauchy for tT. Because M is complete, the limit x*limtTx(x) exists in M. Then x(t) is connected to the solution of y˙(t)=f(y(t)), y(0)=x*, and is defined on [0,T+ϵ) for some ϵ>0. If we take maximal T, it must be T=. □

Proof of Theorem 3.1.

Let κlimtf(x(t))κ*>0. First, we note

f(x(t))f(x0)=0tddτf(x(τ))dτ=0tf(x(τ))2dτκ2t,(3.5)
d(x(t),x0)0tx˙(τ)dτ=0tf(x(τ))dτ,(3.6)
where the last inequality in (3.5) follows from Lemma 3.2 part (3). Then it holds that d(x(t),x0) for t. Otherwise, x(t) has an accumulation point x* in M and f(x*)= by (3.5), contradicting f(x*)R.

Define u(t)Sx0 via x(t)=expx0d(x(t),x0)u(t). For s(0,d(x(t),x0)], by convexity of f along the geodesic from x0 to x(t), it holds that

f(expx0su(t))f(x0)sd(x(t),x0)(f(x(t))f(x0)).

From this, we have

f(expx0su(t))f(x0)sf(x(t))f(x0)d(x(t),x0)0tf(x(τ))2dτ0tf(x(τ))dτ1t0tf(x(τ))dτκ,
where the second inequality follows from (3.5) and (3.6), the third from the Cauchy-Schwartz inequality (0tF(τ)G(τ)dτ)20tF(τ)2dτ0tG(τ)2dτ for F(τ)f(x(τ)) and G(τ)1, and the fourth from Lemma 3.2 part (3).

Choose any convergence subsequence u(ti) with ti (d(x(ti),x0)) and u(ti)u*. Then it holds that

f(expx0su*)f(x0)sκ.

For s, we have f(u*)κ. Then, we have

infξMf(ξ)f(u*)κκ*=supxMf(x)infξMf(ξ),
where we use the weak duality (Lemma 2.2) for the last inequality. This shows κ=κ* and proves (3.2). Because the minimizer ξ* of f over M uniquely exists (Proposition 2.1 part (2)), it must hold that ξ*=u*. We showed that any convergent subsequence u(ti) of u(t) converges to ξ*. Because Sx0 is compact, u(t) itself converges to ξ*. □

Even if κ*=0, the strong duality holds (because f(0)=0).

Corollary 3.3.

infxMf(x)=supξBf(ξ).

The velocity of escape (3.4) coincides with the minimum gradient-norm.

Proposition 3.4.

Suppose that κ*infxMf(x)>0. Let ξ*Sx0 denote the representative of the unique minimizer of f over MSx0. Then the following hold:

  • (1) limtd(x0,x(t))t=κ*.

  • (2) limtexpx01x(t)t=κ*ξ*.

Proof.

Part (1). For t>s0, it holds that d(x(s),x(t))stf(x(τ))dτf(x(s))(ts) (by Lemma 3.2 part (3)). Hence,

lim suptd(x0,x(t))t=lim suptd(x(s),x(t))tsf(x(s))sκ*,(3.7)
where the convergence of f(x(s)) to κ* follows from Theorem 3.1 part (1). On the other hand, by taking the unit-speed geodesic γ from x(s) to x(t), we have
f(x(t))2(ts)stf(x(τ))2dτ=f(x(t))f(x(s))γ˙(0),f(x(s))d(x(s),x(t))f(x(s))d(x(s),x(t)),
where the first equality follows from Lemma 3.2 part (3), the second inequality from convexity of f along γ, and the last from the Cauchy-Schwarz inequality. Thus, it holds that
lim inftd(x0,x(t))tlim inftd(x(t),x(s))d(x0,x(s))t=lim inftd(x(t),x(s))tslimtf(x(t))2f(x(s))=(κ*)2f(x(s))sκ*.(3.8)

By (3.7) and (3.8), we have

κ*lim inftd(x0,x(t))tlim suptd(x0,x(t))tκ*.

Part (2). By Theorem 3.1, it holds that limtexpx01x(t)d(x0,x(t))=ξ*. Therefore, by part (1), we have

limtexpx01x(t)t=limtexpx01x(t)d(x0,x(t))d(x0,x(t))t=κ*ξ*.

We next consider “convergence” of the gradient f(x(t)). Because the space Tx(t) varies, the convergence concept of f(x(t)) is less obvious. In our intuition, f(x(t)) and ξ* would have opposite directions in the limit. The following partially justifies this intuition.

Proposition 3.5.

Suppose that κ*infxMf(x)>0. Let ξ*Sx0 denote the representative of the unique minimizer of f over MSx0. Then it holds that

lim inftτx(t)x0f(x(t))+κ*ξ*=0.

Question 3.6.

Does limtτx(t)x0f(x(t))=κ*ξ* hold?

We will see in Section 4 that this property has important consequences.

Proof of Proposition 3.5.

Let γt be the unit-speed geodesic from x0 to x(t). Let d(t)d(x0,x(t)). Then, by Sakai [56, chapter III, proposition 4.8 (1)], it holds that d(t)=γ˙t(d(t)),x˙(t). Therefore, we have

lim suptd(t)=lim suptγ˙t(d(t)),x˙(t)limtx˙(t)=limtf(x(t))=κ.(3.9)

On the other hand, by Proposition 3.4, it holds that κ*=lim suptd(t)/tlim suptd(t), where the inequality follows from (2.2) with h(t)d(t). Thus, the equality holds in (3.9). Necessarily, we have

lim supt(γ˙t(d(t)),f(x(t)))=π.(3.10)

By f(x(t))κ*, we have lim inftf(x(t))+κ*γ˙t(d(t))=0. With parallel transport τx(t)x0 and γ˙t(0)ξ*, we have the claim. □

3.2. Discrete-Time Gradient Flow (Gradient Descent)

Next, we consider the discrete version. Suppose that f:MR is an L-smooth convex function. Consider the following sequence:

xi+1expxi(1Lf(xi))(i=0,1,).(3.11)

This is nothing but the trajectory generated by gradient descent with initial point x0 and step-size 1/L; we discuss in Remark 3.13 another type of discrete gradient flow. The convergence/accumulation of xi to a minimizer of f can be shown under several reasonable assumptions (see, e.g., Boumal [8, theorem 11.29]). For the unbounded case, as in the continuous setting, we establish the following.

Theorem 3.7.

Suppose that κ*infxMf(x)>0. Let xi be the sequence in (3.11).

  • (1) f(xi) converges to the minimum gradient-norm κ*, and

  • (2) xi converges, in cone topology, to the unique minimizer ξ*M of f.

Hence, the following holds:

limif(xi)=infxMf(x)=supξMf(ξ)=f(limixi).(3.12)

Our original attempt proving this was to establish the contraction property

d(ϕi(x),ϕi(y))d(x,y)(x,yM,i=1,2,),(3.13)
for the semigroup ϕi of (3.11), and to apply the approach of Caprace and Lytchak [14] and Karlsson and Margulis [41]. However, we were unable to do so, and we do not know whether (3.13) is true. Note that (3.13) is true in Euclidean space M=Rn (see, e.g., Sanz Serna and Zygalakis [57, example 1]).

The proof goes in a way analogous to Theorem 3.1. Corresponding to Lemma 3.2, the following properties hold.

Lemma 3.8.

  • (1) f(xi+1)f(xi)1Lf(xi+1)2.

  • (2) f(xi+1)f(xi).

Contrary to the well-known inequality f(xi+1)f(xi)(1/2L)f(xi)2 (see Boumal [8, (11.15)]), our inequality Lemma 3.8 part (1) seems less well-known; see Remark 3.14 for further discussion.

Proof.

Part (2). Let γ(t)expxitf(xi). Then we have

τγ1/Lf(xi+1)=f(xi)+01/Lddsτγsf(γ(s))ds=f(xi)+01/Lτγsγ˙(s)f(γ(s))ds=f(xi)+01/Lτγs2f(γ(s))γ˙(s)ds=L01/Lτγs(I1L2f(γ(s)))τγsf(xi)ds,(3.14)
where we use the definition (2.3) of 2 and γ˙(s)=τγsγ˙(0)=τγsf(xi) as γ is a geodesic. Because , is invariant under parallel transport, the operator norm of τγs(I(1/L)2f(γ(s)))τγs is equal to that of I(1/L)2f(γ(s)). By convexity and L-smoothness, all eigenvalues of 2f(γ(s)) belong to [0, L]. Hence, we have
f(xi+1)=τγ1/Lf(xi+1)L01/LI1L2f(γ(s))f(xi)dsf(xi),
which proves part (2).

We now prove part (1). From (3.14), we have

τγ1/Lf(xi+1)12f(xi)=L01/Lτγs(12I1L2f(γ(s)))τγsf(xi)dsL01/L(12I1L2f(γ(s)))f(xi)ds12f(xi).

By squaring this and applying the rearrangement ab2b2a22a,b, we have τγ1/Lf(xi+1)2τγ1/Lf(xi+1),f(xi), particularly,

f(xi+1)2f(xi+1),τγ1/Lf(xi).(3.15)

From convexity, it holds that

f(xi)f(xi+1)+1Lddtf(γ(1/Lt))t=0=f(xi+1)1Lf(xi+1),γ˙(1/L)=f(xi+1)+1Lf(xi+1),τγ1/Lf(xi)f(xi+1)+1Lf(xi+1)2,
where we use (3.15) for the last inequality. □

Proof of Theorem 3.7.

The proof is similar to that of Theorem 3.1. Let κlimif(xi)κ*. For i>0, we have

f(xi)f(x0)1Lk=1if(xk)2iLκ,(3.16)
d(xi,x0)k=0i1d(xk,xk+1)=1Lk=0i1f(xk),(3.17)
where (3.16) follows from Lemma 3.8 and (3.17) follows from the triangle inequality and d(x,expxu)=u with (3.11). Then d(xi,x0) is shown as in the proof of Theorem 3.1.

Let uiSx0 be defined via xi=expx0d(xi,x0)ui. For s(0,d(xi,x0)], by convexity of f along geodesic sexpx0sui, it holds that

f(expx0sui)f(x0)sd(xi,x0)(f(xi)f(x0)).

From this, we have

f(expx0sui)f(x0)sf(xi)f(x0)d(xi,x0)k=1if(xk)2d(xi,x0)k=1if(xk)2k=0i1f(xk)=k=0i1f(xk)2k=0i1f(xk)+f(x0)2f(xi)2k=0i1f(xk)1ik=0i1f(xk)+f(x0)2k=0i1f(xk)κ+1if(x0)2κ,(3.18)
where the second inequality follows from (3.16), the third from (3.17) and the negativity of the numerator, the fourth from the Cauchy-Schwarz inequality (kFkGk)2kFk2kGk2, and the fifth from Lemma 3.8 part (2).

Choose any convergent subsequence {uik} of {ui}, which converges to u*Sx0. The second term of (3.18) vanishes as ik. Then it holds that

f(expx0su)f(x0)sκ.

By s, we have f(u*)κ. The rest is the same as the last part of the proof of Theorem 3.1. □

We note the limiting behavior of the decrement of f(xi) and the change of f(xi).

Lemma 3.9.

  • (1) limif(xi+1)f(xi)=(κ*)2L.

  • (2) limiτxixi+1f(xi)f(xi+1)=0.

Proof.

Part (1). By convexity and Lemma 3.8 part (1), we have

1Lf(xi)2f(xi+1)f(xi)1Lf(xi+1)2.

By i with Theorem 3.7, we have the claim.

Part (2). The inequality (3.15) is also written as

f(xi+1)2f(xi)f(xi+1)cos(f(xi+1),τxixi+1f(xi)).

By f(xi)κ, we have (f(xi+1),τxixi+1f(xi))0, and the claim follows. □

The discrete version of Proposition 3.4 is the following.

Proposition 3.10.

Suppose that κ*infxMf(x)>0. Let ξ*Sx0 denote the representative of the unique minimizer of f over MSx0.

  • (1) limid(x0,xi)i=κ*L.

  • (2) limiexpx01xii=κ*ξ*L.

Proof.

Part (1). As in (3.17), it holds that d(x0,xi)k=0i1d(xk,xk+1)=k=0i1f(xk)/L. Hence, with (2.1) for aif(xi), we have

lim supid(x0,xi)i1Llim supi1ik=0i1f(xk)1Llim supkf(xk)=κL.(3.19)

On the other hand, for arbitrary 0i<j, we have

jiLf(xj)21Lk=i+1jf(xk)2(f(xj)f(xi))γ˙(0),f(xi)d(xi,xj)f(xi)d(xi,xj),
where the first inequality follows from Lemma 3.8 part (2), the second from Lemma 3.8 part (1), and the third from the convexity of f along unit-speed geodesic γ from xi to xj. Thus, for arbitrary i0, it holds that
lim infjd(x0,xj)jlim infjd(xi,xj)d(x0,xi)j=lim infjd(xi,xj)ji1Llim infjf(xj)2f(xi)=1L(κ)2f(xi)iκL.(3.20)

By (3.19) and (3.20), we have

κ*Llim infid(x0,xi)ilim supid(x0,xi)iκ*L.

Part (2). As in the proof of Proposition 3.4 part (2), by Theorem 3.7 and the above part (1), we have

limiexpx01xii=limiexpx01xid(x0,xi)d(x0,xi)i=κ*ξ*L.

For convergence of f(xi), the same property of Proposition 3.5 holds:

Proposition 3.11.

Suppose that κ*infxMf(x)>0. Let ξ*Sx0 denote the representative of the unique minimizer of f over MSx0. Then it holds that

lim infiτxix0f(xi)+κ*ξ*=0.

Question 3.12.

Does limiτxix0f(xi)=κ*ξ* hold?

Proof of Proposition 3.11.

Let did(x0,xi). We first show

lim supidi+1di=κ/L.(3.21)

Indeed, by the triangle inequality and Theorem 3.7 part (1), we have lim supidi+1dilim supid(xi,xi+1)=lim supif(xi)/L=κ/L. On the other hand, by Proposition 3.10 part (1), it holds that κ*/L=lim supidi/ilim supidi+1di, where the inequality follows from (2.1) for aidi+1di.

Consider the geodesic triangle of vertices x0,xi1,xi. Let γi denote the unit-speed geodesic from x0 to xi. Let θi denote the angle at vertex xi of this triangle. Then

θi=(γ˙i(di),τxi1xif(xi1)).

By the law of cosines in CAT(0) space M (see, e.g., Bridson and Haefliger [9, II.1.9 (2)]), we have

cosθidi2+d(xi1,xi)2di122did(xi1,xi)=d(xi1,xi)2di+12(1+di1di)didi1d(xi1,xi).

Take lim supi in this inequality. By di=d(x0,xi), d(xi1,xi)=f(xi1)/Lκ*/L (from Theorem 3.7 part (1)), di1/di1 (seen from Proposition 3.10 part (1)), and (3.21), we have lim supicosθi1, and lim infiθi=0. By Lemma 3.9 part (2), it holds that (f(xi),τxi1xif(xi1))0 and

lim supi(γ˙i(di),f(xi))=π.

By taking parallel transport τxix0 and γ˙i(0)ξ*, we have the claim. □

Remark 3.13.

Another type of discrete gradient flow, well-studied in the literature of nonpositively curved space (see Bačák [5], Mayer [47], and Ohta and Pálfia [52]), is defined via the resolvent map Jλf:MM,

Jλf(x)arg minyMf(y)+12λd(x,y)2(xM),(3.22)
where λ is a positive parameter. Let λi be a sequence of positive reals (satisfying λi0 and iλi). Then a discrete analogue (proximal point method) of gradient flow is as follows:
xi+1=Jλif(xi)(i=0,1,).(3.23)

For our manifold case, it can be written as an implicit difference scheme:

xi=expxi+1λif(xi+1).(3.24)

Several nice (convergence) properties are known for the sequence of (3.23). For example, the contraction property (3.13) holds for the semigroup of (3.23) (see Bačák [5, theorem 2.2.23]). On the other hand, solving (3.22) is a nontrivial task from an algorithmic point of view.

Remark 3.14.

In the case of M=Rn, Lemma 3.8 part (1) can be easily obtained from a known inequality. For an L-smooth convex function f in Rn, the following inequality holds (e.g., Beck [6, theorem 5.8 (iii)]):

f(y)f(x)f(x),yx+12Lf(x)f(y)2(x,yRn),
though we do not know a reasonable manifold version to hold. By substituting x=xi+1,y=xi, and using xixi+1=f(xi)/L and f(xi)f(xi+1) (Lemma 3.8 part (2)), we have Lemma 3.8 part (1):
f(xi+1)f(xi)12L(f(xi)2+f(xi+1)2)f(xi)1Lf(xi+1)2.

3.3. Euclidean Specialization

Here, we present refinements of the above results for the Euclidean setting M=Rn. As far as our knowledge, the above convergence results on the gradient flow/descent seem new even in this special case, and are further sharpened as follows. In the Euclidean space M=Rn, the tangent space Tx is also identified with Rn for every xM, where the inner product is given by u,vuv. The parallel transport τγ for any path γ is the identity map. Let f:RnR be a (smooth) convex function. We assume L-smoothness of f when Gradient Descent (3.11) is considered. The gradient f(x)Rn and Hessian 2f(x)Rn×n are obtained by (f(x))i=(/xi)f(x) and (2f(x))ij=(2/xixj)f(x), respectively.

In this setting, the strong duality (Corollary 3.3) is written as

infpf(Rn)¯p=supuRn:u1f(u),(3.25)
where f(Rn)¯ is the closure of the gradient image f(Rn)={f(x)xRn}. This relation itself is deduced within Euclidean convex analysis as follows. Let f*:RnR{} be the Legendre-Fenchel conjugate of f:
f*(p)sup{p,xf(x)xRn}(pRn).

Then, the gradient space f(Rn)¯ is equal to the closure domf*¯ of the domain domf*{pRnf*(p)<} of f*. Indeed, this is because f(Rn)domf*f(Rn)¯, where the first inclusion follows from p=f(x)f*(p)=p,xf(x) and the second from f*(p)<infxRnf(x)p,x>infxRnf(x)p=0. Also, it is known in convex analysis (Rockafellar [53, theorems 13.1 and 13.3]) that f is equal to the support function of domf*. Summarizing, it holds that

f(Rn)¯=domf*¯={pRnu,pf(u)(uRn)}.(3.26)

In particular, the gradient space f(Rn)¯ is (closed) convex. Now, the equality in (3.25) is attained by the (uniquely determined) minimum-norm point p* of f(Rn)¯ and its negative direction p*/p*; see the proof of the next theorem. By Theorems 3.1 and 3.7, both f(x(t)) and f(xi) converge to p*, and both x(t) and xi converge to p*/p* in cone topology.

Theorem 3.15.

Let p* denote the minimum-norm point of f(Rn)¯. Suppose that κ*infxRnf(x)>0.

  • (1) f(x(t)) converges to p*, and x(t)/t converges to p*.

  • (2) f(xi) converges to p*, and xi/i converges to p*/L.

Proof.

It suffices to show the claims for x(t)/t and xi/i. We first verify that the unique minimizer of f over the unit sphere is written as p*/p*u*. Observe from the KKT-condition that {pRnu*,p=f(u*)} is a supporting hyperplane of f(Rn)¯ at p*. Then, for any unit vector v, it holds that f(v)v,p*p*=u*,p*=f(u*). In particular, p* and u*=p*/p* attain the equality in (3.25).

Then, by Theorem 3.1, we have limtx(t)=p*/p* “in cone topology.” This implies that

p*p*=limtx(t)x0x(t)x0=limtx(t)ttd(x(t),x0)=limtx(t)t1p*,(3.27)
where the last equality follows from Proposition 3.4 with p*=limtf(x(t))=κ*. Thus, we have the latter part of Theorem 3.15 part (1). The latter part of part (2) is analogously shown by using Theorem 3.7 and Proposition 3.10 (for the sequence version of (3.27)). □

Because p*=κ*ξ*, the expected convergence in Questions 3.6 and 3.12 holds in this case. We end this section with other interesting aspects.

3.3.1. Hessian Riemannian Gradient Flow.

Here we point out that the convergence of f(x(t)) to the minimum-norm point p* can also be explained via the theory of Hessian Riemannian gradient flows by Alvarez et al. [2]. Suppose for simplicity that the Hessian 2f(x) is nonsingular for every xRn. Then, by the inverse mapping theorem applied to xf(x) (with the inverse pf*(p)), we see that f(Rn) is an open (convex) set.

Consider the continuous gradient flow x(t), and let p(t)f(x(t)). One more differentiation in (3.1) yields

p˙(t)=2f(x(t))p(t).

From 2f(x(t))=(2f*(p(t)))1, we have the following ordinary differential equation (ODE) obeyed by p(t):

p˙(t)=(2f*(p(t)))1p(t),p(0)=f(x0).(3.28)

This can be interpreted as a gradient-flow ODE on a Riemannian manifold. Define a Riemannian metric ,f on open convex set f(Rn) by

u,vfu,2f*(p)v(u,vTp=Rn,pf(Rn)).(3.29)

In this metric, the gradient fg(p) of g:f(Rn)R is given by (2f*(p))1g(p). Then (3.28) is viewed as the gradient flow of the squared-norm function pp2/2:

p˙(t)=fp(t)22,p(0)=f(x0).(3.30)

This is a particular instance of Hessian Riemannian gradient flow in Alvarez et al. [2]. Then, by Alvarez et al. [2, proposition 4.4], the solution p(t) of (3.30) minimizes p2/2 over f(Rn)¯ in limit t, which proves limtf(p(t))=p*, the first part of Theorem 3.15 part (1).

3.3.2. Mirror Descent.

On the other hand, the discrete version (Theorem 3.15 part (2)) can be explained from the framework of mirror descent (Nemirovsky and Yudin [49]), where we consult Bubeck [10, chapter 4] for it. Consider a general optimization problem

Min.g(p)s.t.pD,(3.31)
where g is a differentiable convex function on an open convex set DRn. A mirror map Φ:DR is a differentiable strictly convex function such that Φ:DRn is bijective and Φ(p) if p goes to the boundary of D. A basic form of mirror descent produces the sequence p1,p2, in D according to the update
Φ(pi+1)Φ(pi)βig(pi),(3.32)
where βi>0 is a step-size. It is well-known (see, e.g., Vishnoi [60, section 7.4]) that this update coincides with the proximal gradient descent relative to the Bregman divergence DΦ(q,p)Φ(q)Φ(p)Φ(p),qp:
pi+1arg minpD{g(pi)+g(pi),ppi+1βiDΦ(p,pi)}.(3.33)

Under several assumptions on g,Φ, the solution pi (or the average solution (1/i)j=1ipj or the best solution ever) is shown to converge to a minimizer of g (see, e.g., Lu et al. [46], Vishnoi [60, chapter 7], Bubeck [10, theorem 4.2], and Beck [6, section 9.2]).

Now, consider the setting g(p)p2/2 and Df(Rn). That is, (3.31) is the minimum-norm point problem on f(Rn)¯. As a mirror map, we can choose the Legendre-Fenchel conjugate Φf*D. Then, the update (3.32) becomes

f*(pi+1)f*(pi)βipi.(3.34)

Define xiRn by xif*(pi). Because pi=f(xi), (3.34) becomes

xi+1xiβif(xi).(3.35)

This is nothing but gradient descent, where the above Hessian Riemannian gradient flow is viewed as the continuous limit 2f*(p(t))p˙(t)=p(t) of (3.34). Then, the first part of Theorem 3.15 part (2) can be deduced from Lu et al. [46, theorem 3.1]. Furthermore, an O(1/i) convergence rate is obtained if f*(p*)< ( Df*(p*,p)<). See Sakabe [54] for details.

It may be interesting to develop a manifold analogy of these observations, which may use the space f(M)CM in Hirai [30]. Related to this issue, in Section 4.1, we will consider an analogous gradient flow (Kirwan’s flow) in the complex projective space P(V).

3.3.3. Matrix Scaling and Geometric Programming.

The matrix scaling problem (Sinkhorn [58]) is this: For a given nonnegative matrix A=(aij)R+n×n, find positive diagonal matrices (scaling matrices) X, Y such that XAY approximates a doubly stochastic matrix, that is, (XAY)110 and (XAY)110. Define a convex function fA:Rn×RnR by

fA(x,y)logi,jaijexi+yj1x/n1y/n(xRn,yRn).(3.36)

From fA(x,y)=(XAY11,(XAY)11)/n for (X,Y)(ediagx,ediagy)n/i,jaijexi+yj, the required scaling matrices X, Y are obtained from (x,y) having small gradient-norm fA(x,y). Particularly, such a point (x,y) is obtained by minimizing fA.

This matrix scaling optimization falls into a more general class of convex optimization, called geometric programming, to which our results are applicable. A geometric program asks to minimize a function f:RnR of the following form:

f(x)=log=1Naeωx(xRn),(3.37)
where a>0 and ωRn for =1,2,,N. It is well-known (see, e.g., Bürgisser et al. [11]) that
  • f is L-smooth convex with Lmaxω2, and

  • f(Rn)¯=Conv{ω}[N].

Therefore, with L=2, by Gradient Descent (3.11) applied to (3.36), the gradient sequence fA(xi) converges to the minimum-norm point p* of Conv{ei+eji,j:aij>0}.

We will show in Section 4.2 for the general setting of operator scaling that the point p* and the limit of XAY are characterized by a canonical block-triangular form of A, known as (an extended version of) the DM-decomposition (Dulmage and Mendelsohn [17]; see also Murota [48, section 2.2.3]). A similar convergence property was earlier shown by Hayashi et al. [29] for the Sinkhorn algorithm (Sinkhorn [58]), the standard alternating minimization algorithm for (3.36), in which the gradient fA(x,y) and the scaled matrix XAY oscillate between two limit points described by the DM-decomposition.

4. Application

4.1. Norm-Minimization in Reductive Group Action

We consider the formulation of noncommutative optimization in Bürgisser et al. [13]; see also Hirai et al. [34]. Let GGLn be a connected reductive algebraic group over C, where we assume that it is self-adjoint G=G (via conjugation (Wallach [61, theorem 3.13])). Its Lie algebra g is the complexification of the Lie algebra k of a maximal compact subgroup K=GUn as g=k+ik, where ikpn. The inner product , on g is defined by X,YRetrXY. Let V be a finite-dimensional vector space over C. Let π:GGL(V) be a rational representation, where Π denotes its Lie algebra representation: Π(X)(d/dt)π(etX)t=0. Consider a K-invariant Hermitian inner product (,) and the associated norm ·=(·,·) on V. The norm-minimization problem over the orbit π(G)v of vV{0} is given by

inf.π(g)vs.t.gG.(4.1)

It turned out (e.g., Bürgisser et al. [13]) that this class of optimization problems has numerous, sometimes unexpected, applications and connections in various fields of mathematical sciences. The most fundamental problem is to ask whether the infimum is zero, that is, whether the origin 0 is in the orbit closure π(G)v¯. This is the semistability problem in GIT. The representation π gives rise to a Hamiltonian action (g,[v])[π(g)v] on the complex projective space P(V). The corresponding (modified4) moment map μ:Vik is given by

μ(v),H(v,Π(H)v)(v,v)(vV,Hik),(4.2)
where μ may be regarded as P(V)ik. The following theorem is fundamental:

Theorem 4.1

(Kempf-Ness Theorem, Hilbert-Mumford Criterion; see Georgoulas et al. [24, Theorem 8.5 (i), Theorem 12.4]). For vV{0}, the following conditions are equivalent:

  • (i) infgGπ(g)v=0.

  • (ii) infgGμ(π(g)v)>0.

  • (iii) There is a 1-parameter subgroup te(t) of G such that limtπ(e(t))v=0.

The orbit π(G)v in this situation is called unstable. Otherwise, it is called semistable. Accordingly, we call the 1-parameter subgroup e(t) in (iii) a destabilizing 1-PSG.

The unstability corresponds to the lower-unboundedness of the Kempf-Ness function Fv on the group G defined by

Fv(g)12logπ(g)v2(gG).(4.3)

Because · is K-invariant, the Kempf-Ness function is viewed as a function on the symmetric space K\G. By π(g)v2=(π(gg)v,v) and K\GPnG by Kggg, we may consider the following version of the Kempf-Ness function fv on PnG:

fv(x)log(π(x)v,v)(xPnG).(4.4)

It is clear that fv(gg)=2Fv(g). Then, fv is an L-smooth convex function such that the transported gradient of fv provides the moment map μ:

Lemma 4.2

(Bürgisser et al. [13]).

  • (1) fv is Nπ2-smooth convex, where Nπ is the maximum of the norm of a weight for π.

  • (2) τxIfv(x)=μ(π(x1/2)v).

The second property (2) is implicit in Bürgisser et al. [13] and follows from τxIH=x1/2Hx1/2 and fv(x),Hx=(d/dt)fv(x1/2etx1/2Hx1/2x1/2)t=0=μ(π(x1/2)v),x1/2Hx1/2I. In particular, for the Kempf-Ness function fv, the unboundedness is equivalent to the positivity of the minimum gradient-norm. Applying Corollary 3.3, we have:

Theorem 4.3.

infgGμ(π(g)v)=supξBIfv(ξ). If fv(ξ)<0, then tetξ is a destabilizing 1-PSG.

Proof.

infgGμ(π(g)v)=infxPnGμ(π(x1/2)v) follows from μ(π(ug)v)=uμ(π(g)v)u for uK, the polar decomposition g=ux for uK, xPnG, and xPnGxaPnG (because G is algebraic). The latter part can be seen from the definitions of the Kempf-Ness function (4.4) and the recession function (2.4). □

As seen below, this is a part of the theory of moment-weight inequality (Georgoulas et al. [24]), in which the recession function fv is essentially Mumford’s numerical invariant, called the μ-weight; see Lemma 4.13.

Consider applying gradient descent to fv:

xk+1=expxk(1Lfv(xk)),x0=I,(4.5)
where LNπ2. In this setting, updating group elements gk in G may be more suitable:
gk+1=e12Lμ(π(gk)v)gk,g0=I.(4.6)

This is the first-order algorithm in Bürgisser et al. [13]. Each of the two updates (4.5) and (4.6) has its own advantage. Their relation is given by:

Lemma 4.4.

xk=gkgk.

Proof.

If g+=e12Lμ(π(g)v)g and g=ux1/2 for uK, xPnG, then it holds that g+g+=ge1Lμ(π(g)v)g=x1/2ue1Lμ(π(ux1/2)v)ux1/2=x1/2e1Lμ(π(x1/2)v)x1/2=expx1Lfv(x), where the third inequality follows from μ(π(u)v)=uμ(v)u and the fourth from Lemma 4.2 part (2). □

For the semistable case, Bürgisser et al. [13] showed its iteration complexity to compute infgGπ(g)v and to find gG with μ(π(g)v)0. For the unstable case, our result (Theorem 3.7) implies that Gradient Descent (4.5) constructs a destabilizing 1-PSG in the limit, which is maximally destabilizing in the sense that it is obtained from the unique minimizer of fv over SI(PnG) (recall that SI denotes the unit sphere in TI). This special 1-PSG is the same as the one shown by Kempf [42].

Theorem 4.5.

Suppose that infgGπ(g)v=0. Let xk be the sequence of (4.5), and let uk be the sequence defined by xk=ed(xk,I)uk. Then uk converges to the unique minimizer ξ* of fv over SI, where tetξ* is a maximally destabilizing 1-PSG.

Unfortunately, because fv is not necessarily (upper semi)continuous, this theorem does not imply the algorithmic statement that tetuk is a destabilizing 1-PSG for some large k. Therefore, we need a certain rounding idea to obtain a destabilizing 1-PSG from uk. We see in Section 4.2 that such a rounding is possible for the left-right action.

We also consider convergence of the moment-map sequence μ(π(gk)v). Let Cπik=TI(PnG) denote a positive Weyl chamber: it is a convex cone with the property that for any Hik there is a unique point in Cπ, denoted by specH, satisfying specH=kHk for some kK. The moment polytope ΔvCπ is defined as the closure of the image of gspecμ(π(g)v):

Δv{specμ(π(g)v)gG}¯.

The convexity theorem by Guillemin and Sternberg [25], Guillemin and Sternberg [26], and Kirwan [43] says that it is a convex polytope.

Theorem 4.6

(Convexity Theorem (Guillemin and Sternberg [25], Guillemin and Sternberg [26], Kirwan [43])). Δv is a convex polytope.

By Lemma 4.2 part (2), the polar decomposition g=ux1/2 for gG, uK, xPnG, and μ(π(ux1/2)v)=uμ(π(x1/2)v)u, it holds that

infxPnGfv(x)=infxPnGμ(π(x1/2)v)=infgGμ(π(g)v)=infgGspecμ(π(g)v)=infpΔvp(4.7)

By Theorem 3.7, we have the convergence of specμ(π(gk)v)(=specμ(π(xk1/2)v)) along the gradient-descent trajectory, which is an analogue of Theorem 3.15 part (2).

Theorem 4.7.

Let p* be the minimum-norm point of Δv, and let Hk be the sequence defined by xk=ekHk/L. Suppose that infgGπ(g)v=0. Then, both specμ(π(gk)v) and spec(Hk) converge to p* for k.

Proof.

It suffices to show the claim for Hk. By Proposition 3.10 part (2), Proposition 3.11, and Lemma 4.2 part (2), it holds that

lim infkμ(π(xk1/2)v)+Hk=0.

Because specμ(π(xk1/2)v) converges to p* and Hk converges (to κ*ξ*), it must hold that spec(Hk) converges to p*. □

Question 3.12, if it is true, would imply the stronger convergence limkμ(π(xk1/2)v)=limkHk.

4.1.1. Moment-Weight Inequality and Gradient Flow of Moment-Map Squared.

Clearly, via Theorem 3.1, the above results (Theorems 4.5 and 4.7) hold for the gradient flow:

x˙(t)=fv(x(t)),x(0)=I.(4.8)

Our consideration of this case falls into the theory of moment-weight inequality by Georgoulas et al. [24], which builds upon the earlier work by Kempf, Kirwan, Mumford, and Ness in GIT, and the recent work by Chen and Sun [15] in K-stability. Here, we briefly summarize the relation by deducing an important part of the theory from our results in Section 3.1. We use notation g·[v][π(g)v] for the action on P(V). According to Georgoulas et al. [24, chapter 3], consider the gradient flow (Kirwan’s flow) of the squared-norm of the moment map on P(V):

ζ˙(t)=μ(ζ(t))22,ζ(0)=[v].(4.9)

This is the gradient flow of a real analytic function ζμ(ζ)2/2 on a compact Riemannian manifold P(V) (with respect to the Fubini-Study metric). By the standard argument of the Łojasiewicz gradient inequality, the limit of ζ(t) exists.

Theorem 4.8

(Convergence Theorem (Georgoulas et al. [24, Theorem 3.3])). The limit ζlimtζ(t) exists.

Further, the limit ζ attains the infimum of the moment-map norm over the orbit G·[v] in P(V).

Theorem 4.9

(Moment-Limit Theorem (Georgoulas et al. [24, Theorem 6.4])). Let ζ(t) be the solution of (4.9), and let ζlimtζ(t). Then it holds that

μ(ζ)=infgGμ(g·[v]).(4.10)

The equality (4.10) can be understood from Theorem 3.1 as follows. Regard G as a Riemannian manifold by the right-invariant Riemannian metric X,YgRetrXg1(Yg1) for X,YTg,gG, and consider the gradient flow of Fv on G:

g˙(t)=Fv(g(t)),g(0)=I.(4.11)

Then, the solution ζ(t) is obtained from the action of g(t) as follows:

Theorem 4.10

(Georgoulas et al. [24, Theorem 4.1 (ii)]). The solution ζ(t) of (4.9) is represented as ζ(t)=g(t)·[v] for the solution g(t) of (4.11).

Proof sketch.

Define φ:GP(V) by gg·[v]. Then, by adapting Georgoulas et al. [24, (4.3)] with our notation, it holds that dφgFv(g)=μ(g·[v])22. Thus, (d/dt)(g(t)·[v])=(d/dt)φ(g(t))=dφg(t)g˙(t)=dφg(t)Fv(g(t))=μ(g(t)·[v])22, implying that g(t)·[v] is the solution ζ(t) of (4.9). □

We can see that Fv(g)=μ(π(g)v)g and (4.6) is the discretization (gradient descent) of (4.11). Analogously to Lemma 4.4, the relation between x(t) and g(t) is given by:

Lemma 4.11.

x(2t)=g(t)g(t).

Proof.

For HTgg(PnG), it holds thatfv(gg),Hgg=ddtt=0fv(getgHg1g) =ddtt=02Fv(etgHg1/2g)=Fv(g),gHg=gFv(g)+Fv(g)g,H/2gg. Hence, it holds that 2fv(gg)=gFv(g)+Fv(g)g, and

dds(g(s)g(s))=g˙(s)g(s)+g(s)g˙(s)=Fv(g(s))g(s)g(s)Fv(g(s))=2fv(g(s)g(s)).

Thus, x(t)g(t/2)g(t/2) satisfies (4.8). □

Therefore, the moment-limit theorem (Theorem 4.9) follows from μ(ζ)=limt μ(π(g(t))v)=limtμ(π(x(t)1/2)v)=infxPnGfv(x(t))=infgGμ(g·[v]). Accordingly, an analogue of Theorem 3.15 part (1) (or the continuous version of Theorem 4.7) is the following.

Theorem 4.12.

Let p* be the minimum-norm point of Δv, and let H(t) be the function defined by x(t)=etH(t). Suppose that infgGπ(g)v=0. Then, both specμ(π(g(t))v) and spec(H(t)) converge to p* for t.

Proof.

It suffices to show the claim for H(t). By Proposition 3.4 part (2), Proposition 3.5, and Lemma 4.2 part (2), it holds that

lim inftμ(π(x(t)1/2)v)+H(t)=0.(4.12)

The rest is the same as in the proof of Theorem 4.7. □

Contrary to g(t)·[v], we do not know whether x(t)1/2·[v] converges.5 At least, if Question 3.6 is affirmative, then μ(π(x(t)1/2)v) converges. On the other hand, μ(g(t)·[v]) converges to μ(ζ), H(t) converges to H(=κ*ξ*), and they have the same spectrum p*. Therefore, there is uK such that uμ(ζ)u=H. This fact is a part of the generalized Kempf existence theorem (Georgoulas et al. [24, theorem 10.4, (10.9)]). In particular, Theorem 4.7 can be viewed as a discrete version of the moment-limit theorem, though we do not know whether ζkgk·[v] converges.

We next explain the moment-weight inequality. The (restriction of) μ-weight wμ:P(V)×ikR{} is defined by

wμ([v],H)limttrμ(π(etH)v)H([v]P(V),Hik=TI(PnG)),(4.13)
where the existence of the limit is seen in the proof of the next lemma. The μ-weight is nothing but the recession function of fv.

Lemma 4.13

(See Georgoulas et al. [24, Lemma 5.2]). wμ([v],H)=fv(H).

Proof.

By recalling (2.4), it holds that fv(H)=limt(d/dt)fv(etH)=limttrμ(π(etH)v)H=wμ([v],H), where the second equality follows from Lemma 4.2 part (2). □

We now state the main part of the theory of moment-weight inequality (for linear actions).

Theorem 4.14

(Moment-Weight Inequality (Georgoulas et al. [24, Theorems 6.7, 10.1, 10.2, 10.4])). It holds that

infgGμ(g·[v])supHik{0}wμ([v],H)H.(4.14)

Suppose that κ*infgGμ(g·[v])>0. Then the equality in (4.14) holds, and the supremum is attained by unique H*ik with H*=1, obtained as follows: Let H(t) be defined by x(t)=etH(t) for solution x(t) of (4.8). Then the limit HlimtH(t) exists, H=κ*, and H*=H/H.

From our convex optimization perspective, the moment-weight inequality (4.14) is explained by the weak duality (Lemma 2.2). The equality case is explained by the strong duality (Theorem 3.1), the gradient-flow construction of the unique minimizer of fv, and the formula of the velocity of escape (Proposition 3.4).

We finally state one well-known important uniqueness property of minimizers of the moment-map norm over G·[v]¯,

Theorem 4.15

(Second Ness Uniqueness Theorem (Georgoulas et al. [24, Theorem 6.5])). For ζ,ζG·[v]¯, if μ(ζ)=μ(ζ)=infgGμ(g·[v]), then ζK·ζ.

In the next subsection, we characterize such minimizers for the left-right action.

4.2. Operator Scaling and Its Gradient-Flow Limit

Let A=(A1,A2,,AN)CN(n×m) be an N-tuple of n×m matrices over C. Let pR+n, qR+m be nonnegative vectors with the same sum ipi=jqj, where p, q are arranged as

p1p2pn,q1q2qm.(4.15)

The operator scaling problem, originally introduced by Gurvits [27] for p=q=1 and extended by Franks [18] for general p, q, is to ask: For a given accuracy ϵ0, find gGLn,hGLm such that

=1NgAhhAgdiagp2+=1NhAggAhdiagq2ϵ2,(4.16)
where the norm is the Frobenius norm. A matrix tuple A is said to be (approximately) (p,q)-scalable if for every positive ϵ>0 there are gGLn,hGLm satisfying (4.16). If some g, h satisfy (4.16) for ϵ=0, then A is called exactly (p,q)-scalable, and gAh is called a (p,q) -scaling of A. The operator scaling is a quantum generalization of the matrix scaling, and turns out to have rich applications (see Franks [18], Garg and Oliveira [21], and Garg et al. [22, 23]). For simplicity, we assume that the left and right common kernels of A are both trivial: kerA={0} and kerA={0}.

In view of the previous section, the operator scaling is interpreted as the moment polytope membership of the left-right action π:SLn×SLmGL(CN(n×m)) defined by

π(g,h)(B)gBh=(gB1h,gB2h,,gBNh),(4.17)
where B=(B1,B2,,BN)CN(n×m). A maximal compact subgroup K of SLn×SLm is given by SUn×SUm, and a K-invariant Hermitian product , on V=CN(n×m) is given by B,C=1NtrBC. From Π(X,Y)(B)=XB+BY, we see that the moment map μ:CN(n×m)pn1×pm1 is given by
μ(B)=(μ1(B),μ2(B))=1B2(=1NBB,=1NBB)(1nI,1mI).(4.18)

A positive Weyl chamber is taken as the set of diagonal matrices (diagp,diagq) with p, q satisfying (4.15) and 1p=1q=0. We regard it as a subset of Rn×Rm. Then the moment polytope ΔA consists of vectors of eigenvalues of μ(B) over BSLn·A·SLm¯ (the closure of the SLn×SLm-orbit of A). Comparing (4.18) with (4.16), we have:

Lemma 4.16.

A is (p,q)-scalable if and only if (p/c1/n,q/c1/m) belongs to ΔA, where cipi=jqj.

We consider the operator scaling problem for the most basic case: (p,q)=(1/n,1/m). Then, it holds that

A is (1/n,1/m)-scalable(0,0)ΔAinfg,hμ(gAh)=0.

Accordingly, the Kempf-Ness theorem (Theorem 4.1) links with the (1/n,1/m)-scaling problem, and is sharpened as follows. Let SA denote the family of pairs of vector subspaces XCn, YCm such that uAv¯=0 for all uX, vY, [N]. This is (essentially) the same as the family of independent subspace pairs in Franks [18] and Franks et al. [19]. Although SA is an infinite set, it turns out in Lemma 4.22 that a certain maximal subset EA of SA is a finite set.

Theorem 4.17

(Characterization of Scalability (Gurvits [27])). The following are equivalent:

  • (i) infgSLn,hSLmgAh>0.

  • (ii) A is (1/n,1/m)-scalable.

  • (iii) For all (X,Y)SA, it holds that (1/n)dimX+(1/m)dimY1.

This theorem was originally stated for the case n=m, in which the condition (iii) is simply written as dimYdim=1NAkY for every subspace Y. A subspace violating this condition is called a shrunk subspace in Franks et al. [19], Garg et al. [23], and Ivanyos et al. [37, 38]. The nm generalization is straightforward and is included in more general results for the operator scaling with marginals by Franks [18].

A vector-space pair (X,Y)SA violating part (iii) actually gives rise to a destabilizing 1-PSG as follows: Choose σSUn and τSUm such that the first r rows of σ span X and the first s rows of τ span Y, where (r,s)(dimX,dimY). Then one can see that t(etdiag(1[r](r/n)1)σ,etdiag(1[s](s/m)1)τ) is a destabilizing 1-PSG.

Further, the strict inequality in part (iii) brings exact scalability.

Theorem 4.18

(Exact Scalability (Gurvits [27])). If (1/n)dimX+(1/m)dimY<1 for all (X,Y)SA other than ({0},Cm) and (Cn,{0}), then A is exactly (1/n,1/m)-scalable.

The exact case corresponds to the existence of g, h with μ(gAh)=0. By Lemma 4.2 part (2), this is the case where the Kempf-Ness function fA has an optimum (= a point of zero gradient). Then, Theorem 4.18 can be deduced from General Property (2.5) of the recession function fA (given explicitly in (4.21) below). Here, the Kempf-Ness function fA:Pn1×Pm1R is written as

fA(x,y)logtr=1NxAyA(xPn1,yPm1).(4.19)

Lemma 4.19

(Bürgisser et al. [13]). fA is 2-smooth convex.

Now Theorem 4.3 (Corollary 3.3, or the moment-weight inequality (Theorem 4.14)) sharpens (ii) (iii) of Theorem 4.17 in the following min-max (inf-sup) form:

Theorem 4.20

(Duality Theorem for the Scalability Limit of Operator Scaling).

infg,h(=1NgAhhAg1nI,=1NhAggAh1mI)=supa,b,σ,τmax{ai+bj,(σAτ)ij0},(4.20)

where the infimum in the left-hand side (LHS) is taken over all gGLn, hGLm with gAh=1 and the supremum in the right-hand side (RHS) is taken over all σSUn, τSUm, aRn, bRm with (a,b)1 and 1a=1b=0.

Inspired by this formula, Hirai [32] obtained a cleaner formula by using the trace norm instead of the Frobenius norm.

Proof.

It suffices to show that fA is equal to the objective function of the RHS in (4.20). Here (G,H)pn1×pm1 is written as (G,H)=(σdiagaσ,τdiagbτ) for σSUn,τSUm, aRn, bRm with 1a=1b=0. Then we have

fA(G,H)=limt1tlogtretGAetHA=limt1tlog,i,j|(σAτ)ij|2et(ai+bj)=max{ai+bj,(σAτ)ij0},(4.21)
where we used limt1tlogkeck+tdk=maxkdk in the last equality. □

In the sequel, we assume that A is not (1/n,1/m)-scalable, and analyze the asymptotic behavior of gradient descent for fA:

(xk+1,yk+1)=expxk,yk(1LfA(xk,yk)),(x0,y0)=(I,I),(4.22)
where we let L2 by Lemma 4.19. The corresponding group update (4.6) in SLn×SLm is given by
(gk+1,hk+1)=(e12Lμ1(gkAhk)gk,e12Lμ2(gkAhk)hk)(g0,h0)=(I,I).(4.23)

Then (xk,yk)=(gkgk,hkhk) by Lemma 4.4. We address the following problem.

Problem 4.21.

Characterize the following (A), (B), and (C):

  • (A) The limit of specμ(gkAhk) (= the minimum-norm point of ΔA).

  • (B) The limit of (xk,yk) in cone topology (= the unique minimizer of fA).

  • (C) The limit of [gkAhk] in P(CN(n×m)) (= the minimizer of the moment-map norm μ over [SLn·A·SLm]¯).

We show that these are characterized by a certain simultaneous block-triangular form of A. This block-triangular form is a vector-space generalization of the classical Dulmage-Mendelsohn decomposition (Dulmage and Mendelsohn [17]) for a bipartite graph and its associated matrix. We introduce our generalized DM-decomposition in a way analogous to Hayashi et al. [29, section 3] for the classical setting, where the essential idea of the construction can be partly found in Ito et al. [36]. Iwamasa et al. [39] pointed out that our DM-decomposition is a special case of the Harder-Narasimhan filtration for generalized Kronecker quivers.

Recall the family SA defined before Theorem 4.17. Define a map ϕ:SAR+2 by

ϕ(X,Y)(dimX,dimY)((X,Y)SA).

Consider the convex hull Convϕ(SA)R+2; see the left of Figure 1. Let EA denote the subset of (X,Y)SA such that ϕ(X,Y) is an extreme point of Convϕ(SA) not equal to (0,0).

Figure 1. Convϕ(SA) in (y,x)-plane (left) and a DM-decomposition of A (right). The slope nα/mα is increasing by the convexity of Convϕ(SA).
Lemma 4.22.

For (X,Y),(X,Y)EA, if dimXdimX and dimYdimY, then XX and YY. In particular, EA is a finite set, and ϕ is injective on EA.

Proof.

We may suppose that ϕ(X,Y) and ϕ(X,Y) are equal or on an adjacent pair of extreme points. Observe (XX,Y+Y),(X+X,YY)SA. By the dimension identity of vector spaces, it holds that

ϕ(XX,Y+Y)+ϕ(X+X,YY)=ϕ(X,Y)+ϕ(X,Y).(4.24)

We claim that X=X+X and Y=YY, which implies the statement. Otherwise, by (4.24), ϕ(XX,Y+Y) or ϕ(X+X,YY) goes beyond Convϕ(SA), which contradicts (XX,Y+Y),(X+X,YY)SA. □

Therefore, EA={(Xα,Yα)}α=0θ can be arranged as

Cn=X0X1Xθ={0},{0}=Y0Y1Yθ=Cm,(4.25)
where CnX1 and Yθ1Cm follow from the assumption that the common left and right kernels of A are trivial. For each α[θ], let LAα denote the subset consisting of (X,Y)SA such that ϕ(X,Y) belongs to the edge between ϕ(Xα1,Yα1) and ϕ(Xα,Yα). As in the proof of Lemma 4.22, we have:

Lemma 4.23.

If (X,Y),(X,Y)LAα, then (X+X,YY),(XX,Y+Y)LAα. In particular, LAα is a modular lattice with respect to the partial order (X,Y)(X,Y)XX,YY, where the minimum and maximum elements are given by (Xα1,Yα1) and (Xα,Yα), respectively.

For each α[θ], consider a maximal chain (flag) of LAα:

Xα1=Xα,0Xα,1Xα,θα=Xα,Yα1=Yα,0Yα,1Yα,θα=Yα,
where the length θα of the chain is uniquely determined by the Jordan-Dedekind chain condition. The union α=1θβ=0θα{(Xα,β,Yα,β)} is a maximal chain of the whole lattice LAα=1θLAα, and is called a DM-flag. Its subset EA is called the coarse DM-flag, which is uniquely determined by A. From a DM-flag, we obtain a simultaneous block-upper-triangular form of A as follows. Consider gGLn including, as row vectors, a basis of Xα,β for each α,β. Similarly, consider hGLm including, as row vectors, a basis of Yα,β for each α,β. Suppose that they are positioned in the last rows for g and first rows for h. Then, the matrices B=gAh are simultaneously block-triangularized, as in the right of Figure 1. We call B=(B) a DM-decomposition6 of A. When g (resp. h) is restricted to span only Xα (resp. Yα), it is called a coarse DM-decomposition of A.

For abuse of notation, Xα, Xα,β, Yα, and Yα,β also denote the index sets of the corresponding rows and columns of B. Define ordered partitions (Iα) of [n], (Jα) of [m], and their refinements (Iα,β), (Jα,β) by

IαXα1Xα,JαYαYα1(α[θ]),(4.26)
Iα,βXα,β1Xα,β,Jα,βYα,βYα,β1(β[θα]).(4.27)

Let B^=(B^) denote the matrix tuple of block-diagonal matrices obtained from B by replacing each (upper) off-diagonal block Bk[Iα,β,Jα,β] ((α,β)(α,β)) with the zero matrix. We call B^ a diagonalized DM-decomposition of A. A diagonalized version of a coarse DM-decomposition is defined analogously.

Let nα|Iα| and mα|Jα|. By convexity of Convϕ(SA), it holds that

n1m1<n2m2<<nθmθ.(4.28)

Define (p*,q*)Rn×Rm by

p*1n1+1CAα=1θmαnα+mα1Iα,q*1m1+1CAα=1θnαnα+mα1Jα,(4.29)
where the constant CA is defined by
CAα=1θnαmαnα+mαnmn+m,(4.30)
where the inequality is seen from the concavity of the harmonic mean (x,y)2(1/x+1/y)1. We see from (4.28)(4.30) that (p*,q*) belongs to the positive Weyl chamber:
p1*p2*pn*,q1*q2*qm*,1p*=1q*=0.(4.31)

Recalling Pn1PnSLn, define (G*,H*)pn1×pm1=TI,I(Pn1×Pm1) by

G*(σ*)diag(p*)σ*,H*(τ*)diag(q*)τ*,(4.32)
where σ* is a unitary matrix having a basis of Xα in the last nα rows and τ* is a unitary matrix having a basis of Yα in the first mα rows. By using these notions, we give a solution to Problem 4.21 parts (A) and (B):

Theorem 4.24.

  • (1) (p*,q*) is the minimum-norm point of ΔA, and

  • (2) (G*,H*)/(G*,H*) is the unique minimizer of fA over SI,I(Pn1×Pm1), where it holds that

    (p*,q*)2=fA(G*,H*)=1CA1n1m.(4.33)

Corollary 4.25.

Let (gk,hk) and (xk,yk) be the sequences in (4.23) and (4.22), respectively.

  • (1) specμ(gkAhk) converges to (p*,q*) for k.

  • (2) (xk,yk) converges, in cone topology, to (G*,H*)/(G*,H*). More precisely, the sequence (Gk,Hk) defined by (xk,yk)=(etGk/L,etHk/L) converges to (G*,H*) for k.

Proof of Theorem 4.24.

We first show (4.33). From the definitions of (p*,q*) and CA, we have

(p*,q*)+(1/n,1/m)2=1CA2α=1θnαmα2(nα+mα)2+mαnα2(nα+mα)2=1CA2α=1θnαmαnα+mα=1CA.

By the last equation in (4.31), we have

(p*,q*)2=(p*,q*)+(1/n,1/m)2(1/n,1/m)2=1/CA1/n1/m(>0).

On the other hand, B=σ*A(τ*) is a coarse DM-decomposition, that is, (σ*A(τ*))ij=0 for each (i,j)Iα×Jα with α>α. By (4.21) in the proof of Theorem 4.20, the value of the recession function fA(G*,H*) is given by

fA(G*,H*)=max{pi*qj*,(i,j)Iα×Jα:αα,(σ*A(τ*))ij0}.(4.34)

Observe from (4.28)(4.30) that

pi*qj*{=1/n+1/m1/CAif(i,j)Iα×Jα,<1/n+1/m1/CAif(i,j)Iα×Jα:α<α,>1/n+1/m1/CAif(i,j)Iα×Jα:α>α.(4.35)

Hence, the maximum in (4.34) is attained by the index of any nonzero element of any diagonal block of σ*A(τ*), which implies fA(G*,H*)=1/n+1/m1/CA, and (4.33).

To complete the proof, it suffices to show (p*,q*)ΔA because (p*,q*) and (G*,H*)/(G*,H*) would attain inf(p,q)ΔA(p,q)=sup(G,H)BI,IfA(G,H). This is done in the next proposition. □

Proposition 4.26.

Let B^ be a diagonalized DM-decomposition of A.

  • (1) B^ is exactly (p*+1/n,q*+1/m) -scalable.

  • (2) [B^][SLn·A·SLm]¯.

In particular, it holds that (p*,q*)ΔA.

Proof.

Part (1). We first show:

Claim. B[Iα,β,Jα,β] is exactly (1/|Iα,β|,1/|Jα,β|)-scalable.

Proof of Claim.

We can assume that A is already equal to a DM-decomposition B, where all Xα,β, Yα,β are coordinate subspaces. Suppose indirectly that B[Iα,β,Jα,β] is not exactly (1/|Iα,β|,1/|Jα,β|)-scalable. Then, by Theorem 4.18, there is nontrivial (Z,W)SB[Iα,β,Jα,β] such that (1/|Iα,β|)dimZ+(1/|Jα,β|)dimW1. Then (Xα,β+Z,Yα,β1+W) belongs to SA. However, ϕ(Xα,β+Z,Yα,β1+W) goes beyond Convϕ(SA) or lies on the interior of the segment between ϕ(Xα,β1,Yα,β1) and ϕ(Xα,β,Yα,β). The former case is obviously impossible. The latter case is also impossible because of the maximality of the chain {(Xα,β,Yα,β)} in LA. □

We observe from nα/mα=|Iα,β|/|Jα,β| that (mα,nα) is a constant multiple of (1/|Iα,β|,1/|Jα,β|). By the claim, for each α,β, we can choose scaling matrices gα,β,hα,β to make B[Iα,β,Jα,β] an exact (1/{CA(nα+mα)})(mα1,nα1)-scaling. Then, for gα,βgα,β, hα,βhα,β, the scaling gB^h is a desired (p*+1/n,q*+1/m)-scaling.

Part (2). Let B be a DM-decomposition of A, where BSLn·A·SLm. For each α,β and t>0, by B[Xα,β,Yα,β]=O, it holds that

(etdiag1Xα,βBetdiag1Yα,β1)ij={BijetifiXα,β,jYα,β,Bijotherwise.(4.36)

Let Rα,β|Xα,β|/n and Sα,β(|Yα,β|m)/m. For t>0, define atSLn and btSLm by

atetRetdiagα,β1Xα,β,btetSetdiagα,β1Yα,β1.

By (4.36), the scaling atBbt is written as

atBbt=e(R+S)t(B^+Et)
for the diagonalized DM-decomposition B^ of B and matrix Et converging to zero for t. This implies that limt[atBbt]=limt[B^+Et]=[B^][SLn·A·SLm]¯. Because B^ admits an exact (p*+1/n,q*+1/m)-scaling, B*=gB^h, by Lemma 4.16 and 1p*=1q*=0, we conclude that (p*,q*)ΔA. □

Now the sequence of the scaled matrices along the gradient-descent trajectory accumulates to the SUn×SUm-orbit of a diagonalized DM-decomposition B^, providing a (partial) solution of Problem 4.21 part (C):

Theorem 4.27.

Let B^ be a diagonalized DM-decomposition of A, and let B* be a (p*+1/n,q*+1/m)-scaling of B^. Then [gkAhk] accumulates to points in [SUn·B*·SUm] for k.

Proof.

It holds that μ(B*)=(diagp*,diagq*). Thus, B* attains the infimum of μ(B) over [B][SLn·A·SLm]¯, which is also the limit of μ(gkAhk). By the second Ness uniqueness theorem (Theorem 4.15), we have the claim. □

For the gradient flow (g(t),h(t)) of the Kempf-Ness function FA on the group SLn×SLm, because of the convergence theorem (Theorem 4.8), [g(t)Ah(t)] converges to a point σB*τ for some σSUn, sτSUm.

Although B* is also a diagonalized DM-decomposition of A, it is not clear how to remove the unitary indeterminacy from [gkAhk] and to extract the DM-structure of B*. This is possible for the coarse DM-structure as follows:

Theorem 4.28.

Let (Gk,Hk) be the sequence defined by (xk,yk)=(ekGk/L,ekHk/L). Suppose that Gk=σkdiagakσk and Hk=τkdiagbkτk for unitary matrices σk, τk and nondecreasing and nonincreasing vectors ak and bk, respectively. Then σkAτk accumulates to coarse DM-decompositions. The convergence is linear in the following sense: there are c>0, M>0 such that for all kM, [N] it holds that

|(σkAτk)ij|eck((i,j)Iα×Jα:α>α).

Proof.

By Theorems 3.7 and 4.24 and Lemma 3.9 part (1), it holds that

1L(1CA1n1m)=limkfA(xk,yk)2L=limkfA(xk+1,yk+1)fA(xk,yk)=limkfA(xk,yk)k,(4.37)
where the final equality follows from (2.1) for akfA(xk+1,yk+1)fA(xk,yk).

Because efA(xk,yk)=trxkAykA=,i,j|(σkAτk)ij|2e(aik+bjk)k/L, we have

,i,j|(σkAτk)ij|2e(aik+bjk)k/LfA(xk,yk)=1.

Suppose that the index (i,j) is in a lower triangular block. By (ak,bk)k(p*,q*) (Corollary 4.25 part (2)) and (4.37), it holds that

(aik+bjk)k/LfA(xk,yk)kk1L(pi*qj*1n1m+1CA)>0,
where the inequality follows from (4.35). Therefore, for some c>0 and M>0, it holds that (aik+bjk)k/LfA(xk,yk)ck for all k>M. Then |(σkAτk)ij|2eck1 for all kM. □

Remark 4.29.

Suppose that μ(xk1/2Ayk1/2) converges, or more strongly, the convergence of Question 3.12 is true. Then it holds that limkμ(xk1/2Ayk1/2)+(Gk,Hk)=0. This implies

limkμ(ediagak/2σkAτkediagbk/2)+(diagak,diagbk)=0.(4.38)

Because (ak,bk)(p*,q*), the scaling sequence A(k)(ediagak/2σkAτkediagbk/2)/gkAhk accumulates to (p*+1/n,q*+1/m)-scalings. From the coarse DM-structure of σkAτk in the limit, one can see that A(k) accumulates to diagonalized coarse DM-decompositions. Although our numerical experiment supports such convergence, our results imply only lim infk=0 in (4.38).

We end this subsection with some implications of these results.

4.2.1. On Finding a Destabilizing 1-PSG.

Suppose that A is not (1/n,1/m)-scalable. Consider (X*,Y*)EA mapped to the extreme point (r*,s*) of Convϕ(SA) with the property that it maximizes r among all extreme points (r,s) maximizing r+s. The subspace pair (X*,Y*) violates (iii) in Theorem 4.17 and is a special certificate of unscalability, called dominant in Franks et al. [19]. By Theorem 4.28, after a large number k of iterations, the last r* rows of σk and the first s* rows of τk become bases of an ϵ -approximate dominant pair (Xϵ*,Yϵ*) in the sense that |uAv¯|ϵ for all and all unit vectors uXϵ*,vYϵ*. Franks et al. [19] devised a procedure to round such an ep(n,m,N,b)-approximate dominant pair into the exact dominant pair (X*,Y*), where p is a polynomial and b is the bit complexity of A. Hence, if we would establish global linear convergence in Theorem 4.28, a polynomial number of iterations of Gradient Descent (4.22) would suffice to recover the dominant pair and a destabilizing 1-PSG.

4.2.2. Matrix Scaling Case.

An n×m matrix M=(aij) is viewed as a matrix tuple A=(aijeiej)ij:aij0. Consider the left-right action on A, in which the group is restricted to the subgroup STn×STmSLn×SLm consisting of diagonal matrices. The corresponding scaling problem is nothing but the matrix scaling problem of the nonnegative matrix (|aij|2); see Section 3.3. The above results are also applicable to this setting. Indeed, the gradient fA is a pair of diagonal matrices. Then, the gradient flow/descent belongs to the diagonal subspace in Pn1×Pm1, and is viewed as the gradient flow/descent for the geometric programming objective (3.36) in matrix scaling. Here, all subspaces Xα,Yα,Xα,β,Yα,β are coordinate subspaces. Hence, a DM-decomposition B is obtained by row and column permutations, and is equivalent to the original (extended) DM-decomposition of M. In Remark 4.29, the unitary matrices σk and τk are permutation matrices, and all lower triangular blocks of A(k) become zero matrices after finitely many iterations. Also, all upper triangular blocks of A(k) converge to zero matrices. In particular, the expected convergence to the diagonalized DM-decomposition B^ is true. This convergence property is almost the same as the one for the Sinkhorn algorithm. Indeed, Hayashi et al. [29] showed that this limit (Sinkhorn limit) oscillates between the (1,α(nα/mα)1Jα)-scaling Br* and (α(mα/nα)1Iα,1)-scaling Bc* of B^.

4.2.3. On the Limit of the Operator Sinkhorn Algorithm.

This suggests an expectation of the limiting behavior of the operator Sinkhorn algorithm (Gurvits’ algorithm), the standard algorithm for the operator scaling problem. The operator Sinkhorn algorithm is viewed as alternating minimization of fA(x,y), where each step scales AgA with μ(A)=(O,) and AAh with μ(A)=(,O) alternatively. When it is applied to the (p*+1/n,q*+1/m)-scaling B* of a diagonalized DM-decomposition B^, the resulting scaling sequence oscillates between the (1,α(nα/mα)1Jα)-scaling and (α(mα/nα)1Iα,1)-scaling of B*. With the view of Theorem 4.27 and the matrix scaling case above, it is reasonable to conjecture that it oscillates between orbits Un·Br*·Um and Un·Bc*·Um, where Br* (resp. Bc*) is a (1,α(nα/mα)1Jα)-scaling (resp. (α(mα/nα)1Iα,1)-scaling) of B^.

4.3. Kronecker Form of a Matrix Pencil

Finally, we discuss the special case of N=2, that is, A=(A1,A2). In this case, A is naturally identified with a matrix pencil sA1+A2C(s)n×m, where s is an indeterminate. Here we reveal a connection to the Kronecker canonical form of sA1+A2, and suggest a new numerical method for finding the Kronecker structure based on gradient descent.

A pencil sA1+A2 is called regular if n=m and det(sA1+A2)0 for some sC. Otherwise, sA1+A2 is called singular. For simplicity, we assume (again) that kerA1kerA2={0} and kerA1kerA2={0}. The Kronecker form is a canonical form of a (singular) pencil under transformation (sA1+A2)g(sA1+A2)h by gGLn, hGLm. The standard reference of the Kronecker form is Gantmacher [20, chapter XII]; see also Murota [48, section 5.1.3] for its importance in systems analysis. For a positive integer ϵ, define ϵ×(ϵ+1) matrix Lϵ by

(Lϵ)ij{1ifj=i,sifj=i+1,0otherwise.

Theorem 4.30

(Kronecker Form; Gantmacher [20, Chapter XII]). There are gGLn,hGLm such that

g(sA1+A2)h=Lϵ1Lϵ2Lϵc(sC+D)LηdLηd1Lη1,(4.39)
where sC+D is a regular pencil, and ϵ1,ϵ2,,ϵc, η1,η2,,ηd are positive integers determined as follows:
  • ϵj is the minimum degree of a polynomial vector xj(s) in kersA1+A2 that is linearly independent from x1(s),x2(s),,xj1(s) over C(s).

  • ηj is the minimum degree of a polynomial vector yj(s) in ker(sA1+A2) that is linearly independent from y1(s),y2(s),,yj1(s) over C(s).

The indices ϵ1ϵc,η1ηd, called the minimal indices, are uniquely determined. If n=m and sA1+A2 is singular, then the Kronecker form has a zero block with the sum of row and column numbers greater than n. Therefore, by Theorem 4.17, we have:

Corollary 4.31.

A pencil sA1+A2 is regular if and only if n=m and (A1,A2) is (1/n,1/n)-scalable.

We point out a further connection that the Kronecker form (4.39) is viewed as almost a DM-decomposition. Let b denote the number of diagonal blocks of gAh in (4.39). For γ[b], let Iγ and Jγ denote the row and column index sets, respectively, of the γ-th diagonal block of gAh. Define Xγ by the vector subspace spanned by the rows of g of indices in Iγ+1Iγ+2Ib. Similarly, define Yγ by the vector subspace spanned by the rows of h having indices in J1Jγ. We let (X0,Y0)(Cn,{0}) (and (Xb,Yb)=({0},Cm)). Suppose that sC+D(=g(sA1+A2)h[Ic+1,Jc+1]) exists and is an n0×n0 upper triangular matrix. Let Zβ denote the vector space spanned by the rows of g having the last n0β indices in Ic+1, and let Wβ denote the vector space spanned by the rows of h having the first β indices in Jc+1. Let Xc,βXc+1+Zβ and Yc,βYc+Wβ, where + is the direct sum. Consider all indices γ with (|Iγ|,|Jγ|)(|Iγ+1|,|Jγ+1|), and suppose that they are ordered as 0γ0<γ1<<γθb.

Proposition 4.32.

  • (1) {(Xγα,Yγα)}α=0θ is the coarse DM-flag of (A1,A2).

  • (2) Suppose that sC+D is an n0×n0 upper triangular pencil. Then the union of {(Xγ,Yγ)}γ=0b and {(Xc,β,Yc,β)}β=1n01 is a DM-flag of (A1,A2).

Proof.

Part (1). Suppose that EA consists of (Xα,Yα) for α=0,1,2,θ, arranged as in (4.25). We show (Xα,Yα)=(Xγα,Yγα) for α=0,1,2,θ=θ. Consider the convex hull KA of (0,0) and ϕ(Xγ,Yγ) for all γ. Then KA belongs to Convϕ(SA), and the maximal faces of KA are composed of the line segments connecting points ϕ(Xγ,Yγ) from γ=0 to b with bending points ϕ(Xγα,Yγα).

We show KA=Convϕ(SA) by induction on the number b of diagonal blocks. Consider the base case b=1 where the Kronecker form consists of a single block. It suffices to show that EA={(Cn,0),(0,Cm)}. Suppose that sA1+A2 is an n0×n0 regular pencil sC+D. By regularity, there is no (X,Y)SA with dimX+dimY>n0 (otherwise sA1+A2 is singular over C(s)). This means no point in ϕ(SA) beyond the line segment between (n0,0) and (0,n0). Therefore, we have EA={(Cn,0),(0,Cm)}. Suppose that sA1+A2=Ln. Suppose to the contrary that there is (X,Y)SA with dimX/n+dimY/(n+1)>1. By basis change, we may assume that

sA1+A2=(BCOD),
where O is the r×s zero matrix for (r,s)(dimX,dimY). By r1 and r+sn+1, B is a pencil of nr rows and s columns with s>nr. Then kerB contains a polynomial vector with degree at most nr<n; use Cramer’s formula to see this. Necessarily, kersA1+A2 also has such a polynomial vector. This is a contradiction to Theorem 4.30 (ϵ1=ϵc=n). The case sA1+A2=Ln is similar.

Consider a general case of b2. We can choose γ*,α* such that 0<γ*<b, 0<α*<θ, and the line segment between ϕ(Xγ*,Yγ*) and ϕ(Xα*,Yα*) meets with KA only at ϕ(Xγ*,Yγ*). Consider (U,V)(Xγ*+Xα*,Yγ*Yα*) and (U,V)(Xγ*Xα*,Yγ*+Yα*). By the construction and (4.24), one of ϕ(U,V) and ϕ(U,V) is outside of KA. Suppose that ϕ(U,V)KA. Consider the submatrix A(sA1+A2)[γ=1γ*Iγ,γ=1γ*Jγ], which is also a Kronecker form with a smaller number of blocks. From UXγ*, VYγ*, and ϕ(U,V)KA, it necessarily holds that KAConvϕ(SA). However, this is a contradiction to the inductive assumption. The case ϕ(U,V)KA is similar; consider the sub-Kronecker form (sA1+A2)[γ=γ*+1bIγ,γ=γ*+1bJγ].

Part (2). Observe that all integer points in the maximal faces of Convϕ(SA) are obtained by the images of (Xγ,Yγ) and (Xc,β,Yc,β). This implies that {(Xγ,Yγ)}γ{(Xc,β,Yc,β)}β is a maximal chain of LA. □

The matrix pencil g(sA1+A2)h corresponding to a coarse DM-decomposition g(A1,A2)h, which we call a coarse Kronecker triangular form, is a refinement of a quasi-Kronecker triangular form in Berger and Trenn [7] and generalized Schur form in Demmel and Kåragström [16] and Van Dooren [59] if g, h are unitary matrices and sC+D is triangular.

Then, the convergence (Theorem 4.28) of Gradient Descent (4.22) can be applied as follows:

Theorem 4.33

(Convergence to a Coarse Kronecker Triangular Form). Let (xk,yk) be a solution of (4.22). Decompose xk=σkediagakσk and yk=τkediagbkτk, where σk and τk are unitary matrices, and ak and bk are nondecreasing and nonincreasing vectors, respectively. Then, σk(sA1+A2)τk accumulates to coarse Kronecker triangular forms, where the convergence is linear in the same sense as in Theorem 4.28.

A coarse Kronecker triangular form is enough for determining the structure of the Kronecker form. Indeed, each (nonsquare) rectangular diagonal block is a kν×k(ν+1) or k(ν+1)×kν matrix for some integers k,ν, from which all minimal indices ϵ1,ϵ2,,ϵc, η1,η2,,ηd can be identified.

The above theorem suggests an iterative method for determining the minimal indices of a singular pencil, which is based on simple gradient descent and is conceptually different from the existing algorithms, for example, Demmel and Kåragström [16] and Van Dooren [59]. It is an interesting future direction to develop a numerically stable algorithm based on this approach.

Acknowledgments

The authors thank Shin-ichi Ohta, Harold Nieuwboer, and Michael Walter for discussion; Yuni Iwamasa, Taihei Oki, and Tasuku Soma for comments; and Shun Sato for suggesting Sanz Serna and Zygalakis [57]. The authors also thank the referees for numerous helpful comments. The conference version appeared as Hirai and Sakabe [33].

Endnotes

1 Proof sketch: Let α(t)expx0tu and β(t)expy0tv, and define utSx0 by expx0d(x0,β(t))ut=β(t). By convexity of f along the geodesic between x0 and β(t), it holds thatf(expx0sut)f(x0)(s/d(x0,β(t)))(f(β(t))f(x0)) for s[0,d(x0,β(t))]. By the triangle inequality, we have (f(β(t))f(x0))/d(x0,β(t))maxσ{1,1}(f(β(t))f(x0))/(t+σd(x0,y0))fy0(v) for t. By the CAT(0)-inequality on the geodesic triangle of vertices x0,α(t),β(t) and by d(α(t),β(t)) being bounded, it holds that expx0sutα(s) for t. Thus, we have fy0(v)(f(α(s))f(x0))/ssfx0(u). By symmetry, it holds that fx0(u)fy0(v), and hence, fx0(u)=fy0(v).

2 If f(ξ)=f(ξ)=c<0, then by convexity, it holds that f(m)(f(ξ)+f(ξ))/2=c for the midpoint m of ξ and ξ in CM, and by m<1 it holds that f(m/m)=f(m)/m<c.

3 It is found in Ambrosio et al. [3, theorem 4.0.4] for the general setting of gradient flows in metric spaces. For our manifold case, it is an easy consequence of the first variation formula (Sakai [56, proposition 2.2]) as follows: (d/dt)d(ϕt(x),ϕt(y))2/2=f(ϕt(y)),γ˙(1) f(ϕt(x)),γ˙(0)=01(d/dt)f(γ(s)),γ˙(s)ds=012f(γ(s))γ˙(s),γ˙(s)ds0, where γ:[0,1]M is a geodesic from ϕt(x) to ϕt(y).

4 The formal definition of the moment map is given by [v]iμ([v])k (Georgoulas et al. [24, lemma 8.2]).

5 In the earlier versions of this paper, the convergence of x(t)1/2·[v] was stated but the proof was false.

6 The classical DM-decomposition restricts SA to coordinate subspaces and LA to the sublattice of the coordinate subspaces X, Y maximizing dimX+dimY , where g, h are chosen as permutation matrices. In this setting, a block-triangular form obtained by using the maximal chain of the entire LA was considered by N. Tomizawa (unpublished) in the development of principal partitions in the 1970’s; see Hayashi et al. [29, section 3]. For this reason, our decomposition may be more precisely called a DMT-decomposition.

References

  • [1] Allen-Zhu Z, Garg A, Li Y, Oliveira R, Wigderson A (2018) Operator scaling via geodesically convex optimization, invariant theory and polynomial identity testing. Diakonikolas I, Kempe D, eds. Proc. 50th Annual ACM SIGACT Sympos. Theory Comput. STOC 2018 (Association for Computing Machinery, New York), 172–181.Google Scholar
  • [2] Alvarez F, Bolte J, Brahic O (2004) Hessian Riemannian gradient flows in convex programming. SIAM J. Control Optim. 43(2):477–501.CrossrefGoogle Scholar
  • [3] Ambrosio L, Gigli N, Savaré G (2008) Gradient Flows: In Metric Spaces and in the Space of Probability Measures, 2nd ed. (Birkhäuser, Basel, Switzerland).Google Scholar
  • [4] Auslender A (1997) How to deal with the unbounded in optimization: Theory and algorithms. Math. Programming 79(1–3):3–18.CrossrefGoogle Scholar
  • [5] Bačák M (2014) Convex Analysis and Optimization in Hadamard Spaces (De Gruyter, Berlin).CrossrefGoogle Scholar
  • [6] Beck A (2017) First-Order Methods in Optimization (Society for Industrial and Applied Mathematics, Philadelphia).CrossrefGoogle Scholar
  • [7] Berger T, Trenn S (2012) The quasi-Kronecker form for matrix pencils. SIAM J. Matrix Anal. Appl. 33(2):336–368.CrossrefGoogle Scholar
  • [8] Boumal N (2023) An Introduction to Optimization on Smooth Manifolds (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [9] Bridson MR, Haefliger A (1999) Metric Spaces of Non-Positive Curvature (Springer-Verlag, Berlin).CrossrefGoogle Scholar
  • [10] Bubeck S (2015) Convex optimization: Algorithms and complexity. Foundations Trends Machine Learn. 8(3–4):231–357.CrossrefGoogle Scholar
  • [11] Bürgisser P, Li Y, Nieuwboer H, Walter M (2020) Interior-point methods for unconstrained geometric programming and scaling problems. Preprint, submitted August 27, https://arxiv.org/abs/2008.12110.Google Scholar
  • [12] Bürgisser P, Franks C, Garg A, Oliveira R, Walter M, Wigderson A (2018) Efficient algorithms for tensor scaling, quantum marginals, and moment polytopes. Thorup M, ed. Proc. 59th IEEE Annual Sympos. Foundations Comput. Sci. FOCS 2018 (IEEE, New York), 883–897.Google Scholar
  • [13] Bürgisser P, Franks C, Garg A, Oliveira R, Walter M, Wigderson A (2019) Towards a theory of non-commutative optimization: Geodesic 1st and 2nd order methods for moment maps and polytopes. Proc. 60th IEEE Annual Sympos. Foundations Comput. Sci. FOCS 2019 (IEEE, New York), 845–861.Google Scholar
  • [14] Caprace P-E, Lytchak A (2010) At infinity of finite-dimensional CAT(0) spaces. Math. Ann. 346(1):1–21.CrossrefGoogle Scholar
  • [15] Chen X, Sun S (2014) Calabi flow, geodesic rays, and uniqueness of constant scalar curvature Kähler metrics. Ann. Math. 180(2):407–454.CrossrefGoogle Scholar
  • [16] Demmel J, Kåragström B (1993) The generalized Schur decomposition of an arbitrary pencil AλB—Robust software with error bounds and applications. I. Theory and algorithms. ACM Trans. Math. Software 19(2):160–174.CrossrefGoogle Scholar
  • [17] Dulmage AL, Mendelsohn NS (1958) Coverings of bipartite graphs. Canadian J. Math. 10:517–534.CrossrefGoogle Scholar
  • [18] Franks C (2018) Operator scaling with specified marginals. Diakonikolas I, Kempe D, eds. Proc. 50th Annual ACM SIGACT Sympos. Theory Comput. (STOC 2018) (Association for Computing Machinery, New York). 190–203.Google Scholar
  • [19] Franks C, Soma T, Goemans MX (2023) Shrunk subspaces via operator Sinkhorn iteration. Bansal N, Nagarajan V, eds. Proc. 2023 Annual ACM-SIAM Sympos. Discrete Algorithms (SODA) (SIAM, Philadelphia), 1655–1668.Google Scholar
  • [20] Gantmacher FR (1959) The Theory of Matrices, vol. 1–2 (Chelsea Publishing Co., New York).Google Scholar
  • [21] Garg A, Oliveira R (2018) Recent progress on scaling algorithms and applications. Bull. Eur. Assoc. Theoret. Comput. Sci. (125):14–49.Google Scholar
  • [22] Garg A, Gurvits L, Oliveira R, Wigderson A (2018) Algorithmic and optimization aspects of Brascamp-Lieb inequalities, via operator scaling. Geometric Funct. Anal. 28(1):100–145.CrossrefGoogle Scholar
  • [23] Garg A, Gurvits L, Oliveira R, Wigderson A (2020) Operator scaling: Theory and applications. Foundations Comput. Math. 20(2):223–290.CrossrefGoogle Scholar
  • [24] Georgoulas V, Robbin JW, Salamon DA (2021) The Moment-Weight Inequality and the Hilbert-Mumford Criterion—GIT from the Differential Geometric Viewpoint, Lecture Notes in Mathematics, vol. 2297 (Springer, Cham, Switzerland).Google Scholar
  • [25] Guillemin V, Sternberg S (1982) Convexity properties of the moment mapping. Inventiones Math. 67(3):491–513.CrossrefGoogle Scholar
  • [26] Guillemin V, Sternberg S (1984) Convexity properties of the moment mapping. II. Inventiones Math. 77(3):533–546.CrossrefGoogle Scholar
  • [27] Gurvits L (2004) Classical complexity and quantum entanglement. J. Comput. System Sci. 69(3):448–484.CrossrefGoogle Scholar
  • [28] Hamada M, Hirai H (2021) Computing the nc-rank via discrete convex optimization on CAT(0) spaces. SIAM J. Appl. Algebra Geometry 5(3):455–478.CrossrefGoogle Scholar
  • [29] Hayashi K, Hirai H, Sakabe K (2023) Finding Hall blockers by matrix scaling. Math. Oper. Res. 49(4):2166–2179.LinkGoogle Scholar
  • [30] Hirai H (2024) Convex analysis on Hadamard spaces and scaling problems. Foundations Comput. Math. 24(6):1979–2016.CrossrefGoogle Scholar
  • [31] Hirai H (2025) Generalized gradient flows in Hadamard manifolds and convex optimization on entanglement polytopes. Preprint, submitted November 15, https://arxiv.org/abs/2511.12064v1.Google Scholar
  • [32] Hirai H (2026) A scaling characterization of nc-rank via unbounded gradient flow. Linear Algebra Appl. 730:525–545.CrossrefGoogle Scholar
  • [33] Hirai H, Sakabe K (2024) Gradient descent for unbounded convex functions on Hadamard manifolds and its applications to scaling problems. Proc. 65th IEEE Sympos. Foundations Comput. Sci. (FOCS 2024) (IEEE, New York), 2387–2402.Google Scholar
  • [34] Hirai H, Nieuwboer H, Walter M (2023) Interior-point methods on manifolds: Theory and applications. Proc. 64th IEEE Sympos. Foundations Comput. Sci. (FOCS 2023) (IEEE, New York), 2021–2030.Google Scholar
  • [35] Hiriart-Urruty J-B, Lemaréchal C (2001) Fundamentals of Convex Analysis (Springer-Verlag, Berlin).CrossrefGoogle Scholar
  • [36] Ito H, Iwata S, Murota K (1994) Block-triangularizations of partitioned matrices under similarity/equivalence transformations. SIAM J. Matrix Anal. Appl. 15(4):1226–1255.CrossrefGoogle Scholar
  • [37] Ivanyos G, Qiao Y, Subrahmanyam KV (2017) Non-commutative Edmonds’ problem and matrix semi-invariants. Comput. Complexity 26(3):717–763.CrossrefGoogle Scholar
  • [38] Ivanyos G, Qiao Y, Subrahmanyam KV (2018) Constructive non-commutative rank computation is in deterministic polynomial time. Comput. Complexity 27(4):561–593.CrossrefGoogle Scholar
  • [39] Iwamasa Y, Oki T, Soma T (2025) Algorithmic aspects of semistability of quiver representations. Censor-Hillel K, Grandoni F, Ouaknine J, Puppis G, eds. 52nd Internat. Colloquium Automata Languages Programming (ICALP 2025) (Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Wadern, Germany), 99:1–99:18.Google Scholar
  • [40] Kapovich M, Leeb B, Millson J (2009) Convex functions on symmetric spaces, side lengths of polygons and the stability inequalities for weighted configurations at infinity. J. Differential Geometry 81(2):297–354.CrossrefGoogle Scholar
  • [41] Karlsson A, Margulis GA (1999) A multiplicative ergodic theorem and nonpositively curved spaces. Comm. Math. Phys. 208(1):107–123.CrossrefGoogle Scholar
  • [42] Kempf GR (1978) Instability in invariant theory. Ann. Math. 108(2):299–316.CrossrefGoogle Scholar
  • [43] Kirwan F (1984) Convexity properties of the moment mapping. III. Inventiones Math. 77(3):547–552.CrossrefGoogle Scholar
  • [44] Kleiner B, Leeb B (2006) Rigidity of invariant convex sets in symmetric spaces. Inventiones Math. 163(3):657–676.CrossrefGoogle Scholar
  • [45] Kwok TC, Lau LC, Ramachandran A (2021) Spectral analysis of matrix scaling and operator scaling. SIAM J. Comput. 50(3):1034–1102.CrossrefGoogle Scholar
  • [46] Lu H, Freund RM, Nesterov Y (2018) Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1):333–354.CrossrefGoogle Scholar
  • [47] Mayer UF (1998) Gradient flows on nonpositively curved metric spaces and harmonic maps. Comm. Anal. Geometry 6(2):199–253.CrossrefGoogle Scholar
  • [48] Murota K (2000) Matrices and Matroids for Systems Analysis (Springer-Verlag, Berlin).Google Scholar
  • [49] Nemirovsky AS, Yudin DB (1983) Problem Complexity and Method Efficiency in Optimization (John Wiley & Sons, Inc., New York).Google Scholar
  • [50] Obuchowska WT (2004) On the minimizing trajectory of convex functions with unbounded level sets. Comput. Optim. Appl. 27(1):37–52.CrossrefGoogle Scholar
  • [51] Ohta S (2025) Discrete-time gradient flows for unbounded convex functions on Gromov hyperbolic spaces. Comm. Contemporary Math., ePub ahead of print December 11, https://doi.org/10.1142/S0219199726500033.CrossrefGoogle Scholar
  • [52] Ohta S, Pálfia M (2015) Discrete-time gradient flows and law of large numbers in Alexandrov spaces. Calculus Variations Partial Differential Equations 54(2):1591–1610.CrossrefGoogle Scholar
  • [53] Rockafellar RT (1970) Convex Analysis (Princeton University Press, Princeton, NJ).CrossrefGoogle Scholar
  • [54] Sakabe K (2026) Nesterov’s accelerated gradient for unbounded convex functions finds the minimum-norm point in the dual space. Preprint, submitted February 9, https://arxiv.org/abs/2602.08618.Google Scholar
  • [55] Sakabe K, Doğan ML, Walter M (2026) Strassen’s support functionals coincide with the quantum functionals. Preprint, submitted January 29, https://arxiv.org/abs/2601.21553.Google Scholar
  • [56] Sakai T (1996) Riemannian Geometry (American Mathematical Society, Providence, RI).CrossrefGoogle Scholar
  • [57] Sanz Serna JM, Zygalakis KC (2020) Contractivity of Runge-Kutta methods for convex gradient systems. SIAM J. Numer. Anal. 58(4):2079–2092.CrossrefGoogle Scholar
  • [58] Sinkhorn R (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist. 35(2):876–879.CrossrefGoogle Scholar
  • [59] Van Dooren P (1979) The computation of Kronecker’s canonical form of a singular pencil. Linear Algebra Appl. 27:103–140.CrossrefGoogle Scholar
  • [60] Vishnoi NK (2021) Algorithms for Convex Optimization (Cambridge University Press, Cambridge, UK).CrossrefGoogle Scholar
  • [61] Wallach NR (2017) Geometric Invariant Theory (Springer, Cham, Switzerland).CrossrefGoogle Scholar
  • [62] Woodward CT (2011) Moment maps and geometric invariant theory. Preprint, submitted June 29, https://arxiv.org/abs/0912.1132.Google Scholar