Bilevel Gradient Methods and the Morse Parametric Qualification Condition

Published Online:https://doi.org/10.1287/moor.2025.0914

References

  • [1] Allgower EL, Georg K (1997) Numerical path following. Ciarlet PG, Lions JL, eds. Handbook of Numerical Analysis, vol. 5 (Elsevier, Amsterdam), 3–207.Google Scholar
  • [2] Antoniou A, Edwards H, Storkey A (2019) How to train your MAML. 7th Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
  • [3] Arbel M, Mairal J (2022) Non-convex bilevel games with critical point selection maps. Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, eds. Annual Conf. Neural Inform. Processing Systems, NeurIPS 2022, Advances in Neural Information Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 1–34.Google Scholar
  • [4] Attouch H, Bolte J, Svaiter BF (2013) Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Programming 137:91–129.CrossrefGoogle Scholar
  • [5] Attouch H, Bolte J, Redont P, Soubeyran A (2010) Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res. 35(2):438–457.LinkGoogle Scholar
  • [6] Bai S, Kolter JZ, Koltun V (2019) Deep equilibrium models. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY).Google Scholar
  • [7] Bolte J, Hochart A, Pauwels E (2018) Qualification conditions in semialgebraic programming. SIAM J. Optim. 28(2):1867–1891.CrossrefGoogle Scholar
  • [8] Bolte J, Le Q-T, Pauwels E (2025) Convergence of optimizers implies eigenvalues filtering at equilibrium. Preprint, submitted October 10, https://arxiv.org/abs/2510.09034.Google Scholar
  • [9] Bolte J, Le T, Moulines É, Pauwels E (2025) Inexact subgradient methods for semialgebraic functions. Math. Programming, Series A.Google Scholar
  • [10] Bolte J, Le Q-T, Pauwels E, Vaiter S (2026) Geometric and computational hardness of bilevel programming. Math. Programming 215(1):539–574.Google Scholar
  • [11] Bouza Allende G, Still G (2012) Solving bilevel programs with the KKT-approach. Math. Programming 138:309–332.CrossrefGoogle Scholar
  • [12] Chen T, Sun Y, Yin W (2021) Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Beygelzimer A, Dauphin Y, Liang P, Wortman Vaughan J, eds. Annual Conf. Neural Inform. Processing Systems, NeurIPS 2021, Advances in Neural Information Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 25294–25307.Google Scholar
  • [13] Chen L, Xu J, Zhang J (2024) On finding small hyper-gradients in bilevel optimization: Hardness results and improved analysis. Thirty Seventh Annual Conf. Learn. Theory (PMLR, New York), 947–980.Google Scholar
  • [14] Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Ann. Oper. Res. 153:235–256.CrossrefGoogle Scholar
  • [15] Coste M (2000) An Introduction to o-Minimal Geometry (Istituti Editoriali e Poligrafici Internazionali, Pisa, Italy).Google Scholar
  • [16] Coste M (2000) An Introduction to Semialgebraic Geometry (Istituti Editoriali e Poligrafici Internazionali, Pisa, Italy).Google Scholar
  • [17] Dagréou M, Ablin P, Vaiter S, Moreau T (2022) A framework for bilevel optimization that enables stochastic and global variance reduction algorithms. Annual Conf. Neural Inform. Processing Systems, NeurIPS 2021, Advances in Neural Information Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY).Google Scholar
  • [18] Dagréou M, Moreau T, Vaiter S, Ablin P (2024) A lower bound and a near-optimal algorithm for bilevel empirical risk minimization. Dasgupta S, Mandt S, Li Y, eds. Internat. Conf. Artificial Intelligence Statist. (AISTATS), Proceedings of Machine Learning Research (PMLR, New York), 82–90.Google Scholar
  • [19] Dempe S (2002) Foundations of Bilevel Programming: Noncex Optimization and Its Applications (Springer US, New York).Google Scholar
  • [20] Dempe S, Zemkoho A, eds. (2020) Bilevel Optimization: Advances and Next Challenges, Springer Optimization and Its Applications, vol. 161 (Springer, Cham, Switzerland).CrossrefGoogle Scholar
  • [21] Dempe S, Mordukhovich BS, Zemkoho AB (2012) Sensitivity analysis for two-level value functions with applications to bilevel programming. SIAM J. Optim. 22(4):1309–1343.CrossrefGoogle Scholar
  • [22] Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. Proc. 34th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 70 (PMLR, New York), 1126–1135.Google Scholar
  • [23] Franceschi L, Frasconi P, Salzo S, Grazzi R, Pontil M (2018) Bilevel programming for hyperparameter optimization and meta-learning. Proc. 35th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research (PMLR, New York), 1563–1572.Google Scholar
  • [24] Ghadimi S, Wang M (2018) Approximation methods for bilevel programming. Preprint, submitted February 6, https://arxiv.org/abs/1802.02246.Google Scholar
  • [25] Gilbert JC (1992) Automatic differentiation and iterative processes. Optim. Methods Software 1(1):13–21.CrossrefGoogle Scholar
  • [26] Goudou X, Munier J (2009) The gradient and heavy ball with friction dynamical systems: The quasiconvex case. Math. Programming 116(1–2):173–191.CrossrefGoogle Scholar
  • [27] Grazzi R, Franceschi L, Pontil M, Salzo S (2020) On the iteration complexity of hypergradient computation. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research (PMLR, New York), 3748–3758.Google Scholar
  • [28] Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (SIAM, Philadelphia).CrossrefGoogle Scholar
  • [29] Guillemin V, Pollack A (2010) Differential Topology (AMS Chelsea Publishing, Providence, RI).CrossrefGoogle Scholar
  • [30] He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. 2015 IEEE Internat. Conf. Comput. Vision (ICCV) (IEEE Computer Society, Washington, DC), 1026–1034.Google Scholar
  • [31] Henrion R, Surowiec T (2011) On calmness conditions in convex bilevel programming. Applicable Anal. 90:951–970.CrossrefGoogle Scholar
  • [32] Ji K, Liang Y (2023) Lower bounds and accelerated algorithms for bilevel optimization. J. Machine Learn. Res. 24(22):1–56.Google Scholar
  • [33] Ji K, Yang J, Liang Y (2021) Bilevel optimization: Convergence analysis and enhanced design. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research Series, vol. 139 (PMLR, New York), 4882–4892.Google Scholar
  • [34] Ji K, Yang J, Liang Y (2022) Theoretical convergence of multi-step model-agnostic meta-learning. J. Machine Learn. Res. 23(29):1–41.Google Scholar
  • [35] Ji K, Lee JD, Liang Y, Poor HV (2020) Convergence of meta-learning with task-specific adaptation over partial parameters. Adv. Neural Inform. Processing Systems 33:11490–11500.Google Scholar
  • [36] Ji K, Liu M, Liang Y, Ying L (2022) Will bilevel optimizers benefit from loops. Adv. Neural Inform. Processing Systems 35:3011–3023.CrossrefGoogle Scholar
  • [37] Kolstad CD, Lasdon LS (1990) Derivative evaluation and computational experience with large bilevel mathematical programs. J. Optim. Theory Appl. 65:485–499.CrossrefGoogle Scholar
  • [38] Kwon J, Kwon D, Wright S, Nowak R (2023) On penalty methods for nonconvex bilevel optimization and first-order stochastic approximation. Preprint, submitted September 4, https://arxiv.org/abs/2309.01753.Google Scholar
  • [39] Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers. Feldman V, Rakhlin A, Shamir O, eds. 29th Annual Conf. Learn. Theory, Proceedings of Machine Learning Research, vol. 49 (PMLR, New York), 1246–1257.Google Scholar
  • [40] Lin G-H, Xu M, Ye JJ (2014) On solving simple bilevel programs with a nonconvex lower level program. Math. Programming 144:277–305.CrossrefGoogle Scholar
  • [41] Liu H, Simonyan K, Yang Y (2019) DARTS: Differentiable architecture search. Internat. Conf. Learn. Representations (ICLR) (OpenReview.net).Google Scholar
  • [42] Liu R, Liu Y, Zeng S, Zhang J (2021) Towards gradient-based bilevel optimization with non-convex followers and beyond. Beygelzimer A, Dauphin Y, Liang P, Wortman Vaughan J, eds. Annual Conf. Neural Inform. Processing Systems, NeurIPS 2021, Advances in Neural Information Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 8662–8675.Google Scholar
  • [43] Liu R, Liu X, Yuan X, Zeng S, Zhang J (2021) A value-function-based interior-point method for non-convex bi-level optimization. Meila M, Zhang T, eds. Proc. 38th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research Series, vol. 139 (PMLR, New York), 6882–6892.Google Scholar
  • [44] Liu R, Mu P, Yuan X, Zeng S, Zhang J (2020) A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. Proc. 37th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research (PMLR, New York), 6305–6315.Google Scholar
  • [45] Lorraine J, Vicol P, Duvenaud D (2020) Optimizing millions of hyperparameters by implicit differentiation. 23rd Internat. Conf. Artificial Intelligence Statist., Proceedings of Machine Learning Research (PMLR, New York), 1540–1552.Google Scholar
  • [46] Maclaurin D, Duvenaud D, Adams R (2015) Gradient-based hyperparameter optimization through reversible learning. Proc. 32nd Internat. Conf. Machine Learn. (JMLR.org), 2113–2122.Google Scholar
  • [47] Mehmood S, Ochs P (2020) Automatic differentiation of some first-order methods in parametric optimization. 23rd Internat. Conf. Artificial Intelligence and Statistics, Proceedings of Machine Learning Research (PMLR, New York), 1584–1594.Google Scholar
  • [48] Merchav R, Sabach S, Teboulle M (2024) Dynamic fista for convex composite bi-level optimization. Preprint, submitted July 30, https://arxiv.org/abs/2407.21221.Google Scholar
  • [49] Mordukhovich BS (2020) Bilevel optimization and variational analysis. Dempe S, Zemkoho A, eds. Bilevel Optimization: Advances and Next Challenges (Springer International Publishing, Cham, Switzerland), 197–226.CrossrefGoogle Scholar
  • [50] Panageas I, Piliouras G (2017) Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. 8th Innovations Theoret. Comput. Sci. Conf. (ITCS 2017) (Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Wadern, Germany).Google Scholar
  • [51] Pang J-S, Han S-P, Rangaraj N (1991) Minimization of locally Lipschitzian functions. SIAM J. Optim. 1(1):57–82.CrossrefGoogle Scholar
  • [52] Pedregosa F (2016) Hyperparameter optimization with approximate gradient. Balcan MF, Weinberger KQ, eds. Proc. 33rd Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 48 (PMLR, New York), 737–746.Google Scholar
  • [53] Pemantle R (1990) Nonconvergence to unstable points in urn models and stochastic approximations. Ann. Probab. 18(2):698–712.CrossrefGoogle Scholar
  • [54] Rajeswaran A, Finn C, Kakade SM, Levine S (2019) Meta-learning with implicit gradients. Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates, Inc., Red Hook, NY).Google Scholar
  • [55] Rommel C, Moreau T, Paillard J, Gramfort A (2022) CADDA: Class-wise automatic differentiable data augmentation for EEG signals. Internat. Conf. Learn. Representations (OpenReview.net).Google Scholar
  • [56] Savard G, Gauvin J (1994) The steepest descent direction for the nonlinear bilevel programming problem. Oper. Res. Lett. 15(5):265–272.CrossrefGoogle Scholar
  • [57] Shen H, Chen T (2023) On penalty-based bilevel gradient descent method. Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J, eds. Proc. 40th Internat. Conf. Machine Learn., Proceedings of Machine Learning Research, vol. 202 (PMLR, New York), 30992–31015.Google Scholar
  • [58] Shub M, Fathi A, Langevin R (1987) Global Stability of Dynamical Systems (Springer, New York).CrossrefGoogle Scholar
  • [59] Tadić VB, Doucet A (2011) Asymptotic bias of stochastic gradient search. 2011 50th IEEE Conf. Decision Control Eur. Control Conf. (IEEE, Piscataway, NJ), 722–727.Google Scholar
  • [60] van den Dries L, Miller C (1996) Geometric categories and o-minimal structures. Duke Math. J. 84(2):497–540.CrossrefGoogle Scholar
  • [61] Vicente LN, Calamai PH (1994) Bilevel and multilevel programming: A bibliography review. J. Global Optim. 5(3):291–306.CrossrefGoogle Scholar
  • [62] Ye JJ, Zhu DL (1995) Optimality conditions for bilevel programming problems. Optimization 33(1):9–27.CrossrefGoogle Scholar
INFORMS site uses cookies to store information on your computer. Some are essential to make our site work; Others help us improve the user experience. By using this site, you consent to the placement of these cookies. Please read our Privacy Statement to learn more.