Solving Bilevel Optimization via Sequential Minimax Optimization

Zhaosong Lu
Zhaosong Lu
[email protected]
https://orcid.org/0000-0003-3277-7853
Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, Minnesota 55455
Search for more papers by this author
,
Sanyou Mei
Corresponding Author
Sanyou Mei
[email protected]
https://orcid.org/0009-0001-9028-7847
Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong, China
Search for more papers by this author

Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, Minnesota 55455

Corresponding Author

Sanyou Mei

Department of Industrial Engineering and Decision Analytics, Hong Kong University of Science and Technology, Hong Kong, China

Search for more papers by this author

Published Online:6 Jan 2026https://doi.org/10.1287/moor.2024.0521

References

[1] Allende GB, Still G (2013) Solving bilevel programs with the KKT-approach. Math. Programming 138(1):309–332.Crossref, Google Scholar
[2] Bard JF (2013) Practical Bilevel Optimization: Algorithms and Applications, vol. 30 (Springer Science & Business Media, New York).Google Scholar
[3] Bennett KP, Kunapuli G, Hu J, Pang J-S (2008) Bilevel optimization and machine learning. IEEE World Congress Comput. Intelligence (Springer, Berlin), 25–47.Google Scholar
[4] Bertinetto L, Henriques JF, Torr P, Vedaldi A (2018) Meta-learning with differentiable closed-form solvers. Internat. Conf. Learning Representations (OpenReview, Alameda, CA).Google Scholar
[5] Chang C-C, Lin C-J (2011) LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems Tech. 2(3):1–27.Crossref, Google Scholar
[6] Chen L, Xu J, Zhang J (2023) Bilevel optimization without lower-level strong convexity from the hyper-objective perspective. Preprint, submitted January 2, https://arxiv.org/abs/2301.00712.Google Scholar
[7] Chen T, Sun Y, Xiao Q, Yin W (2022) A single-timescale method for stochastic bilevel optimization. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 2466–2488.Google Scholar
[8] Clarke FH (1990) Optimization and Nonsmooth Analysis (SIAM, Philadelphia).Crossref, Google Scholar
[9] Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Ann. Oper. Res. 153(1):235–256.Crossref, Google Scholar
[10] Crockett C, Fessler JA (2022) Bilevel methods for image reconstruction. Foundations Trends Signal Processing 15(2–3):121–289.Crossref, Google Scholar
[11] Dai Y-H, Zhang L (2020) Optimality conditions for constrained minimax optimization. Preprint, submitted April 21, https://arxiv.org/abs/2004.09730.Google Scholar
[12] Dai Y-H, Wang J, Zhang L (2024) Optimality conditions and numerical algorithms for a class of linearly constrained minimax optimization problems. SIAM J. Optim. 34(3):2883–2916.Crossref, Google Scholar
[13] Dempe S (2002) Foundations of Bilevel Programming (Springer Science & Business Media, New York).Google Scholar
[14] Dempe S, Zemkoho A (2013) The bilevel programming problem: Reformulations, constraint qualifications and optimality conditions. Math. Programming 138(1):447–473.Crossref, Google Scholar
[15] Dempe S, Zemkoho A (2020) Bilevel optimization. Springer Optimization and Its Applications, vol. 161 (Springer, Berlin).Google Scholar
[16] Dempe S, Kalashnikov V, Pérez-Valdés GA, Kalashnykova N (2015) Bilevel programming problems. Energy Systems, vol. 10 (Springer, Berlin).Google Scholar
[17] Feurer M, Hutter F (2019) Hyperparameter optimization. Automated Machine Learning (Springer, Cham, Switzerland), 3–33.Crossref, Google Scholar
[18] Franceschi L, Donini M, Frasconi P, Pontil M (2017) Forward and reverse gradient-based hyperparameter optimization. Internat. Conf. Machine Learn. (PMLR, New York), 1165–1173.Google Scholar
[19] Franceschi L, Frasconi P, Salzo S, Grazzi R, Pontil M (2018) Bilevel programming for hyperparameter optimization and meta-learning. Internat. Conf. Machine Learn. (PMLR, New York), 1568–1577.Google Scholar
[20] Grant M, Boyd S (2014) CVX: Matlab software for disciplined convex programming, version 2.1, https://cvxr.com/cvx/citing/.Google Scholar
[21] Grazzi R, Franceschi L, Pontil M, Salzo S (2020) On the iteration complexity of hypergradient computation. Internat. Conf. Machine Learn. (PMLR, New York), 3748–3758.Google Scholar
[22] Guo Z, Hu Q, Zhang L, Yang T (2021) Randomized stochastic variance-reduced methods for multi-task stochastic bilevel optimization. Preprint, submitted May 5, https://arxiv.org/abs/2105.02266.Google Scholar
[23] Hansen P, Jaumard B, Savard G (1992) New branch-and-bound rules for linear bilevel programming. SIAM J. Sci. Statist. Comput. 13(5):1194–1217.Crossref, Google Scholar
[24] Hastie T, Tibshirani R, Friedman JH, Friedman JH (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2 (Springer, Berlin).Crossref, Google Scholar
[25] Hong M, Wai H-T, Wang Z, Yang Z (2023) A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM J. Optim. 33(1):147–180.Crossref, Google Scholar
[26] Hu X, Xiao N, Liu X, Toh K-C (2023) An improved unconstrained approach for bilevel optimization. SIAM J. Optim. 33(4):2801–2829.Crossref, Google Scholar
[27] Huang F, Li J, Gao S (2021) BiAdam: Fast adaptive bilevel optimization methods. Preprint, submitted June 21, https://arxiv.org/abs/2106.11396.Google Scholar
[28] Huang M, Chen X, Ji K, Ma S, Lai L (2025) Efficiently escaping saddle points in bilevel optimization. J. Machine Learn. Res. 26(1):1–61.Google Scholar
[29] Ishizuka Y, Aiyoshi E (1992) Double penalty method for bilevel optimization problems. Ann. Oper. Res. 34(1):73–88.Crossref, Google Scholar
[30] Ji K, Yang J, Liang Y (2021) Bilevel optimization: Convergence analysis and enhanced design. Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 4882–4892.Google Scholar
[31] Ji K, Lee JD, Liang Y, Poor HV (2020) Convergence of meta-learning with task-specific adaptation over partial parameters. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 11490–11500.Google Scholar
[32] Khanduri P, Zeng S, Hong M, Wai H-T, Wang Z, Yang Z (2021) A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 30271–30283.Google Scholar
[33] Konda V, Tsitsiklis J (1999) Actor-critic algorithms. Adv. Neural Inform. Processing Systems, vol. 12 (MIT Press, Cambridge, MA), 1008–1014.Google Scholar
[34] Kong W, Monteiro RD (2021) An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. SIAM J. Optim. 31(4):2558–2585.Crossref, Google Scholar
[35] Kovalev D, Gasnikov A (2022) The first optimal algorithm for smooth and strongly-convex-strongly-concave minimax optimization. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 14691–14703.Google Scholar
[36] Kwon J, Kwon D, Wright S, Nowak RD (2023) A fully first-order method for stochastic bilevel optimization. Internat. Conf. Machine Learn. (PMLR, New York), 18083–18113.Google Scholar
[37] Li J, Gu B, Huang H (2022) A fully single loop algorithm for bilevel optimization without hessian inverse. Proc. AAAI Conf. Artificial Intelligence 36(7):7426–7434.Google Scholar
[38] Li Y, Lin G-H, Zhang J, Zhu X (2023) A novel approach for bilevel programs based on Wolfe duality. Preprint, submitted February 14, https://arxiv.org/abs/2302.06838.Google Scholar
[39] Lin Q, Lu Z, Xiao L (2015) An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4):2244–2273.Crossref, Google Scholar
[40] Liu H, Simonyan K, Yang Y (2018) Darts: Differentiable architecture search. Internat. Conf. Learn. Representations, vol. 176 (Elsevier, Amsterdam), 42–48.Google Scholar
[41] Liu R, Gao J, Zhang J, Meng D, Lin Z (2021) Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. IEEE Trans. Pattern Anal. Machine Intelligence 44(12):10045–10067.Crossref, Google Scholar
[42] Liu B, Ye M, Wright S, Stone P, Liu Q (2022) Bome! Bilevel optimization made easy: A simple first-order approach. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 17248–17262.Google Scholar
[43] Lopez-Paz D, Ranzato M (2017) Gradient episodic memory for continual learning. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Curran Associates Inc., Red Hook, NY), 6470–6479.Google Scholar
[44] Lu Z, Mei S (2024) A first-order method for nonconvex-strongly-concave constrained minimax optimization. Preprint, submitted January 5, 2023, https://arxiv.org/abs/2301.02060.Google Scholar
[45] Lu Z, Mei S (2024) First-order penalty methods for bilevel optimization. SIAM J. Optim. 34(2):1937–1969.Crossref, Google Scholar
[46] Lu Z, Mei S (2025) A first-order augmented Lagrangian method for constrained minimax optimization. Math. Programming 213:1063–1104.Crossref, Google Scholar
[47] Lu Z, Zhou Z (2023) Iteration-complexity of first-order augmented Lagrangian methods for convex conic programming. SIAM J. Optim. 33(2):1159–1190.Crossref, Google Scholar
[48] Luo Z-Q, Pang J-S, Ralph D (1996) Mathematical Programs with Equilibrium Constraints (Cambridge University Press, Cambridge, UK). Crossref, Google Scholar
[49] Ma X, Yao W, Ye JJ, Zhang J (2021) Combined approach with second-order optimality conditions for bilevel programming problems. Preprint, submitted July 31, https://arxiv.org/abs/2108.00179.Google Scholar
[50] Maclaurin D, Duvenaud D, Adams R (2015) Gradient-based hyperparameter optimization through reversible learning. Internat. Conf. Machine Learn. (PMLR, New York), 2113–2122.Google Scholar
[51] Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2018) Towards deep learning models resistant to adversarial attacks. Internat. Conf. Learn. Representations (MIT Press, Cambridge, MA).Google Scholar
[52] Mirrlees JA (1999) The theory of moral hazard and unobservable behaviour: Part I. Rev. Econom. Stud. 66(1):3–21.Crossref, Google Scholar
[53] Nesterov YE (2004) Introductory Lectures on Convex Optimization: A Basic Course (Kluwer Academic Publishers, MA).Crossref, Google Scholar
[54] Okuno T, Takeda A, Kawana A, Watanabe M (2021) On ℓp-hyperparameter learning via bilevel nonsmooth optimization. J. Machine Learn. Res. 22(1):11093–11139.Google Scholar
[55] Outrata J, Kocvara M, Zowe J (1998) Nonsmooth Approach to Optimization Problems with Equilibrium Constraints: Theory, Applications and Numerical Results, vol. 28 (Springer Science & Business Media, New York).Crossref, Google Scholar
[56] Pedregosa F (2016) Hyperparameter optimization with approximate gradient. Internat. Conf. Machine Learn. (PMLR, New York), 737–746.Google Scholar
[57] Rajeswaran A, Finn C, Kakade SM, Levine S (2019) Meta-learning with implicit gradients. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY), 113–124.Google Scholar
[58] Shen H, Chen T (2023) On penalty-based bilevel gradient descent method. Internat. Conf. Machine Learn. (PMLR, New York), 30992–31015.Google Scholar
[59] Shen X, Yu L (2013) CU splitting early termination based on weighted SVM. EURASIP J. Image Video Processing 2013(1):4.Crossref, Google Scholar
[60] Shi C, Lu J, Zhang G (2005) An extended Kuhn–Tucker approach for linear bilevel programming. Appl. Math. Comput. 162(1):51–63.Crossref, Google Scholar
[61] Shimizu K, Ishizuka Y, Bard JF (2012) Nondifferentiable and Two-Level Mathematical Programming (Springer Science & Business Media, New York).Google Scholar
[62] Sinha A, Malo P, Deb K (2017) A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Trans. Evolutionary Comput. 22(2):276–295.Crossref, Google Scholar
[63] Sow D, Ji K, Guan Z, Liang Y (2022) A primal-dual approach to bilevel optimization with multiple inner minima. Preprint, submitted March 1, https://arxiv.org/abs/2203.01123.Google Scholar
[64] Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R (2013) Intriguing properties of neural networks. Preprint, submitted December 21, https://arxiv.org/abs/1312.6199.Google Scholar
[65] Tseng P (2008) On Accelerated Proximal Gradient Methods for Convex-Concave Optimization (MIT Press, Cambridge, MA), 1–20.Google Scholar
[66] Vicente LN, Calamai PH (1994) Bilevel and multilevel programming: A bibliography review. J. Global Optim. 5(3):291–306.Crossref, Google Scholar
[67] Von Stackelberg H (2010) Market Structure and Equilibrium (Springer Science & Business Media, New York).Google Scholar
[68] Wang X, Pan R, Pi R, Zhang T (2023) Effective bilevel optimization via minimax reformulation. Preprint, submitted May 22, https://arxiv.org/abs/2305.13153.Google Scholar
[69] Ward D, Borwein JM (1987) Nonsmooth calculus in finite dimensions. SIAM J. Control Optim. 25(5):1312–1340.Crossref, Google Scholar
[70] Xanthopoulos P, Razzaghi T (2014) A weighted support vector machine method for control chart pattern recognition. Computers Indust. Engrg. 70:134–149.Crossref, Google Scholar
[71] Yang J, Ji K, Liang Y (2021) Provably faster algorithms for bilevel optimization. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 13670–13682.Google Scholar
[72] Yang X, Song Q, Wang Y (2007) A weighted support vector machine for data classification. Internat. J. Pattern Recognition Artificial Intellience 21(5):961–976.Crossref, Google Scholar
[73] Yao W, Yin H, Zeng S, Zhang J (2024) Overcoming lower-level constraints in bilevel optimization: A novel approach with regularized gap functions. Preprint, submitted June 4, https://arxiv.org/abs/2406.01992.Google Scholar
[74] Yao W, Yu C, Zeng S, Zhang J (2024) Constrained bi-level optimization: Proximal Lagrangian value function approach and Hessian-free algorithm. Preprint, submitted January 29, https://arxiv.org/abs/2401.16164.Google Scholar
[75] Ye JJ (2020) Constraint qualifications and optimality conditions in bilevel optimization. Bilevel Optimization (Springer, Berlin), 227–251.Crossref, Google Scholar
[76] Ye JJ, Yuan X, Zeng S, Zhang J (2023) Difference of convex algorithms for bilevel programs with applications in hyperparameter selection. Math. Programming 198(2):1583–1616.Crossref, Google Scholar

cover image Mathematics of Operations Research

Articles In Advance

Article Information

Metrics

Information

Received:May 12, 2024
Accepted:November 08, 2025
Published Online:January 06, 2026

Cite as

Zhaosong Lu, Sanyou Mei (2026) Solving Bilevel Optimization via Sequential Minimax Optimization. Mathematics of Operations Research 0(0).

https://doi.org/10.1287/moor.2024.0521

Keywords

Acknowledgments

This research was primarily conducted during Sanyou Mei’s PhD studies at the University of Minnesota.

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Solving Bilevel Optimization via Sequential Minimax Optimization

References

Articles In Advance

Article Information

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News