Solving Bilevel Optimization via Sequential Minimax Optimization
Published Online:6 Jan 2026https://doi.org/10.1287/moor.2024.0521
References
- [1] (2013) Solving bilevel programs with the KKT-approach. Math. Programming 138(1):309–332.Crossref, Google Scholar
- [2] (2013) Practical Bilevel Optimization: Algorithms and Applications, vol. 30 (Springer Science & Business Media, New York).Google Scholar
- [3] (2008) Bilevel optimization and machine learning. IEEE World Congress Comput. Intelligence (Springer, Berlin), 25–47.Google Scholar
- [4] (2018) Meta-learning with differentiable closed-form solvers. Internat. Conf. Learning Representations (OpenReview, Alameda, CA).Google Scholar
- [5] (2011) LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems Tech. 2(3):1–27.Crossref, Google Scholar
- [6] (2023) Bilevel optimization without lower-level strong convexity from the hyper-objective perspective. Preprint, submitted January 2, https://arxiv.org/abs/2301.00712.Google Scholar
- [7] (2022) A single-timescale method for stochastic bilevel optimization. Internat. Conf. Artificial Intelligence Statist. (PMLR, New York), 2466–2488.Google Scholar
- [8] (1990) Optimization and Nonsmooth Analysis (SIAM, Philadelphia).Crossref, Google Scholar
- [9] (2007) An overview of bilevel optimization. Ann. Oper. Res. 153(1):235–256.Crossref, Google Scholar
- [10] (2022) Bilevel methods for image reconstruction. Foundations Trends Signal Processing 15(2–3):121–289.Crossref, Google Scholar
- [11] (2020) Optimality conditions for constrained minimax optimization. Preprint, submitted April 21, https://arxiv.org/abs/2004.09730.Google Scholar
- [12] (2024) Optimality conditions and numerical algorithms for a class of linearly constrained minimax optimization problems. SIAM J. Optim. 34(3):2883–2916.Crossref, Google Scholar
- [13] (2002) Foundations of Bilevel Programming (Springer Science & Business Media, New York).Google Scholar
- [14] (2013) The bilevel programming problem: Reformulations, constraint qualifications and optimality conditions. Math. Programming 138(1):447–473.Crossref, Google Scholar
- [15] (2020) Bilevel optimization. Springer Optimization and Its Applications, vol. 161 (Springer, Berlin).Google Scholar
- [16] (2015) Bilevel programming problems. Energy Systems, vol. 10 (Springer, Berlin).Google Scholar
- [17] (2019) Hyperparameter optimization. Automated Machine Learning (Springer, Cham, Switzerland), 3–33.Crossref, Google Scholar
- [18] (2017) Forward and reverse gradient-based hyperparameter optimization. Internat. Conf. Machine Learn. (PMLR, New York), 1165–1173.Google Scholar
- [19] (2018) Bilevel programming for hyperparameter optimization and meta-learning. Internat. Conf. Machine Learn. (PMLR, New York), 1568–1577.Google Scholar
- [20] (2014) CVX: Matlab software for disciplined convex programming, version 2.1, https://cvxr.com/cvx/citing/.Google Scholar
- [21] (2020) On the iteration complexity of hypergradient computation. Internat. Conf. Machine Learn. (PMLR, New York), 3748–3758.Google Scholar
- [22] (2021) Randomized stochastic variance-reduced methods for multi-task stochastic bilevel optimization. Preprint, submitted May 5, https://arxiv.org/abs/2105.02266.Google Scholar
- [23] (1992) New branch-and-bound rules for linear bilevel programming. SIAM J. Sci. Statist. Comput. 13(5):1194–1217.Crossref, Google Scholar
- [24] (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2 (Springer, Berlin).Crossref, Google Scholar
- [25] (2023) A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM J. Optim. 33(1):147–180.Crossref, Google Scholar
- [26] (2023) An improved unconstrained approach for bilevel optimization. SIAM J. Optim. 33(4):2801–2829.Crossref, Google Scholar
- [27] (2021) BiAdam: Fast adaptive bilevel optimization methods. Preprint, submitted June 21, https://arxiv.org/abs/2106.11396.Google Scholar
- [28] (2025) Efficiently escaping saddle points in bilevel optimization. J. Machine Learn. Res. 26(1):1–61.Google Scholar
- [29] (1992) Double penalty method for bilevel optimization problems. Ann. Oper. Res. 34(1):73–88.Crossref, Google Scholar
- [30] (2021) Bilevel optimization: Convergence analysis and enhanced design. Internat. Conf. Machine Learn., vol. 139 (PMLR, New York), 4882–4892.Google Scholar
- [31] (2020) Convergence of meta-learning with task-specific adaptation over partial parameters. Adv. Neural Inform. Processing Systems, vol. 33 (Curran Associates Inc., Red Hook, NY), 11490–11500.Google Scholar
- [32] (2021) A near-optimal algorithm for stochastic bilevel optimization via double-momentum. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates Inc., Red Hook, NY), 30271–30283.Google Scholar
- [33] (1999) Actor-critic algorithms. Adv. Neural Inform. Processing Systems, vol. 12 (MIT Press, Cambridge, MA), 1008–1014.Google Scholar
- [34] (2021) An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. SIAM J. Optim. 31(4):2558–2585.Crossref, Google Scholar
- [35] (2022) The first optimal algorithm for smooth and strongly-convex-strongly-concave minimax optimization. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 14691–14703.Google Scholar
- [36] (2023) A fully first-order method for stochastic bilevel optimization. Internat. Conf. Machine Learn. (PMLR, New York), 18083–18113.Google Scholar
- [37] (2022) A fully single loop algorithm for bilevel optimization without hessian inverse. Proc. AAAI Conf. Artificial Intelligence 36(7):7426–7434.Google Scholar
- [38] (2023) A novel approach for bilevel programs based on Wolfe duality. Preprint, submitted February 14, https://arxiv.org/abs/2302.06838.Google Scholar
- [39] (2015) An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4):2244–2273.Crossref, Google Scholar
- [40] (2018) Darts: Differentiable architecture search. Internat. Conf. Learn. Representations, vol. 176 (Elsevier, Amsterdam), 42–48.Google Scholar
- [41] (2021) Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. IEEE Trans. Pattern Anal. Machine Intelligence 44(12):10045–10067.Crossref, Google Scholar
- [42] (2022) Bome! Bilevel optimization made easy: A simple first-order approach. Adv. Neural Inform. Processing Systems, vol. 35 (Curran Associates Inc., Red Hook, NY), 17248–17262.Google Scholar
- [43] (2017) Gradient episodic memory for continual learning. Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. Adv. Neural Inform. Processing Systems, vol. 30 (Curran Associates Inc., Red Hook, NY), 6470–6479.Google Scholar
- [44] (2024) A first-order method for nonconvex-strongly-concave constrained minimax optimization. Preprint, submitted January 5, 2023, https://arxiv.org/abs/2301.02060.Google Scholar
- [45] (2024) First-order penalty methods for bilevel optimization. SIAM J. Optim. 34(2):1937–1969.Crossref, Google Scholar
- [46] (2025) A first-order augmented Lagrangian method for constrained minimax optimization. Math. Programming 213:1063–1104.Crossref, Google Scholar
- [47] (2023) Iteration-complexity of first-order augmented Lagrangian methods for convex conic programming. SIAM J. Optim. 33(2):1159–1190.Crossref, Google Scholar
- [48] (1996) Mathematical Programs with Equilibrium Constraints (Cambridge University Press, Cambridge, UK). Crossref, Google Scholar
- [49] (2021) Combined approach with second-order optimality conditions for bilevel programming problems. Preprint, submitted July 31, https://arxiv.org/abs/2108.00179.Google Scholar
- [50] (2015) Gradient-based hyperparameter optimization through reversible learning. Internat. Conf. Machine Learn. (PMLR, New York), 2113–2122.Google Scholar
- [51] (2018) Towards deep learning models resistant to adversarial attacks. Internat. Conf. Learn. Representations (MIT Press, Cambridge, MA).Google Scholar
- [52] (1999) The theory of moral hazard and unobservable behaviour: Part I. Rev. Econom. Stud. 66(1):3–21.Crossref, Google Scholar
- [53] (2004) Introductory Lectures on Convex Optimization: A Basic Course (Kluwer Academic Publishers, MA).Crossref, Google Scholar
- [54] (2021) On ℓp-hyperparameter learning via bilevel nonsmooth optimization. J. Machine Learn. Res. 22(1):11093–11139.Google Scholar
- [55] (1998) Nonsmooth Approach to Optimization Problems with Equilibrium Constraints: Theory, Applications and Numerical Results, vol. 28 (Springer Science & Business Media, New York).Crossref, Google Scholar
- [56] (2016) Hyperparameter optimization with approximate gradient. Internat. Conf. Machine Learn. (PMLR, New York), 737–746.Google Scholar
- [57] (2019) Meta-learning with implicit gradients. Adv. Neural Inform. Processing Systems, vol. 32 (Curran Associates Inc., Red Hook, NY), 113–124.Google Scholar
- [58] (2023) On penalty-based bilevel gradient descent method. Internat. Conf. Machine Learn. (PMLR, New York), 30992–31015.Google Scholar
- [59] (2013) CU splitting early termination based on weighted SVM. EURASIP J. Image Video Processing 2013(1):4.Crossref, Google Scholar
- [60] (2005) An extended Kuhn–Tucker approach for linear bilevel programming. Appl. Math. Comput. 162(1):51–63.Crossref, Google Scholar
- [61] (2012) Nondifferentiable and Two-Level Mathematical Programming (Springer Science & Business Media, New York).Google Scholar
- [62] (2017) A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Trans. Evolutionary Comput. 22(2):276–295.Crossref, Google Scholar
- [63] (2022) A primal-dual approach to bilevel optimization with multiple inner minima. Preprint, submitted March 1, https://arxiv.org/abs/2203.01123.Google Scholar
- [64] (2013) Intriguing properties of neural networks. Preprint, submitted December 21, https://arxiv.org/abs/1312.6199.Google Scholar
- [65] (2008) On Accelerated Proximal Gradient Methods for Convex-Concave Optimization (MIT Press, Cambridge, MA), 1–20.Google Scholar
- [66] (1994) Bilevel and multilevel programming: A bibliography review. J. Global Optim. 5(3):291–306.Crossref, Google Scholar
- [67] (2010) Market Structure and Equilibrium (Springer Science & Business Media, New York).Google Scholar
- [68] (2023) Effective bilevel optimization via minimax reformulation. Preprint, submitted May 22, https://arxiv.org/abs/2305.13153.Google Scholar
- [69] (1987) Nonsmooth calculus in finite dimensions. SIAM J. Control Optim. 25(5):1312–1340.Crossref, Google Scholar
- [70] (2014) A weighted support vector machine method for control chart pattern recognition. Computers Indust. Engrg. 70:134–149.Crossref, Google Scholar
- [71] (2021) Provably faster algorithms for bilevel optimization. Adv. Neural Inform. Processing Systems, vol. 34 (Curran Associates, Inc., Red Hook, NY), 13670–13682.Google Scholar
- [72] (2007) A weighted support vector machine for data classification. Internat. J. Pattern Recognition Artificial Intellience 21(5):961–976.Crossref, Google Scholar
- [73] (2024) Overcoming lower-level constraints in bilevel optimization: A novel approach with regularized gap functions. Preprint, submitted June 4, https://arxiv.org/abs/2406.01992.Google Scholar
- [74] (2024) Constrained bi-level optimization: Proximal Lagrangian value function approach and Hessian-free algorithm. Preprint, submitted January 29, https://arxiv.org/abs/2401.16164.Google Scholar
- [75] (2020) Constraint qualifications and optimality conditions in bilevel optimization. Bilevel Optimization (Springer, Berlin), 227–251.Crossref, Google Scholar
- [76] (2023) Difference of convex algorithms for bilevel programs with applications in hyperparameter selection. Math. Programming 198(2):1583–1616.Crossref, Google Scholar

