Wasserstein Distributionally Robust Shallow Convex Neural Networks

Julien Pallage (Corresponding Author)
[email protected]
https://orcid.org/0009-0001-1689-3021
Department of Electrical Engineering, Polytechnique Montréal, Montréal, Québec H3T 0A3, Canada; and GERAD, Montréal, Québec H3T 2A7, Canada; and Mila, Montréal, Québec H2S 3H1, Canada

Antoine Lesage-Landry
[email protected]
https://orcid.org/0000-0001-9652-6557
Department of Electrical Engineering, Polytechnique Montréal, Montréal, Québec H3T 0A3, Canada; and GERAD, Montréal, Québec H3T 2A7, Canada; and Mila, Montréal, Québec H2S 3H1, Canada

Published Online: 26 Aug 2025
https://doi.org/10.1287/ijoo.2024.0048

INFORMS Journal on Optimization

Volume 8, Issue 1

Winter 2026

Pages 1-93, ii

Article Information

Received: August 09, 2024
Accepted: July 19, 2025
Published Online: August 26, 2025

Cite as

Julien Pallage, Antoine Lesage-Landry (2025) Wasserstein Distributionally Robust Shallow Convex Neural Networks. INFORMS Journal on Optimization 8(1):61-93.

https://doi.org/10.1287/ijoo.2024.0048

Acknowledgments

Special thanks to Salma Naccache and Bertrand Scherrer for their active support and the enriching discussions, as well as to Steve Boursiquot, Ahmed Abdellatif, and Odile Noël from Hilo for making this project possible.
