Parallel Nonstationary Direct Policy Search for Risk-Averse Stochastic Optimization

Somayeh Moazeni
Corresponding Author
Somayeh Moazeni
[email protected]
School of Systems and Enterprises, Stevens Institute of Technology, Hoboken, New Jersey 07030
Search for more papers by this author
,
Warren B. Powell
Warren B. Powell
[email protected]
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544
Search for more papers by this author
,
Boris Defourny
Boris Defourny
[email protected]
Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, Pennsylvania 18015
Search for more papers by this author
,
Belgacem Bouzaiene-Ayari
Belgacem Bouzaiene-Ayari
[email protected]
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544
Search for more papers by this author

Somayeh Moazeni

Corresponding Author

Somayeh Moazeni

[email protected]

School of Systems and Enterprises, Stevens Institute of Technology, Hoboken, New Jersey 07030

Search for more papers by this author

Warren B. Powell

[email protected]

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544

Search for more papers by this author

Boris Defourny

[email protected]

Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, Pennsylvania 18015

Search for more papers by this author

Belgacem Bouzaiene-Ayari

[email protected]

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544

Search for more papers by this author

Published Online:12 Apr 2017https://doi.org/10.1287/ijoc.2016.0733

References

Audet C, Dennis JEJ (2002) Analysis of generalized pattern searches. SIAM J. Optim. 13(3):889–903.Crossref, Google Scholar
Baxter J, Bartlett PL (2001) Infinite-horizon policy-gradient estimation. J. Artificial Intelligence Res. 15(1):319–350.Crossref, Google Scholar
Bertsekas DP (2005) Dynamic Programming and Optimal Control, Vol. 1 3rd ed. (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas DP (2012) Dynamic Programming and Optimal Control, Vol. II, 4th ed., Approximate Dynamic Programming (Athena Scientific, Belmont, MA).Google Scholar
Bertsekas D, Tsitsiklis (1989) Parallel and Distributed Computation: Numerical Methods (Prentice-Hall, Englewood Cliffs, NJ).Google Scholar
Bertsekas DP, Tsitsiklis J (1996) Neuro-Dynamic Programming, 1st ed. (Athena Scientific, Belmont, MA).Google Scholar
Bertsimas D, Goyal V (2012) On the power and limitations of affine policies in two-stage adaptive optimization. Math. Programming 134(2):491–531.Crossref, Google Scholar
Bertsimas D, Iancu DA, Parrilo PA (2010) Optimality of affine policies in multistage robust optimization. Math. Oper. Res. 35(2):363–394.Link, Google Scholar
Brown PD, Lopes JAP, Matos MA (2008) Optimization of pumped storage capacity in an isolated power system with large renewable penetration. IEEE Trans. Power Systems 23(2):523–531.Crossref, Google Scholar
Burger M, Graeber B, Schindlmayr G (2008) Managing Energy Risk: An Integrated View on Power and Other Energy Markets (John Wiley and Sons, Chichester, UK).Google Scholar
Busoniu L, Babuska R, Schutter BD, Ernst D (2010) Reinforcement Learning and Dynamic Programming Using Function Approximators, 1st ed. (CRC Press, Boca Raton, FL).Crossref, Google Scholar
Carmona R, Ludkovski M (2010) Valuation of energy storage: An optimal switching approach. Quant. Finance 10(4):359–374.Crossref, Google Scholar
Censor Y, Zenios SA (1998) Parallel Optimization: Theory, Algorithms, and Applications (Oxford University Press, New York).Crossref, Google Scholar
Chen VCP, Ruppert D, Shoemaker CA (1999) Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming. Oper. Res. 47(1):38–53.Link, Google Scholar
Coleman TF, Li Y, Patron C (2007) Total risk minimization using Monte Carlo simulations. Birge JR, Linetsky V, eds. Handbooks in Operations Research and Management Science (Elsevier, Amsterdam), 593–635.Google Scholar
Coleman TF, Li Y, Verma A (1999) Reconstructing the unknown local volatility function. J. Comput. Finance 2(3):77–102.Crossref, Google Scholar
Defourny B, Ernst D, Wehenkel L (2008) Risk-aware decision making and dynamic programming. Proc. 22nd Annual Conf. Neural Inform. Processing Systems, Vancouver, Canada, 1–8.Google Scholar
Dixon LCW, Jha M (1993) Parallel algorithms for global optimization. J. Optim. Theory Appl. 79(2):385–395.Crossref, Google Scholar
Dolan ED, Lewis RM, Torczon V (2003) On the local convergence of pattern search. SIAM J. Optim. 14(2):567–583.Crossref, Google Scholar
Duffie D, Pan J (1997) An overview of value at risk. J. Derivatives 4(3):7–49.Crossref, Google Scholar
Giuliani M, Castelletti A, Pianosi F, Mason E, Reed P (2015) Curses, tradeoffs, and scalable management: Advancing evolutionary multiobjective direct policy search to improve water reservoir operations. J. Water Resources Planning Management 142(2): Article no. 04015050.Crossref, Google Scholar
Golub G, Ortega JM (1993) Scientific Computing: An Introduction with Parallel Computing (Academic Press, San Diego).Crossref, Google Scholar
Gonzalez JG, de la Muela RMR, Santos LM, Gonzalez AM (2008) Stochastic joint optimization of wind generation and pumped-storage units in an electricity market. IEEE Trans. Power Systems 23(2):460–468.Crossref, Google Scholar
Grondman I, Busoniu L, Lopes GAD, Babuska R (2012) A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Systems, Man, Cybernetics—Part C: Appl. Rev. 42(6):1291–1307.Crossref, Google Scholar
Harsha P, Dahleh M (2015) Optimal management and sizing of energy storage under dynamic pricing for the efficient integration of renewable energy. IEEE Trans. Power Systems 30(3):1164–1181.Crossref, Google Scholar
Hough PD, Kolda TG, Torczon VJ (2001) Asynchronous parallel pattern search for nonlinear optimization. SIAM J. Sci. Comput. 23(1):134–156.Crossref, Google Scholar
Johnson SA, Stedinger JR, Shoemaker CA, Li Y, Tejada-Guibert JA (1993) Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Oper. Res. 41(3):484–500.Link, Google Scholar
Jones DR (2001) A taxonomy of global optimization methods based on response surfaces. J. Global Optim. 21(4):345–383.Crossref, Google Scholar
Kakade SM (2002) A natural policy gradient. Dietterich TG, Becker S, Ghahramami Z, eds. Proc. 14th Internat. Conf. Neural Inform. Processing Systems: Natural Synthentic (MIT Press, Cambridge, MA), 1531–1538.Google Scholar
Keller PW, Mannor S, Precup D (2006) Automatic basis function construction for approximate dynamic programming and reinforcement learning. Proc. 23rd Internat. Conf. Machine Learn. (ACM, New York), 449–456.Crossref, Google Scholar
Kim JH, Powell WB (2011) Optimal energy commitments with storage and intermittent supply. Oper. Res. 59(6):1347–1360.Link, Google Scholar
Kober JR, Peters JR (2009) Policy search for motor primitives in robotics. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Adv. Neural Inform. Processing Systems 21 (Curran Associates, Inc., Red Hook, NY), 849–856.Google Scholar
Kolda TG (2005) Revisiting asynchronous parallel pattern search for nonlinear optimization. SIAM J. Optim. 16(2):563–586.Crossref, Google Scholar
Kormushev P, Caldwell DG (2012) Direct policy search reinforcement learning based on particle filtering. 10th Eur. Workshop Reinforcement Learn. (EWRL 2012), Edinburgh, UK.Google Scholar
Kuhn D, Wiesemann W, Georghiou A (2011) Primal and dual linear decision rules in stochastic and robust optimization. Math. Programming 130(1):177–209.Crossref, Google Scholar
Lai G, Margot F, Secomandi N (2010) An approximate dynamic programming approach to benchmark practice-based heuristics for natural gas storage valuation. Oper. Res. 58(3):564–582.Link, Google Scholar
Lewis RM, Torczon V, Trosset MW (1998) Why pattern search works. Optima (59):1–7.Google Scholar
Lewis RM, Torczon V, Trosset MW (2000) Direct search methods: Then and now. J. Comput. Appl. Math. 124(1):191–207.Crossref, Google Scholar
MacKay DJ (2009) Sustainable Energy-Without the Hot Air, 1st ed. (UIT Cambridge Ltd., Cambridge, UK).Google Scholar
Mannor S, Rubinstein R, Gat Y (2003) The cross entropy method for fast policy search. Proc. 20th Internat. Conf. Machine Learn. (ICML-2003), Washington, DC, 512–519.Google Scholar
Moazeni S, Coleman TF, Li Y (2016) Smoothing and parametric rules for stochastic mean-CVAR optimal execution strategy. Ann. Oper. Res. 237(1):99–120.Crossref, Google Scholar
Moazeni S, Powell WB, Hajimiragha AH (2015) Mean-conditional value-at-risk optimal energy storage operation in the presence of transaction costs. IEEE Trans. Power Systems 30(3):1222–1232.Crossref, Google Scholar
Munos R (2003) Error bounds for approximate policy iteration. Proc. 20th Internat. Conf. Machine Learn. (ICML-2003), Washington, DC, 560–567.Google Scholar
Ng AY, Jordan M (2000) PEGASUS: A policy search method for large MDPs and POMDPs. Proc. 16th Conf. Uncertainty Artificial Intelligence (Morgan Kaufmann Publishers Inc., San Francisco),406–415.Google Scholar
Nocedal J, Wright S (2006) Numerical Optimization, 2nd ed. (Springer, New York).Google Scholar
Peters J, Vijayakumar S, Schaal S (2005) Natural actor-critic. Gama J, Camacho R, Brazdil PB, Jorge AM, Torgo L, eds. Machine Learning: ECML 2005: 16th Eur. Conf. Maching Learn., Lecture Notes in Computer Science, Vol. 3720 (Springer, Berlin), 280–291.Crossref, Google Scholar
Powell WB (2011) Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. (John Wiley & Sons, New York).Crossref, Google Scholar
Riedel F (2004) Dynamic coherent risk measures. Stochastic Processes Their Appl. 112(2):185–200.Crossref, Google Scholar
Rios LM, Sahinidis NV (2013) Derivative-free optimization: A review of algorithms and comparison of software implementations. J. Global Optim. 56(3):1247–1293.Crossref, Google Scholar
Rockafellar RT, Uryasev S (2000) Optimization of conditional value-at-risk. J. Risk 2(3):21–41.Crossref, Google Scholar
Rudloff B, Street A, Valladao DM (2014) Time consistency and risk averse dynamic decision models: Definition, interpretation and practical consequences. Eur. J. Oper. Res. 234(3):743–750.Crossref, Google Scholar
Scott WR, Frazier P, Powell WB (2011) The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM J. Optim. 21(3):996–1026.Crossref, Google Scholar
Scott WR, Powell WB, Moazeni S (2014) Least squares policy iteration with instrumental variables vs. direct policy search: Comparison against optimal benchmarks using energy storage. http://arxiv.org/pdf/1401.0843v1.pdf.Google Scholar
Shafi A, Carpenter B, Baker M (2009) Nested parallelism for multi-core HPC systems using java. J. Parallel Distributed Comput. 69(6):532–545.Crossref, Google Scholar
Shapiro A (2012) Time consistency of dynamic risk measures. Oper. Res. Lett. 40(6):436–439.Crossref, Google Scholar
Silver D, Lever G, Heess N, Degris T, Wierstra D, Riedmiller M (2014) Deterministic policy gradient algorithms. Proc. 31st Internat. Conf. Machine Learn., Beijing, I-387–I-395.Google Scholar
Strens MJA, Moore AW (2003) Policy search using paired comparisons. J. Machine Learn. Res. 3(3):921–950.Google Scholar
Sutton R, Barto A (1998) Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA).Google Scholar
Sutton RS, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, Vol. 12 (MIT Press, Cambridge, MA), 1057–1063.Google Scholar
Torczon V (1997) On the convergence of pattern search algorithms. SIAM J. Optim. 7(1):1–25.Crossref, Google Scholar
Tsitsiklis JN, Roy BV (1996) Feature-based methods for large scale dynamic programming. Machine Learn. 22(1):59–94.Crossref, Google Scholar
Yamakawa E, Fukushima M (1996) A block-parallel conjugate gradient method for separable quadratic programming problems. J. Oper. Res. Soc. Japan 39(3):407–427.Crossref, Google Scholar
Yu H, Bertsekas DP (2009) Basis function adaptation methods for cost approximation in MDP. IEEE Sympos. Adaptive Dynam. Programming Reinforcement Learn., Nashville, TN, 74–81.Crossref, Google Scholar
Yu H, Bertsekas DP (2010) Error bounds for approximations from projected linear equations. Math. Oper. Res. 35(2):306–329.Link, Google Scholar
Zhou Y, Scheller-Wolf A, Secomandi N, Smith S (2014) Managing wind-based electricity generation in the presence of storage and transmission capacity. Tepper Working Paper 2011-E36 1–38, Carnegie Mellon University, Pittsburgh, PA.Google Scholar

cover image INFORMS Journal on Computing

Volume 29, Issue 2

Spring 2017

Pages 197-376

Article Information

Supplemental Material

Metrics

Information

Received:March 16, 2015
Accepted:September 13, 2016
Published Online:April 12, 2017

Cite as

Somayeh Moazeni, Warren B. Powell, Boris Defourny, Belgacem Bouzaiene-Ayari (2017) Parallel Nonstationary Direct Policy Search for Risk-Averse Stochastic Optimization. INFORMS Journal on Computing 29(2):332-349.

https://doi.org/10.1287/ijoc.2016.0733

Keywords

PDF download

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Available Issues

Parallel Nonstationary Direct Policy Search for Risk-Averse Stochastic Optimization

References

Volume 29, Issue 2

Article Information

Supplemental Material

Metrics

Information

Cite as

Keywords

Sign Up for INFORMS Publications Updates and News