Analytical Solution to a Discrete-Time Model for Dynamic Learning and Decision Making

Hao Zhang
https://orcid.org/0000-0002-5078-9252
Sauder School of Business, University of British Columbia, Vancouver, British Columbia V6T 1Z2, Canada

Published Online: 1 Feb 2022. https://doi.org/10.1287/mnsc.2021.4194

References

Afèche P, Ata B (2013) Bayesian dynamic pricing in queueing systems with unknown delay cost characteristics. Manufacturing Service Oper. Management 15(2):292–304.
Aghion P, Bolton P, Harris C, Jullien B (1991) Optimal learning by experimentation. Rev. Econom. Stud. 58:621–654.
Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs. Proc. 30th Internat. Conf. Machine Learn. (PMLR), 127–135.
Alizamir S, de Véricourt F, Sun P (2013) Diagnostic accuracy under congestion. Management Sci. 59(1):157–171.
Alt B, Schultheis M, Koeppl H (2020) POMDPs in continuous time and discrete spaces. Proc. 34th Conf. Neural Inform. Processing Systems (NeurIPS), Vancouver, Canada.
Araman VF, Caldentey R (2009) Dynamic pricing for nonperishable products with demand learning. Oper. Res. 57(5):1169–1188.
Araman VF, Caldentey R (2020) Diffusion approximations for a class of sequential experimentation problems. Preprint, submitted November 2, 2019, https://dx.doi.org/10.2139/ssrn.3479676.
Arrow KJ, Blackwell D, Girshick MA (1949) Bayes and minimax solutions of sequential decision problems. Econometrica 17:213–244.
Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2/3):235–256.
Aurentz JL, Mach T, Vandebril R, Watkins DS (2015) Fast and backward stable computation of roots of polynomials. SIAM J. Matrix Anal. Appl. 36(3):942–973.
Aviv Y, Pazgal A (2005) A partially observed Markov decision process for dynamic pricing. Management Sci. 51:1400–1416.
Ayer T, Alagoz O, Stout NK (2012) OR Forum—A POMDP approach to personalize mammography screening decisions. Oper. Res. 60(5):1019–1034.
Bain A, Crisan D (2008) Fundamentals of Stochastic Filtering (Springer, New York).
Bartroff J, Lai TL, Shih M-C (2013) Sequential Experimentation in Clinical Trials: Design and Analysis (Springer, New York).
Bensoussan A (1992) Stochastic Control of Partially Observable Systems (Cambridge University Press, Cambridge, UK).
Berry DA, Fristedt B (1985) Bandit Problems, Sequential Allocation of Experiments (Chapman & Hall, New York).
Bertsekas DP (2012) Approximate dynamic programming, vol. II. Dynamic Programming and Optimal Control, 4th ed. (Athena Scientific, Belmont, MA).
Bertsekas DP (2017) Dynamic Programming and Optimal Control, vol. I, 4th ed. (Athena Scientific, Belmont, MA).
Bertsimas D, Mersereau AJ (2007) A learning approach for interactive marketing to a customer segment. Oper. Res. 55(6):1120–1135.
Bertsimas D, Perakis G (2006) Dynamic pricing: A learning approach. Hearn D, Lawphongpanich S, eds. Mathematical and Computational Models for Congestion Charging (Springer, Berlin), 45–79.
Besbes O, Zeevi A (2009) Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Oper. Res. 57(6):1407–1420.
Bini DA, Fiorentino G (2000) Design, analysis, and implementation of a multiprecision polynomial rootfinder. Numerical Algorithms 23:127–173.
Bini DA, Robol L (2014) Solving secular and polynomial equations: A multiprecision algorithm. J. Comput. Appl. Math. 272:276–292.
Bolton P, Harris C (1999) Strategic experimentation. Econometrica 67:349–374.
Bouneffouf D, Rish I, Aggarwal C (2020) Survey on applications of multi-armed and contextual bandits. Proc. IEEE Congress on Evolutionary Comput. (CEC), Glasgow, UK.
Broder J, Rusmevichientong P (2012) Dynamic pricing under a general parametric choice model. Oper. Res. 60(4):965–980.
Burnetas AN, Katehakis MN (1996) Optimal adaptive policies for sequential allocation problems. Adv. Appl. Math. 17(2):122–142.
Caro F, Gallien J (2007) Dynamic assortment with demand learning for seasonal consumer goods. Management Sci. 53(2):276–292.
Carvalho AX, Puterman ML (2005) Learning and pricing in an Internet environment with binomial demand. J. Revenue Pricing Management 3:320–336.
den Boer AV (2015) Dynamic pricing and learning: Historical origins, current research, and new directions. Survey Oper. Res. Management Sci. 20:1–18.
Easley D, Kiefer NM (1988) Controlling a stochastic process with unknown parameters. Econometrica 56:1045–1064.
Elaydi S (2005) An Introduction to Difference Equations, 3rd ed. (Springer, New York).
Farias VF, Van Roy B (2010) Dynamic pricing with a prior on market response. Oper. Res. 58(1):16–29.
Gittins JC (1979) Bandit processes and dynamic allocation indices. J. Royal Statist. Soc. B 41(2):148–177.
Harrison JM, Sunar N (2015) Investment timing with incomplete information and multiple means of learning. Oper. Res. 63(2):442–457.
Harrison JM, Keskin NB, Zeevi A (2012) Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Sci. 58:570–586.
Keller G, Rady S (1999) Optimal experimentation in a changing environment. Rev. Econom. Stud. 66:475–507.
Keskin NB, Zeevi A (2014) Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Oper. Res. 62(5):1142–1167.
Kwon HD, Lippman SA (2011) Acquisition of project-specific assets with Bayesian updating. Oper. Res. 59:1119–1130.
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6:4–22.
McLennan A (1984) Price dispersion and incomplete learning in the long run. J. Econom. Dynamic Control 7:331–347.
Moscarini G, Smith L (2001) The optimal level of experimentation. Econometrica 69:1629–1644.
Negoescu DM, Bimpikis K, Brandeau ML, Iancu DA (2018) Dynamic learning of patient response types: An application to treating chronic diseases. Management Sci. 64(8):3469–3488.
Peskir G, Shiryaev AN (2000) Sequential testing problems for Poisson processes. Ann. Statist. 28:837–859.
Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. (New Series) 58:527–535.
Romberg HF (1972) Continuous sequential testing of a Poisson process to minimize the Bayes risk. J. Amer. Statist. Assoc. 67:921–926.
Rothschild M (1974) A two-armed bandit theory of market pricing. J. Econom. Theory 9:185–202.
Saghafian S, Hopp WJ, Iravani SMR, Cheng Y, Diermeier D (2018) Workload management in telemedical physician triage and other knowledge-based service systems. Management Sci. 64(11):5180–5197.
Shiryaev AN (1967) Two problems of sequential analysis. Cybernet. Systems Anal. 3:63–69.
Smallwood RD, Sondik EJ (1973) The optimal control of partially observable Markov processes over a finite horizon. Oper. Res. 21:1071–1088.
Sondik EJ (1978) The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Oper. Res. 26:282–304.
Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction, 2nd ed. (MIT Press, Cambridge, MA).
Tartakovsky A, Nikiforov I, Basseville M (2014) Sequential Analysis: Hypothesis Testing and Changepoint Detection (Chapman and Hall/CRC, Boca Raton, FL).
Tsoukalas A, Albertson T, Tagkopoulos I (2015) From data to optimal decision making: A data-driven, probabilistic machine learning approach to decision support for patients with sepsis. JMIR Medical Inform. 3(1):e11.
Villar SS, Bowden J, Wason J (2015) Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Statist. Sci. 30(2):199–215.
Wald A (1945) Sequential tests of statistical hypotheses. Ann. Math. Statist. 16:117–186.
Wald A (1947) Foundations of a general theory of sequential decision functions. Econometrica 15:279–313.
Wald A, Wolfowitz J (1948) Optimum character of the sequential probability ratio test. Ann. Math. Statist. 19:326–339.
Wald A, Wolfowitz J (1950) Bayes solutions of sequential decision problems. Ann. Math. Statist. 21:82–99.
Wang G, Xiong J, Zhang S (2016) Partially observable stochastic optimal control. Internat. J. Numerical Anal. Modeling 13(3):493–512.
Whitehead J (1997) The Design and Analysis of Sequential Clinical Trials, 2nd ed. (John Wiley & Sons, West Sussex, UK).
Press WH (2009) Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research. Proc. National Acad. Sci. USA 106(52):22387–22392.
Zhang H (2010) Partially observable Markov decision processes: A geometric technique and analysis. Oper. Res. 58:214–228.
Zhang H (2021) Dynamic learning and decision making via basis weight vectors. Preprint, submitted September 1, 2020, https://dx.doi.org/10.2139/ssrn.3679048.
Zhang J, Denton BT, Balasubramanian H, Shah ND, Inman BA (2012) Optimization of prostate biopsy referral decisions. Manufacturing Service Oper. Management 14(4):529–547.
Ziegler GM (1995) Lectures on Polytopes (Graduate Texts in Mathematics) (Springer-Verlag, New York).

Volume 68, Issue 8

August 2022

Pages 5557-6354, iv-v


Received: September 10, 2020
Accepted: August 12, 2021
Published Online: February 1, 2022

Cite as

Hao Zhang (2022) Analytical Solution to a Discrete-Time Model for Dynamic Learning and Decision Making. Management Science 68(8):5924-5957.

https://doi.org/10.1287/mnsc.2021.4194
