Diversity Subsampling: Custom Subsamples from Large Data Sets

Published Online: https://doi.org/10.1287/ijds.2022.00017

References

  • Biyik E (2019) dpp_sampler.py. Accessed March 26, 2020, https://github.com/Stanford-ILIAD/DPP-Batch-Active-Learning/blob/master/classification_synthetic/dpp_sampler.py.
  • Biyik E, Wang K, Anari N, Sadigh D (2019) Batch active learning using determinantal point processes. Preprint, submitted June 19, https://arxiv.org/abs/1906.07975.
  • Chen Y, Zhang N (2022) Density regression with conditional support points. Technometrics 64(3):1–13.
  • Cook RL (1986) Stochastic sampling in computer graphics. ACM Trans. Graphics 5(1):51–72.
  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B 39(1):1–22.
  • Fanaee-T H, Gama J (2014) Event labeling combining ensemble detectors and background knowledge. Progress Artificial Intelligence 2:113–127.
  • Gelman A, Carlin JB, Stern HS, Rubin DB (1995) Bayesian Data Analysis (Chapman and Hall/CRC Press, Boca Raton, FL).
  • Han I, Gillenwater J (2020) MAP inference for customized determinantal point processes via maximum inner product search. Chiappa S, Calandra R, eds. Proc. Internat. Conf. on Artificial Intelligence and Statist. (PMLR, New York), 2797–2807.
  • Haussmann E, Fenzi M, Chitta K, Ivanecky J, Xu H, Roy D, Mittel A, et al. (2020) Scalable active learning for object detection. Proc. IEEE Intelligent Vehicles Sympos. (IEEE, New York), 1430–1435.
  • Huang C, Joseph VR, Mak S (2022) Population quasi-Monte Carlo. J. Comput. Graphics Statist. 31(3):1–14.
  • Joseph VR, Mak S (2021) Supervised compression of big data. Statist. Analysis Data Mining 14(3):217–229.
  • Kennard RW, Stone LA (1969) Computer aided design of experiments. Technometrics 11(1):137–148.
  • Ko CW, Lee J, Queyranne M (1995) An exact algorithm for maximum entropy sampling. Oper. Res. 43(4):684–691.
  • Mack Y, Rosenblatt M (1979) Multivariate k-nearest neighbor density estimates. J. Multivariate Anal. 9(1):1–15.
  • MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Le Cam LM, Neyman J, eds. Proc. 5th Berkeley Sympos. on Math. Statist. and Probability, vol. 1 (University of California Press, Downtown Oakland, CA), 281–297.
  • Mak S, Joseph VR (2018) Support points. Ann. Statist. 46(6A):2562–2592.
  • McCool M, Fiume E (1992) Hierarchical Poisson disk sampling distributions. Fiume E, ed. Proc. Conf. on Graphics Interface, vol. 92 (Canadian Information Processing Society, Mississauga, ON, Canada), 94–105.
  • Parzen E (1962) On estimation of a probability density function and mode. Ann. Math. Statist. 33(3):1065–1076.
  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, et al. (2011) Scikit-learn: Machine learning in Python. J. Machine Learn. Res. 12:2825–2830.
  • Puzyn T, Mostrag-Szlichtyng A, Gajewicz A, Skrzyński M, Worth AP (2011) Investigating the influence of data splitting on the predictive ability of QSAR/QSPR models. Structural Chemistry 22(4):795–804.
  • Ren P, Xiao Y, Chang X, Huang PY, Li Z, Gupta BB, Chen X, et al. (2021) A survey of deep active learning. ACM Comput. Survey 54(9):1–40.
  • Reynolds DA (2009) Gaussian mixture models. Encyclopedia Biometrics 741:659–663.
  • Rizzo M, Szekely G (2022) Energy: E-statistics: Multivariate inference via the energy of data. https://CRAN.R-project.org/package=energy.
  • Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27(3):832–837.
  • Rubin D (1987) A noniterative sampling/importance resampling alternative to data augmentation for creating a few imputations when fractions of missing information are modest: The SIR algorithm. J. Amer. Statist. Assoc. 82:544–546.
  • Rubin DB (1988) Using the SIR algorithm to simulate posterior distributions. Bayesian Statist. 3:395–402.
  • Shang B, Apley DW, Mehrotra S (2022) Fast diversity subsampling from a data set. Accessed June 2, 2022, https://pypi.org/project/FADS/.
  • Silveira AL, Barbeira PJS (2022) A fast and low-cost approach for the discrimination of commercial aged cachaças using synchronous fluorescence spectroscopy and multivariate classification. J. Sci. Food Agriculture 102(11):4918–4926.
  • Skare Ø, Bølviken E, Holden L (2003) Improved sampling-importance resampling and reduced bias importance sampling. Scandinavian J. Statist. 30(4):719–737.
  • Song D, Xi NM, Li JJ, Wang L (2022a) scsampler. Accessed February 9, 2022, https://github.com/SONGDONGYUAN1994/scsampler.
  • Song D, Xi NM, Li JJ, Wang L (2022b) scSampler: Fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. Bioinformatics 38(11):3126–3127.
  • Székely G (2003) E-statistics: The energy of statistical samples. Technical report, Bowling Green State University, Department of Mathematics and Statistics, Bowling Green, Ohio.
  • Székely GJ, Rizzo ML (2004) Testing for equal distributions in high dimension. InterStat 5:1–6.
  • Terrell GR, Scott DW (1992) Variable kernel density estimation. Ann. Statist. 20(3):1236–1265.
  • Wang Z, Garrett CR, Kaelbling LP, Lozano-Pérez T (2018) Active model learning and diverse action sampling for task and motion planning. Maciejewski AA, ed. Proc. IEEE/RSJ Internat. Conf. on Intelligent Robots and Systems (IEEE, New York), 4107–4114.
  • Wu D (2018) Pool-based sequential active learning for regression. IEEE Trans. Neural Network Learn. Systems 30(5):1348–1359.
  • Yu H, Kim S (2010) Passive sampling for regression. Proc. IEEE Internat. Conf. on Data Mining (IEEE, New York), 1151–1156.
  • Yuksel C (2015) Sample elimination for generating Poisson disk sample sets. Comput. Graphics Forum 34(2):25–32.
  • Yuksel C (2016) cySampleElim.h. Accessed September 17, 2020, https://github.com/cemyuksel/cyCodeBase/blob/master/cySampleElim.h.