Diversity Subsampling: Custom Subsamples from Large Data Sets

Boyang Shang
https://orcid.org/0000-0001-5379-6880
Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208

Daniel W. Apley (Corresponding Author)
https://orcid.org/0000-0002-8545-4612
Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208

Sanjay Mehrotra
https://orcid.org/0000-0003-1106-1901
Industrial Engineering and Management Science, Northwestern University, Evanston, Illinois 60208

Published Online: 22 Nov 2023
https://doi.org/10.1287/ijds.2022.00017

References

Biyik E (2019) dpp_sampler.py. Accessed March 26, 2020, https://github.com/Stanford-ILIAD/DPP-Batch-Active-Learning/blob/master/classification_synthetic/dpp_sampler.py.
Biyik E, Wang K, Anari N, Sadigh D (2019) Batch active learning using determinantal point processes. Preprint, submitted June 19, https://arxiv.org/abs/1906.07975.
Chen Y, Zhang N (2022) Density regression with conditional support points. Technometrics 64(3):1–13.
Cook RL (1986) Stochastic sampling in computer graphics. ACM Trans. Graphics 5(1):51–72.
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J. Royal Statist. Soc. B 39(1):1–22.
Fanaee-T H, Gama J (2014) Event labeling combining ensemble detectors and background knowledge. Progress Artificial Intelligence 2:113–127.
Gelman A, Carlin JB, Stern HS, Rubin DB (1995) Bayesian Data Analysis (Chapman and Hall/CRC Press, Boca Raton, FL).
Han I, Gillenwater J (2020) Map inference for customized determinantal point processes via maximum inner product search. Chiappa S, Calandra R, eds. Proc. Internat. Conf. on Artificial Intelligence and Statist. (PMLR, New York), 2797–2807.
Haussmann E, Fenzi M, Chitta K, Ivanecky J, Xu H, Roy D, Mittel A, et al. (2020) Scalable active learning for object detection. Proc. IEEE Intelligent Vehicles Sympos. (IEEE, New York), 1430–1435.
Huang C, Joseph VR, Mak S (2022) Population quasi-Monte Carlo. J. Comput. Graphics Statist. 31(3):1–14.
Joseph VR, Mak S (2021) Supervised compression of big data. Statist. Analysis Data Mining 14(3):217–229.
Kennard RW, Stone LA (1969) Computer aided design of experiments. Technometrics 11(1):137–148.
Ko CW, Lee J, Queyranne M (1995) An exact algorithm for maximum entropy sampling. Oper. Res. 43(4):684–691.
Mack Y, Rosenblatt M (1979) Multivariate k-nearest neighbor density estimates. J. Multivariate Anal. 9(1):1–15.
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Le Cam LM, Neyman J, eds. Proc. 5th Berkeley Sympos. on Math. Statist. and Probability, vol. 1 (University of California Press, Downtown Oakland, CA), 281–297.
Mak S, Joseph VR (2018) Support points. Ann. Statist. 46(6A):2562–2592.
McCool M, Fiume E (1992) Hierarchical poisson disk sampling distributions. Fiume E, ed. Proc. Conf. on Graphics Interface, vol. 92 (Canadian Information Processing Society, Mississauga, ON, Canada), 94–105.
Parzen E (1962) On estimation of a probability density function and mode. Ann. Math. Statist. 33(3):1065–1076.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, et al. (2011) Scikit-learn: Machine learning in Python. J. Machine Learn. Res. 12:2825–2830.
Puzyn T, Mostrag-Szlichtyng A, Gajewicz A, Skrzyński M, Worth AP (2011) Investigating the influence of data splitting on the predictive ability of qsar/qspr models. Structural Chemistry 22(4):795–804.
Ren P, Xiao Y, Chang X, Huang PY, Li Z, Gupta BB, Chen X, et al. (2021) A survey of deep active learning. ACM Comput. Survey 54(9):1–40.
Reynolds DA (2009) Gaussian mixture models. Encyclopedia Biometrics 741:659–663.
Rizzo M, Szekely G (2022) Energy: E-statistics: Multivariate inference via the energy of data. https://CRAN.R-project.org/package=energy.
Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27(3):832–837.
Rubin D (1987) A noniterative sampling/importance resampling alternative to data augmentation for creating a few imputations when fractions of missing information are modest: The sir algorithm. J. Amer. Statist. Assoc. 82:544–546.
Rubin DB (1988) Using the sir algorithm to simulate posterior distributions. Bayesian Statist. 3:395–402.
Shang B, Apley DW, Mehrotra S (2022) Fast diversity subsampling from a data set. Accessed June 2, 2022, https://pypi.org/project/FADS/.
Silveira AL, Barbeira PJS (2022) A fast and low-cost approach for the discrimination of commercial aged cachaças using synchronous fluorescence spectroscopy and multivariate classification. J. Sci. Food Agriculture 102(11):4918–4926.
Skare Ø, Bølviken E, Holden L (2003) Improved sampling-importance resampling and reduced bias importance sampling. Scandinavian J. Statist. 30(4):719–737.
Song D, Xi NM, Li JJ, Wang L (2022a) scsampler. Accessed February 9, 2022, https://github.com/SONGDONGYUAN1994/scsampler.
Song D, Xi NM, Li JJ, Wang L (2022b) scSampler: Fast diversity-preserving subsampling of large-scale single-cell transcriptomic data. Bioinformatics 38(11):3126–3127.
Székely G (2003) E-statistics: The energy of statistical samples. Technical report, Bowling Green State University, Department of Mathematics and Statistics, Bowling Green, Ohio.
Székely GJ, Rizzo ML (2004) Testing for equal distributions in high dimension. InterStat 5:1–6.
Terrell GR, Scott DW (1992) Variable kernel density estimation. Ann. Statist. 20(3):1236–1265.
Wang Z, Garrett CR, Kaelbling LP, Lozano-Pérez T (2018) Active model learning and diverse action sampling for task and motion planning. Maciejewski AA, ed. Proc. IEEE/RSJ Internat. Conf. on Intelligent Robots and Systems (IEEE, New York), 4107–4114.
Wu D (2018) Pool-based sequential active learning for regression. IEEE Trans. Neural Network Learn. Systems 30(5):1348–1359.
Yu H, Kim S (2010) Passive sampling for regression. Proc. IEEE Internat. Conf. on Data Mining (IEEE, New York), 1151–1156.
Yuksel C (2015) Sample elimination for generating poisson disk sample sets. Comput. Graphics Forum 34(2):25–32.
Yuksel C (2016) cysampleelim.h. Accessed September 17, 2020, https://github.com/cemyuksel/cyCodeBase/blob/master/cySampleElim.h.

INFORMS Journal on Data Science

Volume 2, Issue 2

October-December 2023

Pages 99-217, C2

Information

Received: June 13, 2022
Accepted: September 20, 2023
Published Online: November 22, 2023

Cite as

Boyang Shang, Daniel W. Apley, Sanjay Mehrotra (2023) Diversity Subsampling: Custom Subsamples from Large Data Sets. INFORMS Journal on Data Science 2(2):161-182.

https://doi.org/10.1287/ijds.2022.00017
