Cost-Restricted Feature Selection for Data Acquisition

Xiaoping Liu
D’Amore-McKim School of Business, Northeastern University, Boston, Massachusetts 02115

Xiao-Bai Li
https://orcid.org/0000-0001-8009-8439
Department of Operations and Information Systems, Manning School of Business, University of Massachusetts Lowell, Lowell, Massachusetts 01854

Sumit Sarkar (Corresponding Author)
https://orcid.org/0000-0003-3045-1024
Naveen Jindal School of Management, University of Texas at Dallas, Richardson, Texas 75080

Published Online: 29 Sep 2022
https://doi.org/10.1287/mnsc.2022.4551

References

Agresti A (2002) Categorical Data Analysis (John Wiley & Sons, Hoboken, NJ).
Aiken LS, West SG, Reno RR (1991) Multiple Regression: Testing and Interpreting Interactions (Sage, Thousand Oaks, CA).
Berndt ER (1991) The Practice of Econometrics (Addison-Wesley, New York).
Bhattacharyya S (1999) Direct marketing performance modeling using genetic algorithms. INFORMS J. Comput. 11(3):248–257.
Bolón-Canedo V, Porto-Díaz I, Sánchez-Maroño N, Alonso-Betanzos A (2014) A framework for cost-based feature selection. Pattern Recognition 47(7):2481–2489.
Breiman L (2001) Random forests. Machine Learn. 45(1):5–32.
Bult JR, Wansbeek T (1995) Optimal selection for direct mail. Marketing Sci. 14(4):378–394.
Chen H, Chiang RHL, Storey VC (2012) Business intelligence and analytics: From big data to big impact. Management Inform. Systems Quart. 36(4):1165–1188.
Davenport TH (2006) Competing on analytics. Harvard Bus. Rev. 84(1):99–107.
Deng K, Zheng Y, Bourke C, Scott S, Masciale J (2013) New algorithms for budgeted learning. Machine Learn. 90(1):59–90.
DirectMail.com (2020) Mailing list pricing. Accessed July 10, 2020, https://www.directmail.com/mailinglists/mailing-list-pricing.
Federal Trade Commission (2014) Data brokers: A call for transparency and accountability. Accessed July 10, 2020, http://www.ftc.gov/system/files/documents/reports/data-brokers-call-transparency-accountability-report-federal-trade-commission-may-2014/140527databrokerreport.pdf.
Gaines BR, Kim J, Zhou H (2018) Algorithms for fitting the constrained Lasso. J. Computational Graphical Statist. 27(4):861–871.
Genuer R, Poggi JM, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recognition Lett. 31(14):2225–2236.
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J. Machine Learn. Res. 3:1157–1182.
Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York).
Hong S-H (2013) Measuring the effect of Napster on recorded music sales: Difference-in-differences estimates under compositional changes. J. Appl. Econometrics 28(2):297–324.
Jaccard J, Turrisi R (2003) Interaction Effects in Multiple Regression (Sage, Thousand Oaks, CA).
James GM, Paulson C, Rusmevichientong P (2020) Penalized and constrained regression: An application to high-dimensional website advertising. J. Amer. Statist. Assoc. 115(529):107–122.
Jensen R, Shen Q (2007) Fuzzy-rough sets assisted attribute selection. IEEE Trans. Fuzzy Systems 15(1):73–89.
Kim YS, Street WN, Russell GJ, Menczer F (2005) Customer targeting: A neural network approach guided by genetic algorithms. Management Sci. 51(2):264–276.
Liaw A, Wiener M (2002) Classification and regression by random forest. R News 2(3):18–22.
Meier L, Van De Geer S, Bühlmann P (2008) The group Lasso for logistic regression. J. Roy. Statist. Soc. Ser. B Statist. Methodology 70(1):53–71.
Min F, He H, Qian Y, Zhu W (2011) Test-cost-sensitive attribute reduction. Inform. Sci. 181(22):4928–4942.
Molnar C (2020) Interpretable machine learning. Accessed September 18, 2020, https://christophm.github.io/interpretable-ml-book/simple.html.
Moro S, Laureano R, Cortez P (2011) Using data mining for bank direct marketing: An application of the CRISP-DM methodology. Novais P, Machado J, Analide C, Abelha A, eds. Proc. Eur. Simulation Model. Conf. (EUROSIS, Ostend, Belgium), 117–121.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, et al. (2011) Scikit-learn: Machine learning in Python. J. Machine Learn. Res. 12(85):2825–2830.
Ratanamahatana CA, Gunopulos D (2003) Feature selection for the naive Bayesian classifier using decision trees. Appl. Artificial Intelligence 17(5–6):475–487.
Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5):206–215.
Saar-Tsechansky M, Melville P, Provost F (2009) Active feature-value acquisition. Management Sci. 55(4):664–684.
Sakar CO, Polat SO, Katircioglu M, Kastro Y (2019) Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Comput. Appl. 31:6893–6908.
Steel E (2013) Companies scramble for consumer data. Financial Times Online (June 12), https://www.ft.com/content/f0b6edc0-d342-11e2-b3ff-00144feab7de.
Steel E, Locke C, Cadman E, Freese B (2013) How much is your personal data worth? Financial Times Online (June 12), https://ig.ft.com/how-much-is-your-personal-data-worth/.
Sugumaran V, Muralidharan V, Ramachandran KI (2007) Feature selection using decision tree and classification through proximal support vector machine for fault diagnostics of roller bearing. Mech. Systems Signal Processing 21(2):930–942.
Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. B 58(1):267–288.
Tibshirani R, Taylor J (2011) The solution path of the generalized Lasso. Ann. Statist. 39(3):1335–1371.
Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused Lasso. J. Roy. Statist. Soc. Ser. B Statist. Methodology 67(1):91–108.
Tillmanns S, Hofstede FT, Krafft M, Goetz O (2017) How to separate the wheat from the chaff: Improved variable selection for new customer acquisition. J. Marketing 81(2):99–113.
Wang X, Yang J, Teng X, Xia W, Jensen R (2007) Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Lett. 28(4):459–471.
Wedel M, Kannan PK (2016) Marketing analytics for data-rich environments. J. Marketing 80(6):97–121.
Yu G, Witten D, Bien J (2020) Controlling costs: Feature selection on a budget. Preprint, submitted October 8, 2019, https://arxiv.org/abs/1910.03627.
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J. Roy. Statist. Soc. Ser. B Statist. Methodology 68(1):49–67.
Zhang Y, Gong DW, Cheng J (2017a) Multi-objective particle swarm optimization approach for cost-based feature selection in classification. IEEE/ACM Trans. Comput. Biol. Bioinformatics 14(1):64–75.
Zhang Y, Song XF, Gong DW (2017b) A return-cost-based binary firefly algorithm for feature selection. Inform. Sci. 418:561–574.
Zhu X, Wu X (2005) Cost-constrained data acquisition for intelligent data preparation. IEEE Trans. Knowledge Data Engrg. 17(11):1542–1556.
Ziarko W (1993) Variable precision rough set model. J. Comput. System Sci. 46(1):39–59.
Zou H (2006) The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101(476):1418–1429.
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B Statist. Methodology 67(2):301–320.

Volume 69, Issue 7

July 2023

Pages 3759-4361, iii-iv

Article Information

Received: December 16, 2019
Accepted: March 02, 2022
Published Online: September 29, 2022

Cite as

Xiaoping Liu, Xiao-Bai Li, Sumit Sarkar (2022) Cost-Restricted Feature Selection for Data Acquisition. Management Science 69(7):3976-3992.

https://doi.org/10.1287/mnsc.2022.4551

Acknowledgments

The authors are grateful to the department editor, associate editor, and three anonymous reviewers for their insightful comments and suggestions that have improved the paper considerably.
