Research Note—Generating Shareable Statistical Databases for Business Value: Multiple Imputation with Multimodal Perturbation

Published Online: https://doi.org/10.1287/isre.1110.0361
