February 4, 2019 in Software Survey: Statistical Analysis

Statistics: Reflecting an uncertain world

Data collection and statistics have always been an integral part of operations research.

https://doi.org/10.1287/orms.2019.01.12

The application of scientific methods to military operations in WWII came to be known as operations research (O.R.). This was one among the many applications of science in what was a war of technology. The needs were great, and any edge that could be obtained in such vital areas as air defense, improving the success of shipping to Great Britain, or the detection and destruction of attacking submarines would make an important contribution.

To be credible, the improvements would have to be quantifiable to convince military staffs of the need for change in doctrine and operations and to justify the investment. Any analysis would have to be grounded in data to support its recommendations and later demonstrate their effectiveness. At its inception, O.R. had to be empirical and grounded in applications. Thus, it should be no surprise that statisticians such as Frank Yates and Abraham Wald were among the contributors to this new field.

In his remembrance of the wartime efforts, Philip Morse (1977 [1]) noted the importance that was placed on having technical people collect the technical data – to be able to determine the critical factors and the relevant data. This insistence on direct observations sometimes resulted in controversy. For instance, field O.R. teams confirmed that tactical air attacks under the direction of forward controllers were very effective in suppressing enemy vehicle movements and damaging enemy forces.

These teams also discovered that the number of direct hits by the tactical air forces in Normandy was far lower than the air force claimed. The terror of the attacks led to many vehicles and tanks being abandoned rather than destroyed. This finding ran counter to the dogma of the air leaders of the time, who were at pains to establish an independent air force. Reassessments of the battle damage confirmed the initial ground assessments about the limited number of direct hits by aircraft.

Given the success of their work during the war, the proponents of operations research were confident in asserting that their methods of analysis should apply to diverse areas of operations in civilian life. From the beginning, analysts used the tools of probability and statistics: the former as the basis for modeling and prediction, and the latter as the methodology of summarizing, interpreting and quantifying uncertainty associated with the raw data. Statistics may be used to decide among possible probability models or estimate the parameters of the models.

Statistics remain an important part of applied O.R. as the basis for data collection and analysis, as well as the basis for experimentation needed to isolate critical factors affecting outcomes. Given the prevalence and fidelity of simulations, experimental design of simulated alternatives is an important area of application as well.

Some Statistical Challenges

Search was among the problems tackled by O.R. analysts in WWII. Whether optimizing a search pattern to locate downed fliers or enemy shipping, this was an early analysis based on probabilistic models combined with observations from the field. Variants of this type of problem are still of interest to policymakers and are still addressed with statistical analysis. For instance, among billions of financial transactions, how do you detect the rare fraudulent ones or those related to money laundering?

A census is an ancient example of statistics. The direct enumeration of a population, and perhaps of some aspects of that population such as its property, would have been a necessity for empires to assess their population and resources. Historically, such an enumeration might serve as the basis of taxation. Now, of course, census information is valuable for businesses and governments, and the accuracy of the count may be a huge issue for municipalities and states. Much operational planning is needed during the time of a modern census to ensure complete enumerations in spite of various problems such as transient populations and illegal aliens.

Some types of enumeration are not yet obtainable with reliability. Marches and protests are now often taken as indicators of popular support and potentially political strength. Direct enumeration is not feasible, of course, and estimates may be biased by whether the estimator is supportive of or opposed to the crowd’s goal. Typical approaches combine a rough estimate of the occupied area with sampled crowd density values. The relative error may be no better than 10 percent. Variations of this problem include estimation of wildlife populations.
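The area-times-density approach can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular agency's method: the function name `estimate_crowd`, the 20,000 square meter area and the six density readings are all hypothetical, and the standard error shown reflects only the variability in the sampled densities, not any error in measuring the area.

```python
import math
import statistics

def estimate_crowd(area_m2, density_samples):
    """Estimate crowd size as occupied area times mean sampled density.

    Returns (estimate, standard_error). The standard error covers only
    sampling variability in density, not error in the area measurement.
    """
    mean_density = statistics.mean(density_samples)
    # Standard error of the mean density, from the sample standard deviation.
    se_density = statistics.stdev(density_samples) / math.sqrt(len(density_samples))
    return area_m2 * mean_density, area_m2 * se_density

# Hypothetical inputs: a 20,000 m^2 area and six sampled densities (people per m^2).
samples = [1.8, 2.4, 2.1, 1.5, 2.7, 2.1]
estimate, std_error = estimate_crowd(20_000, samples)
relative_error = std_error / estimate
print(f"estimate: {estimate:,.0f} people, relative standard error: {relative_error:.1%}")
```

Because the density varies widely across a crowd, the relative error shrinks only slowly as more density samples are taken, which is consistent with the modest accuracy noted above.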

An even more challenging problem is to estimate populations that are hidden, such as attempting to determine the number of trafficking victims. The challenge is to infer the total numbers in a larger population from the number of victims that are actually detected.
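The article does not name a method for this inference. One classical approach, long used for wildlife populations and adapted in various forms to hidden human populations, is two-sample capture-recapture: compare two lists of detected individuals and use their overlap to estimate the total. The sketch below uses Chapman's bias-corrected form of the Lincoln-Petersen estimator. It assumes a closed population and two independent lists, assumptions that are often doubtful for hidden populations, and the counts are hypothetical.

```python
def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected form of the Lincoln-Petersen estimator.

    n1: individuals on the first list
    n2: individuals on the second list
    m:  individuals appearing on both lists
    """
    if m < 0 or m > min(n1, n2):
        raise ValueError("overlap must lie between 0 and min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical counts: 200 on list A, 150 on list B, 30 on both.
total = chapman_estimate(200, 150, 30)
print(f"estimated population: {total:.0f}")
```

With these hypothetical counts only 320 distinct individuals appear on either list, yet the small overlap implies a total several times larger. The smaller the overlap relative to the list sizes, the larger the inferred hidden population.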

Data Centric World

One characteristic of the modern era is the sheer amount of data being collected. We live in a data-intensive world. We track our fitness progress using a variety of devices and apps. Even a simple running app can tell us where we were, what our elevation was and how fast we were moving along the way. More advanced apps can track additional measures such as heart rate and breathing. Our cell phones track our location and other activities, with varying degrees of privacy. Our homes are increasingly shared with smart devices and assistants that observe our activities, answer our questions or play our music. All of this information is potentially available to interested parties seeking to influence our behavior or anticipate our desires.

In this era of big data, computer-intensive and statistically-based methods may be used to help us find suitable products, help us in emergencies or protect us from fraud. Information obtained from cell phones can be used to predict traffic flow and signal when congestion occurs by observing the speed of cell phones associated with the roadways.

Big science is increasingly a matter of big data. As examples, particle physics and astronomy are increasingly data intensive. In both cases statistical methods are used to search for particular rare events, such as star occlusions that might signify a planet orbiting a distant sun, or to identify the decay products or particle tracks predicted by a revised theory of subatomic particles. In the case of gravitational waves, Bayesian methods were used with collateral data to pinpoint the region of space likely to contain the source of the combined signals.

Statistical Practice

As noted in the last survey (Swain, 2017 [2]), the practice of using p-values as the primary measure of significance has come into disrepute, and several journals have taken steps to either reduce its importance or ban its use altogether. Too many significant results obtained in this way could not be reproduced by other experimenters. It is known, of course, that among a large number of experiments some p-values are likely to be small purely by chance, and the same problem occurs when many measures are tested within a single experiment. False positives are particularly problematic in the pharmaceutical industry, where small experiments are often the basis for selecting promising agents for further development and testing.
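The multiple-comparisons effect is easy to quantify under simplifying assumptions. If m independent tests are run when no real effect exists, and each uses significance level alpha, the chance that at least one p-value falls below alpha is 1 - (1 - alpha)^m. The Python sketch below computes this and checks it by simulation. The function names are illustrative, and the simulation relies on the fact that a p-value from a continuous test statistic is uniformly distributed on [0, 1] when the null hypothesis is true.

```python
import random

def prob_any_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent true-null tests has p < alpha."""
    return 1 - (1 - alpha) ** num_tests

def simulate(num_tests, alpha=0.05, trials=20_000, seed=1):
    """Monte Carlo check: draw uniform p-values and count trials with any p < alpha."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(num_tests))
        for _ in range(trials)
    )
    return hits / trials

# Compare the formula with the simulation for several numbers of tests.
for m in (1, 5, 20, 100):
    print(m, round(prob_any_false_positive(m), 3), round(simulate(m), 3))
```

With 20 independent tests at the 0.05 level, the chance of at least one spurious "significant" result is already about 64 percent, which helps explain why screening many candidates with small experiments produces findings that fail to replicate.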

Experimental reproducibility is a growing concern in fields outside of statistics as well, such as computational science. The ideal is to provide (when possible) complete data and a full accounting of all processing steps so that an experiment can be directly verified by others.

Products

This year’s survey of products is an update of the survey published in 2017. The biennial statistical software products survey for this issue provides capsule information about 12 products selected from eight vendors as of the deadline. The tools range from general tools that cover the important techniques of inference and estimation to specialized tools for activities such as nonlinear regression, forecasting and design of experiments. The product information contained in the survey was obtained from product vendors and is summarized in online tables to highlight general features, capabilities and computing requirements, as well as contact information. Many of the vendors have their own websites for further, detailed information, and many provide demonstration programs that can be downloaded from these sites. No attempt is made to evaluate or rank the products, and the information provided comes from the vendors themselves. Vendors unable to make the publishing deadline will be added to the online survey. (See editor’s note below for URLs.)

Products that provide statistical add-ins for use with spreadsheets remain popular and give spreadsheets enhanced, specialized capabilities. The spreadsheet is the primary computational tool in a wide variety of settings, familiar and accessible to all. Many procedures of data summarization, estimation, inference, basic graphics and even regression modeling can be added to spreadsheets in this way. Examples are the SigmaXL and XLSTAT add-ins for Microsoft Excel.

Dedicated general and special purpose statistical software generally offers a wider variety and greater depth of analysis than is available in add-in software. For many specialized techniques such as forecasting, design of experiments and so forth, a statistical package is appropriate. In general, statistical software plays a distinct role on the analyst’s desktop, and provided the data can be freely exchanged among applications, each part of an analysis can be performed with the most appropriate (or convenient) software tool.

An important feature of statistical programs is the importation of data from as many sources as possible, to eliminate the need for data entry when data are already available from another source. Most programs have the ability to read from spreadsheets and selected data storage formats.

References

  1. Philip M. Morse, 1977, “In at the Beginning: A Physicist’s Life,” MIT Press.
  2. James J. Swain, 2017, “The joys and perils of statistics,” OR/MS Today, Vol. 44, No. 1 (February 2017).

Editor’s note: A directory of vendors who participated in this year’s statistical analysis software survey is available online, along with summarized tables of the collected data including tool descriptions and capabilities, pricing and general information (https://pubsonline.informs.org/magazine/orms-today/2019-statistical-analysis-software-survey). Vendors who did not respond by the deadline can fill out and return the survey questionnaire (https://www.surveymonkey.com/r/6N2DP28), and their products will be added to the online listings.

James J. Swain

This article appears in INFORMS Analytics Collections Vol. 14: Harnessing Value Through Streaming Data Analytics.
