October 7, 2019 in Analytics
Beyond Cross Industry Standard Process for Data Mining
Three ways to refine CRISP-DM: an idealized and modern data science workstream
https://doi.org/10.1287/LYTX.2019.06.04
Data people come in all shapes and sizes; the modern industry analytics professional may have had his or her foundational training as a research physicist, might have cut his or her teeth optimizing ad revenue and “click” count, or may have a traditional statistics or actuarial background [1]. Surprisingly, the recent explosion in availability of data tools and expertise has not yet been accompanied by the development of firmer, contemporary frameworks for “doing” data science. Should we wait for a random search to evolve an optimal algorithm for conducting the modern data science workstream? Or are older frameworks such as CRISP-DM (Cross Industry Standard Process for Data Mining) [2] optimal and sufficiently useful models?
Probably neither. For example, while a strength of CRISP-DM has been its emphasis on the evolving cycle from business understanding to data understanding to analytic deployment, a big weakness is arguably the overly conceptual and high-level nature of the framework. Although CRISP-DM fits a world of large, occasional “data pulls” and summary models and analyses, it says little about the fundamental challenge of organizing a modern product or internal analytics codebase. The contemporary analytics professional inhabits a world not only of SQL, but of tools such as Python, R, GitHub, Jenkins and Docker.
The CRISP-DM framework also fails to sufficiently emphasize the importance of setting criteria for success at the outset of a project. Too many data science projects stall because no one has a clear idea of where one iteration of product development should end. Instead, teams are left to meander with their plethora of tools through the data understanding, data preparation and modeling phases, wasting long stretches of time and large chunks of company investment. Ask any seasoned Kaggler or industry data scientist what their favorite part of the CRISP-DM process is, and they will likely describe something between the late data preparation step and the early stages of evaluation. Thus, alongside the great value to be gained from incrementally improving these stages of model building, we believe there is substantial value to be gained from a higher-level refinement of how data scientists actually navigate this core part of the workstream.
Refine Standard
We propose that the standard process be refined in three ways. First, the macro-level process of defining the solution, deployment strategy and metrics of success (i.e., the business understanding step) should be ingrained in the workstream as a whole, and not thought of solely as the “boardroom” phase of a project. Business or deployment needs may change during the course of a project; what remains constant is the need to always find agreement on, and be aware of, the current “path to success.” While every project will obviously need a starting point and will have different stakeholder needs, it is crucial not to artificially isolate these decisions from the people and processes central to the solution (i.e., the modelers and the model building).
Second, solutions to problems can always be improved; this is both the benefit and the curse of innovation. The first major milestone on the path to success for any given project should be defining the minimum viable product. In the spirit of Occam’s razor, the simpler solution should be preferred over the more complex one, as this maximizes explainability and usability. For example, there is no need to conduct complex data and feature transformations until the simplest model has been tested. The same simplicity-first adjustment should also be applied when considering the deployment strategy and the metrics for success.
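As a minimal sketch of testing the simplest model first, the Python snippet below scores a constant "predict the mean" baseline against a success criterion fixed before any modeling begins. The threshold `TARGET_MAE`, the toy numbers and the helper names are hypothetical, chosen only for illustration; they are not part of CRISP-DM or of any particular project.

```python
from statistics import mean

# Hypothetical success criterion, agreed with stakeholders before modeling starts.
TARGET_MAE = 5.0


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def fit_mean_baseline(train_targets):
    """Simplest possible model: always predict the mean of the training targets."""
    constant = mean(train_targets)
    return lambda n: [constant] * n


# Toy data, invented for illustration only.
train_targets = [42.0, 38.0, 45.0, 40.0, 35.0]
holdout_targets = [41.0, 37.0, 44.0]

predict = fit_mean_baseline(train_targets)
baseline_mae = mean_absolute_error(holdout_targets, predict(len(holdout_targets)))

# Only invest in richer features or models if the baseline misses the target.
needs_more_complexity = baseline_mae > TARGET_MAE
print(f"baseline MAE = {baseline_mae:.2f}, needs more complexity: {needs_more_complexity}")
```

If the baseline already meets the agreed criterion, the iteration can end there; if not, the gap gives a concrete measure of how much any added complexity has to buy.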
Third, we suggest that a more precisely defined, micro-level process model is needed for the actual “doing” of a data science project. Not only does this schema need to be rigorous and hands-on (i.e., not passive or purely conceptual in nature), but it should also be clearly modular and easy to implement and extend in code. A goal here should be to define an idealized workstream, of which there are likely many possible variants. To do so, one must first come to some agreement on the order of the key modules or tasks. For example, many larger data sets are simply unnavigable without a substantial degree of initial data preparation or pre-cleaning. Thus, the data understanding step of the CRISP-DM model seems to be placed in a logically flawed position (i.e., prior to data preparation).
Instead, we propose that initial ETL work with data sets be followed by a cleaning-exploring-modeling workstream. Each of these tasks or modules has clear objectives, is logically ordered relative to neighboring modules, and is largely independent in scope (although, of course, the modules may be refined in unison across iterations). Having neatly modularized these core steps in a project codebase, a data team can much more easily complete the iterative hardening or “pushing” of code to production.
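One way such a cleaning-exploring-modeling workstream might be laid out in code is sketched below in Python. The function names (`clean`, `explore`, `model`, `run_workstream`), the record fields and the toy data are all assumptions made for illustration; a real project would likely place each module in its own file and work on far larger data.

```python
from statistics import mean

# Hypothetical raw records, as they might arrive from an upstream ETL step.
RAW_ROWS = [
    {"spend": "10", "revenue": "25"},
    {"spend": "20", "revenue": "45"},
    {"spend": "", "revenue": "30"},  # incomplete record, dropped by clean()
    {"spend": "30", "revenue": "65"},
]


def clean(rows):
    """Cleaning module: drop incomplete records and cast fields to float."""
    return [
        {"spend": float(r["spend"]), "revenue": float(r["revenue"])}
        for r in rows
        if r["spend"] and r["revenue"]
    ]


def explore(rows):
    """Exploring module: summarize the cleaned data without modifying it."""
    return {
        "n_rows": len(rows),
        "mean_spend": mean(r["spend"] for r in rows),
        "mean_revenue": mean(r["revenue"] for r in rows),
    }


def model(rows):
    """Modeling module: fit a one-variable least-squares line to the cleaned data."""
    xs = [r["spend"] for r in rows]
    ys = [r["revenue"] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def run_workstream(raw_rows):
    """Run the modules in order: clean first, then explore and model."""
    cleaned = clean(raw_rows)
    return explore(cleaned), model(cleaned)


summary, fitted = run_workstream(RAW_ROWS)
print(summary)
print(fitted)
```

Because each module takes and returns plain data, any one of them can be refined, tested or hardened for production without touching its neighbors, and the exploring step always operates on data that has already been cleaned.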
As things stand, process models for doing modern data science are scarce, and those that exist (e.g., CRISP-DM) fail to consider the nuts-and-bolts coding work in detail. The image of miraculously prepared and well-understood data being lumped together into a model is not far off the current idealized view. This is not helped by typical industry descriptions of modeling, which often pay far more attention to raw predictive performance than to other, more qualitative and arguably more important benefits of different types of models (e.g., deployability).
Of course, this emphasis works when a company is trying to sell a product that nobody understands, but it does not create a constructive environment for developing substantive innovations. Once the proper “path to success” is identified and refined alongside the development of individual modules, a more elaborate exploration of data can be performed, leading to much more meaningful results and more impactful business understanding.
References
Stuart Jackson is a Ph.D.-trained data scientist with more than 10 years of experience working with advanced analytics in both high-end research and industry domains. He is a former IBMer and Fellow of Insight Data Science. Sean Hegarty is a data scientist with seven years of experience applying machine learning and AI solutions to a variety of domains. He currently works with IBM Watson Health to apply these techniques in healthcare.