IBM SPSS Modeler is a data mining workbench that enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions on hard data, not hunches or guesswork. IBM SPSS Modeler Cookbook takes you beyond the basics and shares the tips, the timesavers, and the workarounds that experts use to increase productivity and extract maximum value from data. The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this book, you are learning from practitioners who have helped define the state of the art.
Follow the industry standard data mining process, gaining new skills at each stage, from loading data to integrating results into everyday business practices. Get a handle on the most efficient ways of extracting data from your own sources and preparing it for exploration and modeling. Master the best methods for building models that will perform well in the workplace. Go beyond the basics and get the full power of your data mining workbench with this practical guide.
This opening chapter covers data understanding, but that is not the first phase of CRISP-DM. Business understanding is a critical phase. Some would argue, including the authors of this book, that business understanding is the phase in most need of more attention by new data miners. It is certainly a candidate for the phase that is most rushed, albeit rushed at the peril of the data mining project. However, since this book is focused on specific software tasks and recipes, and since business understanding is conducted in the meeting room, not alone at one's laptop, our discussion of this phase is placed in a special section of the book. If you are new to data mining, please do read the business understanding section first (see Appendix, Business Understanding), and consider reading the CRISP-DM document in its entirety, as it will place our recipes in a broader context.
The CRISP-DM document describes the data understanding phase as starting with initial data collection and proceeding with activities to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information. CRISP-DM lists the following tasks as a part of the data understanding phase: collect initial data, describe data, explore data, and verify data quality. In this chapter we will introduce some of the IBM SPSS Modeler nodes associated with these tasks, as well as nodes that one might associate with other phases but that can prove useful during data understanding. Since the recipes are oriented around software tasks, there is a particular focus on exploring data and on data quality. Many of these recipes could be done immediately after accessing your data for the first time. Some of the hard work that follows will be inspired by what you uncover using these recipes.
The very first task you will need to do when data mining is to determine the size and nature of the data subset that you will be working with. This might involve sampling or balancing (a special kind of sampling) or both, but should always be thoughtful.
Why sample? When you have plentiful data, a powerful computer, and equally powerful software, why not use every bit of that? There was a time when one of the most popular concepts in data mining was to put an end to sampling. If the objective of data mining was to give business people the power to make discoveries from data independently, then it made sense to reduce the number of steps in any way possible. As computers and computer memory became less expensive, it seemed that sampling was a waste of time. And then, there was the idea of finding a valuable and elusive bit of information in a mass of data. This image was so powerful that it inspired the name for a whole field of study: data mining. To eliminate any data from the working dataset was to risk losing treasured insights.
Times change, and so have the attitudes of the data mining community. For one thing, many of today's data miners began in more traditional data analyst roles, and were familiar with classical statistics before they entered data mining. These data miners don't want to be without the full set of methods that they have used earlier in their careers.
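The chapter introduction notes that choosing a working subset might involve sampling, balancing, or both. Outside of Modeler, the two ideas can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `customers` records, the `churned` field, and both helper functions are hypothetical, and balancing is shown here as downsampling every class to the size of the rarest one, which is just one possible strategy and is not a description of how Modeler's own nodes work.

```python
import random
from collections import defaultdict


def random_sample(rows, fraction, seed=0):
    """Simple random sample: keep each row with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def balance_by_downsampling(rows, target_key, seed=0):
    """Balance classes by downsampling each one to the rarest class's size."""
    rng = random.Random(seed)

    # Group the rows by the value of the target field.
    groups = defaultdict(list)
    for row in rows:
        groups[row[target_key]].append(row)

    # Every class is reduced to the size of the smallest class.
    smallest = min(len(group) for group in groups.values())

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))

    rng.shuffle(balanced)  # avoid leaving the rows ordered by class
    return balanced


# Hypothetical data: 1,000 customers, 5% of whom churned.
customers = [{"id": i, "churned": i % 20 == 0} for i in range(1000)]

sample = random_sample(customers, fraction=0.10)
balanced = balance_by_downsampling(customers, target_key="churned")
```

In this sketch the random sample keeps roughly a tenth of the rows and preserves the original 5% churn rate on average, while the balanced set contains equal numbers of churned and non-churned customers. That difference is why balancing is described as a special kind of sampling: it deliberately changes the class proportions rather than preserving them.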