Data mining is the search for useful relationships in large datasets. "When you mine data (by 'drilling down'), you use data to improve your business by predicting and understanding behavior." (Peter Frometa, SPSS Inc., 2001)
According to a press release, "in May 1998, more than 20 key players in the data mining market met to discuss the first draft of a new process model, CRISP-DM ('CRoss-Industry Standard Process for Data Mining'). This is designed to help businesses plan and work through the complete data mining process - from problem specification to deployment of results. The core consortium consists of NCR, ISL, Daimler-Benz and OHRA. At the centre of the CRISP-DM project is a Special Interest Group (SIG) of data mining service suppliers and large-scale commercial users."
Under CRISP-DM, data mining employs a six-phase approach to extracting meaning from business data. This parallels Rasch-based approaches to measurement construction in the social sciences. The "Data Cleaning" Table below focusses on the data cleaning component of data mining. Data cleaning is in marked contrast to the conventional "data is inviolable" approach of social science research.
The Figure shows the six phases of a data mining process. The sequence of the phases is not rigid. Moving back and forth between different phases is always required. It depends on the outcome of each phase which phase, or which particular task of a phase, has to be performed next. The arrows indicate the most important and frequent dependencies between phases.
The outer circle symbolizes the cyclical nature of data mining itself. Data mining is not over once a solution is deployed. The lessons learned during the process, and from the deployed solution, can trigger new, often more focused business questions. Subsequent data mining processes will benefit from the experiences of previous ones. In the following, we outline each phase briefly:
1. Business understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
2. Data understanding
starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets with hidden information.
3. Data preparation
constructs the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modeling tools.
4. Modeling
selects and applies modeling techniques and calibrates their parameters to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.
5. Evaluation
thoroughly reviews the model and the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. A decision on the use of the data mining results should be reached.
6. Deployment
organizes and presents the knowledge gained in a way that the customer can use it. It often involves applying "live" models within an organization's decision making processes, for example in real-time personalization of Web pages or repeated scoring of marketing databases. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. In many cases it is the customer, not the data analyst, who carries out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up-front what actions need to be carried out in order to actually make use of the created models.
Excerpted from CRISP-DM 1.0 Step-by-step data mining guide (2000)
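The non-rigid phase sequence described above can be sketched as a small transition map. This is only an illustration, not part of CRISP-DM itself: the arrows listed are assumed from the commonly reproduced CRISP-DM diagram (the Figure is not shown here), the Deployment-to-Business-understanding link is one way of modeling the outer circle, and the names `Phase`, `FREQUENT_MOVES`, `is_frequent_move` and `unusual_moves` are hypothetical. Because the sequence is not rigid, moves outside the map are flagged for attention rather than forbidden.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumed arrows ("most important and frequent dependencies"),
# taken from the commonly reproduced CRISP-DM diagram.
FREQUENT_MOVES = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    # Modeling choice for the outer circle: a deployed solution
    # can trigger a new business question and a new cycle.
    Phase.DEPLOYMENT: {Phase.BUSINESS_UNDERSTANDING},
}


def is_frequent_move(current, nxt):
    """True if the move follows one of the assumed diagram arrows."""
    return nxt in FREQUENT_MOVES[current]


def unusual_moves(path):
    """Return the consecutive moves in `path` that follow no assumed arrow."""
    return [(a, b) for a, b in zip(path, path[1:]) if not is_frequent_move(a, b)]


# A project that steps back from Modeling to Data preparation once.
project = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
]
print(unusual_moves(project))
```

For the project above, every move follows an assumed arrow, so the printed list is empty. A jump such as Data understanding straight to Modeling would be reported as unusual, which is a prompt to check whether data preparation was skipped.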
"Data Cleaning" | |
Task | Clean data: Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by data modeling. |
Output | Data cleaning report: Describe the decisions and actions that were taken to address the data quality problems. The report should also address what data quality issues are still outstanding and what possible effects they could have on the results. |
Activities | Reconsider how to deal with observed types of noise. Correct, remove or ignore noise. Decide how to deal with special values and their meaning. Reconsider data selection criteria in light of experiences of data cleaning (i.e., one may wish to include/exclude other sets of data). |
Good Idea! | Remember that some fields may be irrelevant to the data mining goals and therefore noise in those fields has no significance. However, if noise is ignored for these reasons, it should be fully documented as the circumstances may change later! |
Excerpted from CRISP-DM 1.0 Step-by-step data mining guide (2000) |
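The three cleaning options named in the Task row, and the report named in the Output row, can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not a procedure prescribed by CRISP-DM: records are plain dictionaries with `None` marking a missing value, the field names `age` and `income` are hypothetical, and a straight-line least-squares fit stands in for "estimation of missing data by data modeling". Each function returns new records and leaves the raw data untouched, so the report can compare before and after.

```python
from statistics import mean


def select_clean_subset(records, fields):
    """Keep only records with no missing value in the given fields."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


def insert_default(records, field, default):
    """Return copies of the records with missing `field` set to `default`."""
    return [
        {**r, field: default if r.get(field) is None else r[field]}
        for r in records
    ]


def estimate_missing(records, field, predictor):
    """Fill missing `field` from a least-squares line fitted on `predictor`."""
    known = [
        (r[predictor], r[field])
        for r in records
        if r.get(field) is not None and r.get(predictor) is not None
    ]
    if len(known) < 2:
        return [dict(r) for r in records]  # too little data to fit a line
    mx = mean(x for x, _ in known)
    my = mean(y for _, y in known)
    sxx = sum((x - mx) ** 2 for x, _ in known)
    slope = sum((x - mx) * (y - my) for x, y in known) / sxx if sxx else 0.0
    intercept = my - slope * mx
    filled = []
    for r in records:
        if r.get(field) is None and r.get(predictor) is not None:
            r = {**r, field: intercept + slope * r[predictor]}
        filled.append(dict(r))
    return filled


def cleaning_report(before, after, fields):
    """Count missing values per field before and after cleaning."""
    def missing(recs, f):
        return sum(1 for r in recs if r.get(f) is None)
    return {
        f: {"missing_before": missing(before, f), "missing_after": missing(after, f)}
        for f in fields
    }


# Hypothetical raw data: one missing income, one missing age.
records = [
    {"age": 20, "income": 30.0},
    {"age": 30, "income": 40.0},
    {"age": 40, "income": None},
    {"age": None, "income": 35.0},
]
filled = estimate_missing(records, "income", "age")
print(cleaning_report(records, filled, ["age", "income"]))
```

In this example the report shows that the missing income has been estimated while the missing age is still outstanding, which is the kind of residual issue the data cleaning report is meant to document.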
Data Mining (CRISP-DM) | Rasch Measurement |
---|---|
1. Business Understanding: Determine business objectives. Determine data mining goals. "You must have a clear idea of what success would be." | 1. Conceptualize the latent variable: What to measure? How to do it? What marks success? |
2. Data Understanding: "Do the data match your objectives?" | 2. Collect relevant data |
3. Data Preparation: Select data. Clean data. Reconstruct data. | 3. Organize data: Select data. Orient data. Rescore data. |
4. Modeling: Build model. Assess model. | 4. Construct measures: Select measurement model. Explicable data fit? Refine model. |
5. Evaluation: Evaluate results. Review process. "Can results be repeated and verified by someone else?" | 5. Evaluate results: Meaningful construct? Useful measures? Reproducible results? |
6. Deployment: Report. "Communicate! Impress! Compel!" Activate. | 6. Utilize measures: Reporting. Decision making. Knowledge building. |
Data Mining and Rasch Measurement CRISP-DM, Linacre J.M. Rasch Measurement Transactions, 2001, 15:2 p. 826-7
The URL of this page is www.rasch.org/rmt/rmt152f.htm