This Tutorial on Data Mining Process Covers Data Mining Models, Steps and Challenges Involved in the Data Extraction Process:
Data Mining, which is also known as Knowledge Discovery in Databases (KDD), is the process of discovering useful information from large volumes of data stored in databases and data warehouses. Companies use this analysis to support decision-making.
Data mining is carried out using techniques such as clustering, association, sequential pattern analysis, and decision trees.
What You Will Learn:
- What Is Data Mining?
- Data Extraction As A Process
- Data Mining Models
- Steps In The Data Mining Process
- Data Mining Process In Oracle DBMS
- Data Mining Process In Datawarehouse
- What Are The Applications of Data Extraction?
- Data Mining Challenges
What Is Data Mining?
Data Mining is a process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the web, and other information repositories or data that are streamed into the system dynamically.
Why Do Businesses Need Data Extraction?
With the advent of Big Data, data mining has become more prevalent. Big data refers to extremely large data sets that computers can analyze to reveal patterns, associations, and trends that humans can understand. Such data sets hold extensive information of varied types and content.
With this much data, simple statistics with manual intervention no longer work. The data mining process fills this need, which has led to a shift from simple data statistics to complex data mining algorithms.
The data mining process extracts relevant information from raw data such as transactions, photos, videos, and flat files, and automatically processes it to generate reports that businesses can act on.
Thus, the data mining process is crucial for businesses to make better decisions by discovering patterns and trends in data, summarizing the data, and extracting relevant information.
Data Extraction As A Process
For any business problem, the raw data is examined to build a model that describes the information and produces the reports the business will use. Building a model from data sources and data formats is an iterative process, because the raw data comes from many different sources and in many forms.
Data grows day by day, so when a new data source is found, it can change the results.
Below is the outline of the process.
Data Mining Models
Many industries, such as manufacturing, marketing, chemical, and aerospace, are taking advantage of data mining. As a result, the demand for standard and reliable data mining processes has increased drastically.
The important data mining models include:
#1) Cross-Industry Standard Process for Data Mining (CRISP-DM)
CRISP-DM is a reliable data mining model consisting of six phases. It is a cyclical process that provides a structured approach to data mining. The sequence of the six phases is not rigid: moving back and forth between phases, and repeating earlier actions, is often required.
The six phases of CRISP-DM include:
#1) Business Understanding: In this step, the goals of the business are set and the important factors that will help in achieving those goals are identified.
#2) Data Understanding: This step collects all the data and loads it into the tool (if a tool is being used). The data is listed with its source, location, how it was acquired, and any issues encountered. The data is then visualized and queried to check its completeness.
#3) Data Preparation: This step involves selecting the appropriate data, cleaning it, constructing attributes from it, and integrating data from multiple databases.
#4) Modeling: In this step, a data mining technique such as a decision tree is selected, a test design is generated for evaluating the selected model, models are built from the dataset, and the built model is assessed with experts to discuss the results.
#5) Evaluation: This step will determine the degree to which the resulting model meets the business requirements. Evaluation can be done by testing the model on real applications. The model is reviewed for any mistakes or steps that should be repeated.
#6) Deployment: In this step, a deployment plan is made, a strategy is formed to monitor and maintain the data mining results and check their usefulness, final reports are produced, and the whole process is reviewed to catch any mistakes and see whether any step should be repeated.
#2) SEMMA (Sample, Explore, Modify, Model, Assess)
SEMMA is another data mining methodology developed by SAS Institute. The acronym SEMMA stands for sample, explore, modify, model, assess.
SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, build a model from those variables to predict outcomes, and check the model's accuracy. SEMMA is also driven by a highly iterative cycle.
Steps in SEMMA
- Sample: In this step, a large dataset is extracted and a sample that represents the full data is taken out. Sampling will reduce the computational costs and processing time.
- Explore: The data is explored for outliers and anomalies to gain a better understanding of it. The data is checked visually to find trends and groupings.
- Modify: In this step, the data is manipulated, for example by grouping and subgrouping, keeping in focus the model to be built.
- Model: Based on the explorations and modifications, the models that explain the patterns in data are constructed.
- Assess: The usefulness and reliability of the constructed model are assessed in this step. Testing of the model against real data is done here.
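The Sample step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset is made up, and `draw_sample` is a hypothetical helper, not part of any SAS tool.

```python
import random
import statistics

def draw_sample(records, size, seed=0):
    """Return a simple random sample of `size` records (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(records, size)

# Hypothetical full dataset: 10,000 order values between 0 and 499.
population = [float(v % 500) for v in range(10_000)]
sample = draw_sample(population, size=500)

# A representative sample should have summary statistics close to the full data,
# while being much cheaper to explore and model.
print("population mean:", statistics.mean(population))
print("sample mean:    ", statistics.mean(sample))
```

Comparing the two means is a quick sanity check that the sample still represents the full data before moving on to the Explore step.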
Both the SEMMA and CRISP-DM approaches work for the knowledge discovery process. Once models are built, they are deployed for business and research work.
Steps In The Data Mining Process
The data mining process is divided into two parts i.e. Data Preprocessing and Data Mining. Data Preprocessing involves data cleaning, data integration, data reduction, and data transformation. The data mining part performs data mining, pattern evaluation and knowledge representation of data.
Why do we preprocess the data?
Many factors determine the usefulness of data, such as accuracy, completeness, consistency, and timeliness. Data has quality if it satisfies its intended purpose. Thus preprocessing is crucial in the data mining process. The major steps involved in data preprocessing are explained below.
#1) Data Cleaning
Data cleaning is the first step in data mining. It is important because dirty data, if used directly in mining, can confuse the mining procedures and produce inaccurate results.
Basically, this step involves removing noisy or incomplete data from the collection. Many methods that clean data automatically are available, but they are not robust.
This step carries out the routine cleaning work by:
(i) Fill The Missing Data:
Missing data can be handled by methods such as:
- Ignoring the tuple.
- Filling in the missing value manually.
- Using a measure of central tendency, such as the mean or median, to fill in the missing value.
- Filling in the most probable value.
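The options above can be sketched as follows. The records and column names are hypothetical, and using the most frequent value is only a simple stand-in for "the most probable value", which in practice is often inferred with regression or another predictive model.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"id": 1, "age": 25, "city": "Pune"},
    {"id": 2, "age": None, "city": "Delhi"},
    {"id": 3, "age": 40, "city": None},
    {"id": 4, "age": 31, "city": "Pune"},
]

def drop_incomplete(rows):
    """Ignore the tuple: keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_with_median(rows, column):
    """Fill missing numeric values with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    return [{**r, column: median if r[column] is None else r[column]} for r in rows]

def fill_with_mode(rows, column):
    """Fill missing values with the most frequent observed value (a simple proxy)."""
    observed = [r[column] for r in rows if r[column] is not None]
    mode = statistics.mode(observed)
    return [{**r, column: mode if r[column] is None else r[column]} for r in rows]

print(drop_incomplete(rows))          # rows 1 and 4 only
print(fill_with_median(rows, "age"))  # row 2 gets age 31
print(fill_with_mode(rows, "city"))   # row 3 gets city "Pune"
```

Ignoring tuples is the simplest choice but throws data away, so filling is usually preferred when many rows have only one or two missing attributes.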
(ii) Remove The Noisy Data: Noise is a random error or variance in a measured variable.
Methods to remove noise are:
Binning: Binning methods sort the values and distribute them into buckets, or bins. Each value is then smoothed by consulting its neighbors, that is, the other values in the same bin.
Smoothing can be done in three ways. In smoothing by bin means, each value in a bin is replaced by the mean of the bin. In smoothing by bin medians, each value in a bin is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.
- Identifying the Outliers
- Resolving Inconsistencies
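The binning methods described above can be sketched as follows. The price values and the bin size of three are illustrative assumptions, and the bins are equal-frequency bins (each holds the same number of values).

```python
import statistics

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[statistics.mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[statistics.median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```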
#2) Data Integration
When multiple heterogeneous data sources such as databases, data cubes or files are combined for analysis, this process is called data integration. This can help in improving the accuracy and speed of the data mining process.
Different databases use different naming conventions for variables, which causes redundancies in the combined data. Additional data cleaning can be performed to remove the redundancies and inconsistencies introduced by data integration without affecting the reliability of the data.
Data integration can be performed using tools such as Oracle Data Service Integrator and Microsoft SQL Server.
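To illustrate the naming-convention problem, here is a minimal sketch that maps two sources onto a common schema before merging them. The sources, field names, and `FIELD_MAP` are all hypothetical; real integration tools handle this with far richer metadata.

```python
# Two hypothetical sources that describe the same customers with different field names.
crm_rows = [
    {"cust_id": 1, "cust_name": "Asha"},
    {"cust_id": 2, "cust_name": "Ravi"},
]
billing_rows = [
    {"CustomerID": 1, "Name": "Asha", "Balance": 120.0},
    {"CustomerID": 3, "Name": "Meera", "Balance": 75.5},
]

# Assumed mapping from each source's field names to one common schema.
FIELD_MAP = {
    "cust_id": "customer_id", "CustomerID": "customer_id",
    "cust_name": "name", "Name": "name",
    "Balance": "balance",
}

def to_common_schema(row):
    """Rename a row's fields to the common schema (unknown fields are kept as-is)."""
    return {FIELD_MAP.get(key, key): value for key, value in row.items()}

def integrate(*sources):
    """Merge rows from all sources, keeping one record per customer_id."""
    merged = {}
    for source in sources:
        for row in source:
            unified = to_common_schema(row)
            merged.setdefault(unified["customer_id"], {}).update(unified)
    return [merged[key] for key in sorted(merged)]

for record in integrate(crm_rows, billing_rows):
    print(record)
```

Customer 1 appears in both sources but ends up as a single record, which is the redundancy removal the text describes.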
#3) Data Reduction
This technique is applied to obtain the data relevant for analysis from the full collection. The reduced representation is much smaller in volume while maintaining the integrity of the original data. Methods such as Naive Bayes, decision trees, and neural networks can be used to support data reduction.
Some strategies of data reduction are:
- Dimensionality Reduction: Reducing the number of attributes in the dataset.
- Numerosity Reduction: Replacing the original data volume by smaller forms of data representation.
- Data Compression: Compressed representation of the original data.
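The three strategies above can each be illustrated with a very small example. These are deliberately simple stand-ins chosen for this sketch (dropping constant attributes, equal-width histograms, and lossless zlib compression); the records are made up, and production systems typically use more sophisticated methods.

```python
import json
import zlib
from collections import Counter

# Hypothetical records; "country" never varies, so it carries no information here.
rows = [
    {"age": 23, "income": 31000, "country": "IN"},
    {"age": 35, "income": 52000, "country": "IN"},
    {"age": 47, "income": 68000, "country": "IN"},
    {"age": 52, "income": 61000, "country": "IN"},
]

def drop_constant_attributes(rows):
    """Dimensionality reduction: remove attributes that take only one value."""
    keep = [key for key in rows[0] if len({r[key] for r in rows}) > 1]
    return [{key: r[key] for key in keep} for r in rows]

def histogram(values, width):
    """Numerosity reduction: replace raw values with equal-width bucket counts."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

def compress(rows):
    """Data compression: lossless zlib encoding of the serialized rows."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))

print(drop_constant_attributes(rows))                # "country" is gone
print(histogram([r["age"] for r in rows], width=20)) # {20: 2, 40: 2}
restored = json.loads(zlib.decompress(compress(rows)).decode("utf-8"))
print(restored == rows)                              # True: nothing was lost
```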
#4) Data Transformation
In this process, data is transformed into a form suitable for the data mining process. Data is consolidated so that the mining process is more efficient and the patterns are easier to understand. Data transformation involves data mapping and a code generation process.
Strategies for data transformation are:
- Smoothing: Removing noise from data using clustering, regression techniques, etc.
- Aggregation: Summary operations are applied to data.
- Normalization: Scaling of data to fall within a smaller range.
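- Discretization: Raw values of a numeric attribute are replaced by intervals. For example, age values can be replaced by ranges such as 0-10 and 11-20, or by labels such as youth, adult, and senior.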
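Three of these strategies can be sketched as follows. The cut points for age, the sample values, and the function names are assumptions made for this illustration. Normalization here uses min-max scaling, which is one common choice among several.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:  # all values identical: map everything to new_min
        return [new_min for _ in values]
    return [new_min + (v - low) / span * (new_max - new_min) for v in values]

def discretize_age(age):
    """Discretization: replace a raw age with an interval label (assumed cut points)."""
    if age < 18:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"

def aggregate_monthly(daily_sales):
    """Aggregation: sum (date, amount) pairs into monthly totals."""
    totals = {}
    for date, amount in daily_sales:
        month = date[:7]  # "YYYY-MM" prefix of an ISO date string
        totals[month] = totals.get(month, 0) + amount
    return totals

print(min_max_normalize([20, 30, 40, 60]))         # [0.0, 0.25, 0.5, 1.0]
print([discretize_age(a) for a in (12, 35, 71)])   # ['youth', 'adult', 'senior']
print(aggregate_monthly([("2024-01-05", 100), ("2024-01-20", 50), ("2024-02-01", 70)]))
# {'2024-01': 150, '2024-02': 70}
```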
#5) Data Mining
Data mining is the process of identifying interesting patterns and knowledge in a large amount of data. In this step, intelligent methods are applied to extract data patterns. The data is represented in the form of patterns, and models are built using techniques such as classification and clustering.
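As one concrete instance of a clustering technique, here is a minimal sketch of k-means on one-dimensional data. It is a teaching example rather than a production implementation: the data, the choice of k, and the way the initial centroids are picked are all assumptions.

```python
def k_means_1d(values, k=2, iterations=10):
    """Cluster numbers into k groups (k >= 2) by repeated assignment and averaging."""
    ordered = sorted(values)
    # Assumption: spread the initial centroids evenly over the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep the old centroid if empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical purchase amounts with two obvious groups.
centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], k=2)
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```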
#6) Pattern Evaluation
This step involves identifying the truly interesting patterns that represent knowledge, based on interestingness measures. Data summarization and visualization methods are used to make the data understandable to the user.
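Support and confidence are two widely used interestingness measures for association rules. The sketch below computes both for a made-up set of shopping transactions; the thresholds in `is_interesting` are arbitrary assumptions, not recommended values.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def is_interesting(antecedent, consequent, transactions,
                   min_support=0.4, min_confidence=0.6):
    """Keep a rule only if it clears both (assumed) thresholds."""
    rule_support = support(set(antecedent) | set(consequent), transactions)
    rule_confidence = confidence(antecedent, consequent, transactions)
    return rule_support >= min_support and rule_confidence >= min_confidence

print(support({"bread", "milk"}, transactions))           # 0.5
print(confidence({"bread"}, {"milk"}, transactions))      # about 0.667
print(is_interesting({"bread"}, {"milk"}, transactions))  # True
```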
#7) Knowledge Representation
Knowledge representation is a step where data visualization and knowledge representation tools are used to represent the mined data. Data is visualized in the form of reports, tables, etc.
Data Mining Process In Oracle DBMS
An RDBMS represents data in the form of tables with rows and columns. Data can be accessed by writing database queries.
Relational database management systems such as Oracle support data mining using CRISP-DM. The facilities of the Oracle database are useful in data preparation and data understanding. Oracle supports data mining through a Java interface, a PL/SQL interface, automated data mining, SQL functions, and graphical user interfaces.
Data Mining Process In Datawarehouse
A data warehouse is modeled around a multidimensional data structure called a data cube. Each cell in a data cube stores the value of some aggregate measure.
Data mining in multidimensional space is carried out in OLAP (Online Analytical Processing) style, which allows exploration of multiple combinations of dimensions at varying levels of granularity.
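To make the idea of cube cells and granularity concrete, the following sketch aggregates a tiny, made-up fact table over different combinations of dimensions. The dimension names and sales figures are assumptions, and a real OLAP engine would precompute and index these aggregates rather than scan the rows each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales amount).
DIMENSIONS = ("year", "region", "product")
facts = [
    (2023, "East", "Laptop", 100),
    (2023, "East", "Phone", 50),
    (2023, "West", "Laptop", 70),
    (2024, "East", "Laptop", 120),
    (2024, "West", "Phone", 30),
]

def aggregate(facts, group_by):
    """Sum the sales measure per combination of the chosen dimensions.

    Dimensions left out of `group_by` are rolled up (summed over).
    """
    positions = [DIMENSIONS.index(name) for name in group_by]
    cells = defaultdict(int)
    for row in facts:
        key = tuple(row[p] for p in positions)
        cells[key] += row[-1]  # the last field is the measure
    return dict(cells)

print(aggregate(facts, ("year", "region")))  # finer granularity: one cell per year and region
print(aggregate(facts, ("year",)))           # roll-up: {(2023,): 220, (2024,): 150}
print(aggregate(facts, ()))                  # grand total: {(): 370}
```

Each dictionary entry corresponds to one cell of the cube at the chosen level of granularity, with the summed sales as its aggregate measure.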
What Are The Applications of Data Extraction?
Areas where data mining is widely used include:
#1) Financial Data Analysis: Data mining is widely used in banking, investment, credit services, mortgages, automobile loans, insurance, and stock investment services. The data collected from these sources is generally complete, reliable, and of high quality, which facilitates systematic data analysis and data mining.
#2) Retail and Telecommunication Industries: The retail sector collects huge amounts of data on sales, customer shopping history, goods transportation, consumption, and service. Retail data mining helps identify customer buying behaviors, shopping patterns, and trends, and helps improve the quality of customer service, customer retention, and customer satisfaction.
#3) Science and Engineering: Data mining in computer science and engineering can help to monitor system status, improve system performance, isolate software bugs, detect software plagiarism, and recognize system malfunctions.
#4) Intrusion Detection and Prevention: An intrusion is any set of actions that threatens the integrity, confidentiality, or availability of network resources. Data mining methods can enhance the performance of intrusion detection and prevention systems.
#5) Recommender Systems: Recommender systems help consumers by making product recommendations that are of interest to users.
Data Mining Challenges
Listed below are the various challenges involved in data mining.
- Data mining needs large databases and data collections, which are difficult to manage.
- The data mining process requires domain experts, who are also difficult to find.
- Integrating data from heterogeneous databases is a complex process.
- Organizational practices need to be modified to make use of the data mining results. Restructuring these processes requires effort and cost.
Data Mining is an iterative process where the mining process can be refined, and new data can be integrated to get more efficient results. Data Mining meets the requirement of effective, scalable and flexible data analysis.
It can be considered a natural evolution of information technology. As a knowledge discovery process, it is completed by data preparation and data mining tasks.
Data mining can be performed on any kind of data, from database data to advanced data types such as time series. The data mining process comes with its own challenges as well.