This In-depth Tutorial on Data Mining Techniques Explains Algorithms, Data Mining Tools And Methods to Extract Useful Data:
We explored the fundamentals of Data Mining in the previous tutorial of this In-Depth Data Mining Training Series.
In this tutorial, we will learn about the various techniques used for data extraction. Since data mining means extracting useful information from vast amounts of data, certain techniques and methods must be applied to large data sets to pull that information out.
These techniques take the form of methods and algorithms applied to data sets. Some of the data mining techniques include Mining Frequent Patterns, Associations & Correlations, Classification, Clustering, Detection of Outliers, and some advanced techniques like Statistical, Visual and Audio data mining.
Generally, relational databases, transactional databases, and data warehouses are used for data mining techniques. However, there are also some advanced mining techniques for complex data such as time series, symbolic sequences, and biological sequential data.
What You Will Learn:
- Purpose Of Data Mining Techniques
- List Of Data Extraction Techniques
- Top Data Mining Algorithms
- Data Extraction Methods
- Top Data Mining Tools
- Recommended Reading
Purpose Of Data Mining Techniques
With huge amounts of data being stored every day, businesses are now interested in finding trends in them. Data extraction techniques help convert raw data into useful knowledge. Mining huge amounts of data requires software, as it is impossible for a human to manually go through such large volumes of data.
Data mining software analyzes the relationships between different items in large databases, which can support the decision-making process, reveal more about customers, shape marketing strategies, increase sales, and reduce costs.
List Of Data Extraction Techniques
The data mining technique to be applied depends on the goal of our data analysis.
So let's discuss the various ways in which data extraction can be performed:
#1) Frequent Pattern Mining/Association Analysis
This type of data mining technique looks for recurring relationships in the given dataset. It will look for interesting associations and correlations between the different items in the database and identify a pattern.
A classic example is "Market Basket Analysis": finding out which products customers are likely to purchase together in the store, such as bread and butter.
Application: Designing the placement of the products on store shelves, marketing, cross-selling of products.
The patterns can be represented in the form of association rules. An association rule is judged by two parameters, support and confidence, which measure the usefulness of the associated items. Support is the percentage of all transactions that contain both items together in one go.
Confidence is the percentage of transactions containing the first item that also contain the second. A mined pattern is considered interesting if it meets both a minimum support threshold and a minimum confidence threshold. The threshold values are decided by domain experts.
Bread => Butter [support = 2%, confidence = 60%]
The above statement is an example of an association rule. It means that 2% of all transactions contain both bread and butter, and 60% of the customers who bought bread also bought butter.
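To make these definitions concrete, here is a minimal Python sketch that computes support and confidence for the rule bread => butter from a toy transaction list (the transactions are invented for illustration):

```python
# Toy transactions, each a set of purchased items (illustrative data only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n          # fraction of all transactions containing both items
confidence = both / bread   # fraction of bread transactions that also have butter

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```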
Steps To Implement Association Analysis:
- Finding frequent itemsets. Itemset means a set of items. An itemset containing k items is a k-itemset. The frequency of an itemset is the number of transactions that contain the itemset.
- Generating strong association rules from the frequent itemsets. By strong association rules, we mean that the minimum threshold support and confidence is met.
There are various frequent itemset mining methods like Apriori Algorithm, Pattern Growth Approach, and Mining Using the Vertical Data Format. This technique is commonly known as Market Basket Analysis.
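The two steps above can be sketched in pure Python with a toy Apriori-style miner for step 1 (a simplified illustration, not a production implementation):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (frozensets) mapped to their support counts."""
    n = len(transactions)
    # Level 1: candidate 1-itemsets
    level = [frozenset([item]) for item in {i for t in transactions for i in t}]
    frequent, k = {}, 1
    while level:
        # Count the support of each candidate k-itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets into (k+1)-candidates
        keys = list(survivors)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset
        level = [c for c in candidates
                 if all(frozenset(s) in survivors for s in combinations(c, k))]
        k += 1
    return frequent

freq = apriori([{"bread", "butter"}, {"bread", "butter", "jam"},
                {"bread"}, {"butter"}], min_support=0.5)
print(freq)
```

Step 2 would then generate rules from each frequent itemset and keep only those whose confidence also meets the threshold.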
#2) Correlation Analysis
Correlation Analysis is an extension of association rules. Sometimes the support and confidence parameters still yield patterns that are uninteresting to users.
An example supporting the above statement: out of 1000 transactions analyzed, 600 contained bread, 750 contained butter, and 400 contained both bread and butter. Suppose the minimum support for the association rule run is 30% and the minimum confidence is 60%.
The support value of 400/1000 = 40% and the confidence value of 400/600 = 66% both meet the thresholds. However, the overall probability of purchasing butter is 750/1000 = 75%, which is higher than 66%. So buying bread actually lowers the likelihood of buying butter, meaning bread and butter are negatively correlated. The rule is deceiving.
For cases like the above, support and confidence are supplemented with another interestingness measure, correlation analysis, which helps in mining genuinely interesting patterns.
A => B [support, confidence, correlation].
A correlation rule is measured by the support, confidence, and correlation between itemsets A and B. Correlation is commonly measured by Lift and Chi-Square.
(i) Lift: As the word itself says, Lift represents the degree to which the presence of one itemset lifts the occurrence of other itemsets.
The lift between the occurrence of A and B can be measured by:
Lift (A, B) = P (A ∪ B) / (P (A) × P (B)), where P (A ∪ B) is the probability that a transaction contains both A and B.
If it is < 1, then A and B are negatively correlated.
If it is > 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
If it is = 1, then A and B are independent and there is no correlation between them.
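Plugging in the bread-and-butter numbers from the earlier correlation example (600 of 1000 transactions contain bread, 750 contain butter, 400 contain both):

```python
# Lift for bread => butter from the 1000-transaction example.
p_bread = 600 / 1000
p_butter = 750 / 1000
p_both = 400 / 1000          # P(bread U butter): both items together

lift = p_both / (p_bread * p_butter)
print(f"lift = {lift:.2f}")  # lift < 1, so the items are negatively correlated
```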
(ii) Chi-Square: This is another correlation measure. For each slot (each A and B combination) of the contingency table, it takes the squared difference between the observed and the expected count, divides it by the expected count, and sums the results over all slots.
If the chi-square value exceeds the significance threshold, A and B are correlated; the correlation is negative when the observed count for the (A, B) slot is lower than the expected count, and positive when it is higher.
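For the same 1000-transaction example, the chi-square value can be computed from the 2×2 contingency table of observed counts and the expected counts under independence:

```python
# Observed counts for the bread/butter example: 600 bread, 750 butter,
# 400 both, out of 1000 transactions.
n = 1000
observed = {
    ("bread", "butter"): 400,
    ("bread", "no_butter"): 600 - 400,
    ("no_bread", "butter"): 750 - 400,
    ("no_bread", "no_butter"): n - 600 - 750 + 400,
}
row_totals = {"bread": 600, "no_bread": n - 600}
col_totals = {"butter": 750, "no_butter": n - 750}

# chi-square = sum over slots of (observed - expected)^2 / expected,
# where expected = row_total * column_total / n under independence
chi_square = sum(
    (obs - row_totals[r] * col_totals[c] / n) ** 2
    / (row_totals[r] * col_totals[c] / n)
    for (r, c), obs in observed.items()
)
print(f"chi-square = {chi_square:.2f}")
```

The observed count for (bread, butter) is 400, while the expected count under independence is 600 × 750 / 1000 = 450; since observed < expected, the correlation is negative, in line with the lift result.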
#3) Classification
Classification helps in building models of important data classes. A model, or classifier, is constructed to predict class labels. Labels are the defined classes with discrete values like "yes" or "no", "safe" or "risky". It is a type of supervised learning, as the class labels are already known.
Data Classification is a two-step process:
- Learning step: The model is constructed here. A pre-defined algorithm is applied to training data for which class labels are provided, and the classification rules are constructed.
- Classification step: The model is used to predict class labels for given data. The accuracy of the classification rules is estimated on test data; if found acceptable, the rules are used to classify new data tuples.
Each data tuple is then assigned by the classifier to one of the predefined target categories.
Application: Banks identifying loan applicants as low, medium, or high risk; businesses designing marketing campaigns based on age-group classification.
#4) Decision Tree Induction
The Decision Tree Induction method comes under Classification Analysis. A decision tree is a tree-like structure that is easy to understand, simple, and fast. In it, each non-leaf node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
The attribute values of a tuple are tested against the decision tree from the root down to a leaf node. Decision trees are popular because they do not require any domain knowledge, can represent multidimensional data, and can easily be converted into classification rules.
Application: Decision trees are constructed in medicine, manufacturing, production, astronomy, etc.
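As a sketch, the loan-risk classification mentioned earlier can be written as a tiny decision tree of nested attribute tests; the attributes, thresholds, and labels here are invented purely for illustration:

```python
# A toy decision tree: each non-leaf node tests an attribute, each branch
# is a test outcome, each leaf returns a class label. Attribute names and
# thresholds are hypothetical.
def classify(applicant):
    if applicant["income"] >= 50_000:
        if applicant["has_default_history"]:
            return "medium risk"
        return "low risk"
    if applicant["employed"]:
        return "medium risk"
    return "high risk"

# The same tree, read off as classification rules:
#   IF income >= 50000 AND no default history THEN low risk
#   IF income >= 50000 AND default history    THEN medium risk
#   IF income < 50000  AND employed           THEN medium risk
#   IF income < 50000  AND not employed       THEN high risk

print(classify({"income": 60_000, "has_default_history": False, "employed": True}))
```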
#5) Bayes Classification
Bayesian Classification is another method of Classification Analysis. Bayes Classifiers predict the probability of a given tuple to belong to a particular class. It is based on the Bayes theorem, which is based on probability and decision theory.
Bayes Classification works with posterior and prior probabilities in the decision-making process. The posterior probability of a hypothesis is computed from the given information, i.e., with the attribute values known, while the prior probability of a hypothesis is given regardless of the attribute values.
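A worked example of the underlying arithmetic, with invented numbers: suppose 30% of past loan applicants were "risky" (the prior), and a low income was observed in 80% of risky applicants but only 20% of safe ones (the likelihoods).

```python
# Bayes' theorem: posterior = likelihood * prior / evidence.
p_risky, p_safe = 0.30, 0.70    # prior probabilities of each class
p_low_given_risky = 0.80        # likelihood of "low income" among risky
p_low_given_safe = 0.20         # likelihood of "low income" among safe

# Total probability of observing "low income" (the evidence)
evidence = p_low_given_risky * p_risky + p_low_given_safe * p_safe
posterior_risky = p_low_given_risky * p_risky / evidence
print(f"P(risky | low income) = {posterior_risky:.2f}")
```

A Bayes classifier would compute such a posterior for every class and assign the tuple to the class with the highest value.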
#6) Clustering Analysis
It is a technique of partitioning a set of data into clusters, i.e., groups of similar objects, using clustering algorithms. It is a type of unsupervised learning, as the label information is not known in advance. Clustering methods identify data objects that are similar to or different from each other, and the characteristics of each group are then analyzed.
Cluster analysis can be used as a preprocessing step before applying various other algorithms, such as characterization and attribute subset selection. It can also be used for outlier detection, e.g., spotting unusually high purchases in credit card transactions.
Applications: Image recognition, web search, and security.
#7) Outlier Detection
The process of finding data objects whose behavior is exceptional compared to the other objects is called outlier detection. Outlier detection and cluster analysis are related to each other. Outlier methods are categorized into statistical, proximity-based, clustering-based, and classification-based.
There are different types of outliers, some of them are:
- Global Outlier: A data object that deviates significantly from the rest of the data set.
- Contextual Outlier: A data object that deviates significantly with reference to a context, which depends on factors like day, time, and location.
- Collective Outlier: A group of data objects that, taken together, behaves differently from the entire data set.
Application: Detection of credit card fraud risks, novelty detection, etc.
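A minimal statistical sketch of global outlier detection: flag any value whose distance from the mean exceeds two standard deviations. The purchase amounts below are invented, and the 2-sigma cut-off is just a common rule of thumb for small samples:

```python
from statistics import mean, stdev

# Toy credit card purchase amounts; the last one is suspiciously large.
purchases = [25, 40, 31, 18, 55, 42, 37, 29, 33, 2100]

mu, sigma = mean(purchases), stdev(purchases)
# Flag values more than 2 standard deviations from the mean
outliers = [x for x in purchases if abs(x - mu) / sigma > 2]
print(outliers)
```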
#8) Sequential Patterns
In this type of data mining, a trend or other consistent pattern is recognized in sequences of events. Stores use their understanding of customers' sequential purchase behavior to decide how to display products on shelves.
Application: In e-commerce, when you buy item A, the site shows that item B is often bought together with item A, based on past purchasing history.
#9) Regression Analysis
This type of analysis is supervised and identifies which of the variables in a relationship depend on each other and which are independent. It can predict sales, profit, and temperature, forecast human behavior, etc. It works from a data set whose values are already known.
When an input is provided, the regression algorithm compares the predicted and expected values, and the error is used to steer the model toward an accurate result.
Application: Marketing and Product Development Efforts comparison.
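A minimal sketch of this idea using ordinary least squares to fit a straight line y = slope·x + intercept; the x/y values are made up (think advertising spend versus sales):

```python
# Toy data: x could be advertising spend, y the resulting sales.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = cov(x, y) / var(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Squared error between predictions and the known values
predicted = [slope * x + intercept for x in xs]
error = sum((y - p) ** 2 for y, p in zip(ys, predicted))
print(f"y = {slope:.2f}x + {intercept:.2f}, squared error = {error:.3f}")
```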
Top Data Mining Algorithms
Data mining techniques are applied through the algorithms behind them. These algorithms run in the data extraction software and are applied based on the business need.
Some of the algorithms that are widely used by organizations to analyze the data sets are defined below:
- K-means: It is a popular cluster analysis technique where a group of similar items is clustered together.
- Apriori Algorithm: It is a frequent itemset mining technique and association rules are applied to it on transactional databases. It will detect frequent itemsets and highlight general trends.
- K Nearest Neighbor: This method is used for classification and regression analysis. K nearest neighbor is a lazy learner: it simply stores the training data, and when new unlabeled data arrives, it classifies the input based on its nearest stored examples.
- Naïve Bayes: It is a group of simple probabilistic classification algorithms that assume the features of each data object are independent of one another. It is an application of Bayes' theorem.
- AdaBoost: It is a machine learning meta-algorithm, that is used to improve performance. Adaboost is sensitive to noisy data and outliers.
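As a sketch of the first algorithm in the list, here is a minimal one-dimensional k-means in pure Python (naive initialisation, toy data):

```python
# Minimal 1-D k-means: assign points to the nearest centroid, recompute
# centroids as cluster means, repeat until the centroids stop moving.
def kmeans_1d(points, k=2, iters=20):
    centroids = sorted(points)[:k]  # naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
print(centroids)
```

The two obvious groups of points end up in separate clusters, with centroids near 1.0 and 9.0.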
Data Extraction Methods
Some advanced Data Mining Methods for handling complex data types are explained below.
The data in today’s world is of varied types ranging from simple to complex data. To mine complex data types, such as Time Series, Multi-dimensional, Spatial, & Multi-media data, advanced algorithms and techniques are needed.
Some of them are described below:
- CLIQUE: It was the first clustering method to find the clusters in a multidimensional subspace.
- P3C: It is a well-known clustering method for moderate to high multidimensional data.
- LAC: It is a k-means-based method aimed at clustering moderate- to high-dimensionality data. The algorithm partitions the data into k disjoint sets of elements, removing possible outliers.
- CURLER: It is a correlation clustering algorithm, it spots both linear and non-linear correlations.
Top Data Mining Tools
Data Mining Tools are software used to mine data. The tools run algorithms at the backend. These tools are available in the market as Open Source, Free Software, and Licensed version.
Some of the Data Extraction Tools include:
RapidMiner is an open-source software platform for analytics teams that unites data prep, machine learning, and predictive model deployment. This tool is used for conducting data mining analysis and creating data models. It has large sets for classification, clustering, association rule mining, and regression algorithms.
Orange is an open-source tool containing a data visualization and analysis package. It can be imported into any working Python environment and is well suited for new researchers and small projects.
KEEL (Knowledge Extraction based on Evolutionary Learning) is an open-source (GPLv3) Java software tool that can be used for a large number of different knowledge data discovery tasks.
IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used to build predictive models and conduct other analytic tasks.
KNIME is a free and open-source tool containing a data cleaning and analysis package, with specialized algorithms in the areas of sentiment analysis and social network analysis. It can integrate data from various sources in the same analysis and has interfaces to Java, Python, and R programming.
Important Question: How is Classification different from Prediction?
Classification is the grouping of data; examples are grouping customers by age group or by medical condition. Prediction, on the other hand, derives an outcome using the classified data.
An example of predictive analysis is predicting interests based on age group, or the treatment for a medical condition. Prediction of continuous values is also known as estimation.
Important Term: Predictive Data Mining
Predictive data mining is done to forecast or predict data trends using business intelligence and other data. It helps businesses run better analytics and make better decisions, and it is often combined with predictive analytics: predictive data mining finds the data relevant for analysis, while predictive analytics uses that data to forecast the outcome.
In this tutorial, we have discussed the various data mining techniques that can help organizations and businesses find the most useful and relevant information. This information is used to create models that will predict the behavior of customers for the businesses to act on it.
With the above overview of data mining techniques, you can better judge their credibility and feasibility for a given task. Data extraction involves working with data, reformatting it, and restructuring it; the format of the information needed depends on the technique and the analysis to be done.
Finally, all these techniques, methods, and data mining systems help in the discovery of new, creative innovations.