Data Mining: Process, Techniques & Major Issues In Data Analysis

This In-depth Data Mining Tutorial Explains What Is Data Mining, Including Processes And Techniques Used For Data Analysis:

To understand the term mining, consider the example of extracting gold from rocks, which is called gold mining. Here the useful thing being extracted is “Gold”, hence the name gold mining.

Similarly, extracting useful information from a vast amount of data is termed Knowledge Mining and is popularly known as Data Mining. By useful information, we mean data that can help us predict an outcome.

For example, data mining can reveal the purchasing trends for a particular item (say, iron) within a particular age group (for example, 40-70 years).


Table of Contents:

List Of Data Mining Tutorials
Overview Of Tutorials In This Data Mining Series
What Is Data Mining?
- Data Analysis Process
What Kinds Of Data Can Be Mined?
What Techniques Are Used In Data Mining?
Major Issues In Data Analysis
Conclusion

List Of Data Mining Tutorials

Tutorial #1: Data Mining: Process, Techniques & Major Issues In Data Analysis (This Tutorial)
Tutorial #2: Data Mining Techniques: Algorithm, Methods & Top Data Mining Tools
Tutorial #3: Data Mining Process: Models, Process Steps & Challenges Involved
Tutorial #4: Data Mining Examples: Most Common Applications Of Data Mining 2019
Tutorial #5: Decision Tree Algorithm Examples In Data Mining
Tutorial #6: Apriori Algorithm In Data Mining: Implementation With Examples
Tutorial #7: Frequent Pattern (FP) Growth Algorithm In Data Mining

Overview Of Tutorials In This Data Mining Series

Tutorial #	What You Will Learn
Tutorial_#1:	Data Mining: Process, Techniques & Major Issues In Data Analysis This In-depth Data Mining Tutorial explains What is Data Mining, including the Processes And Techniques used for Data Analysis.
Tutorial_#2:	Data Mining Techniques: Algorithm, Methods & Top Data Mining Tools This Tutorial on Data Mining Techniques explains Algorithms, Data Mining Tools and Methods to Extract Useful Data.
Tutorial_#3:	Data Mining Process: Models, Process Steps & Challenges Involved This Tutorial on Data Mining Process Covers Data Mining Models, Steps and Challenges Involved in the Data Extraction Process.
Tutorial_#4:	Data Mining Examples: Most Common Applications Of Data Mining 2019 Most Popular Data Mining Examples in Real Life are covered in this Tutorial. You will get to know more about Data Mining Application in Finance, Marketing, Healthcare, and CRM.
Tutorial_#5:	Decision Tree Algorithm Examples In Data Mining This In-depth Tutorial explains all about Decision Tree Algorithm in Data Mining. You will Learn About Decision Tree Examples, Algorithm & Classification.
Tutorial_#6:	Apriori Algorithm In Data Mining: Implementation With Examples This is a Simple Tutorial on Apriori Algorithm to find out Frequent Itemsets in Data Mining. You will also get to know the Steps in Apriori and understand How it Works.
Tutorial_#7:	Frequent Pattern (FP) Growth Algorithm In Data Mining This is a Detailed Tutorial on Frequent Pattern Growth Algorithm which represents the Database in the form of an FP Tree. FP Growth Vs Apriori Comparison is also explained here.

What Is Data Mining?

Data Mining is in high demand today because it helps businesses study how to increase the sales of their products. Consider the example of a fashion store that registers each customer who purchases an item from the store.

Based on the data given by the customer, such as age, gender, income group, and profession, the store can find out which types of customers buy which products. The name of the customer is of no use here, because a name does not help predict whether that person will buy a certain product.

Thus, useful information can be derived from the age group, gender, income group, profession, etc. Searching for knowledge or interesting patterns in data is “Data Mining”. Other terms used in its place include Knowledge Mining from Data, Knowledge Extraction, Data Analysis, and Pattern Analysis.

Another term that is popularly used in data mining is Knowledge Discovery from Data or KDD.
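
The fashion store example above can be sketched in a few lines of Python. This is only an illustration: the customer records, the age bucket boundaries, and the function names are all made up for this tutorial. Note that the name field is stored but never used, since it carries no predictive value.

```python
from collections import Counter, defaultdict

# Hypothetical customer records; names are present but deliberately unused.
purchases = [
    {"name": "A. Rao", "age": 23, "gender": "F", "product": "sneakers"},
    {"name": "B. Lee", "age": 27, "gender": "M", "product": "sneakers"},
    {"name": "C. Diaz", "age": 45, "gender": "F", "product": "blazer"},
    {"name": "D. Khan", "age": 52, "gender": "M", "product": "blazer"},
    {"name": "E. Roy", "age": 48, "gender": "F", "product": "blazer"},
    {"name": "F. Wong", "age": 55, "gender": "F", "product": "blazer"},
]

def age_group(age):
    """Map an age to a coarse bucket (the bucket edges are an assumption)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50+"

def products_by_group(records):
    """Count the products bought in each age group, ignoring customer names."""
    counts = defaultdict(Counter)
    for record in records:
        counts[age_group(record["age"])][record["product"]] += 1
    return counts

by_group = products_by_group(purchases)
# Most frequently bought product in each age group.
top_product = {group: c.most_common(1)[0][0] for group, c in by_group.items()}
print(top_product)
# {'under 30': 'sneakers', '30-49': 'blazer', '50+': 'blazer'}
```

The same grouping could be repeated for gender, income group, or profession to see which attribute separates the buyers best.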

Data Analysis Process

The knowledge discovery process is a sequence of the following steps:

Data Cleaning: This step removes noise and inconsistent data from the input data.
Data Integration: This step combines multiple sources of data. Data cleaning and data integration are often performed together as a preprocessing step, and the resulting data is then stored in a data warehouse.
Data Selection: This step retrieves the data relevant to the analysis task from the database.
Data Transformation: In this step, various data aggregation and data summary techniques are applied to transform the data into a useful form for mining.
Data Mining: In this step, data patterns are extracted by applying intelligent methods.
Pattern Evaluation: The extracted data patterns are evaluated and recognized according to the interestingness measures.
Knowledge Representation: Visualization and knowledge representation techniques are used to present the mined knowledge to the users.

The first four steps make up the data preprocessing stage. Data mining appears here as a single step, but the term is also commonly used to refer to the entire knowledge discovery process.

Thus, data analysis is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the World Wide Web, flat files, and other information repositories.
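
The sequence of steps above can be sketched as a chain of small functions. This is a minimal sketch under stated assumptions, not a real pipeline: the two data sources, the field names, the age bands, and the "mining" step (picking the band with the highest average spend) are all hypothetical.

```python
from statistics import mean

# Two hypothetical sources: sales rows and a customer profile lookup.
store_sales = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": None},   # missing value to be cleaned out
    {"customer_id": 3, "amount": 80.0},
    {"customer_id": 3, "amount": 40.0},
]
customer_profiles = {1: {"age": 25}, 2: {"age": 41}, 3: {"age": 58}}

def clean(rows):
    """Data cleaning: drop rows whose amount is missing."""
    return [r for r in rows if r["amount"] is not None]

def integrate(rows, profiles):
    """Data integration: join each sale with its customer profile."""
    return [{**r, **profiles[r["customer_id"]]}
            for r in rows if r["customer_id"] in profiles]

def select(rows):
    """Data selection: keep only the attributes the task needs."""
    return [{"age": r["age"], "amount": r["amount"]} for r in rows]

def transform(rows):
    """Data transformation: aggregate to the average amount per age band."""
    bands = {}
    for r in rows:
        band = "under 40" if r["age"] < 40 else "40 and over"
        bands.setdefault(band, []).append(r["amount"])
    return {band: mean(values) for band, values in bands.items()}

def mine(summary):
    """Data mining (toy version): the band with the highest average spend."""
    return max(summary, key=summary.get)

summary = transform(select(integrate(clean(store_sales), customer_profiles)))
best_band = mine(summary)
print(summary)    # {'under 40': 120.0, '40 and over': 60.0}
print(best_band)  # under 40
```

Pattern evaluation and knowledge representation are left out of the sketch; in practice they would decide whether the result is interesting enough to report and how to present it.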

What Kinds Of Data Can Be Mined?

The most basic forms of data for mining are database data, data warehouse data, and transactional data. The data mining techniques can also be applied to other forms like data streams, sequenced data, text data, and spatial data.

#1) Database Data: A database management system consists of a collection of interrelated data and a set of software programs to manage and access that data. A relational database is a collection of tables, and each table consists of a set of attributes (columns) and tuples (rows).

Mining relational databases searches for trends and data patterns, e.g. the credit risk of customers based on age, income, and previous credit risk. Mining can also find deviations from the expected, e.g. a significant increase in the price of an item.

#2) Data Warehouse Data: A data warehouse is a collection of information gathered from multiple data sources and stored under a unified schema at a single site. A DW is modeled as a multidimensional data structure called a data cube, whose cells and dimensions allow precomputation of, and faster access to, the data.
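
As an illustration of finding deviations from the expected, the sketch below flags a price that lies far from the other observed prices. The price series, the threshold of 2 standard deviations, and the function name are assumptions chosen for this example; real deviation detection usually needs a more careful model.

```python
from statistics import mean, stdev

def find_deviations(prices, threshold=2.0):
    """Return the indices of prices that lie far from all the other prices.

    Each price is compared with the mean and standard deviation of the
    remaining prices, so one extreme value cannot hide itself.
    Requires at least three prices.
    """
    flagged = []
    for i, price in enumerate(prices):
        others = prices[:i] + prices[i + 1:]
        mu, sigma = mean(others), stdev(others)
        if sigma > 0 and abs(price - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical weekly prices of one item; week 5 shows a sharp increase.
weekly_prices = [10.0, 10.2, 9.9, 10.1, 10.0, 14.5, 10.1]
print(find_deviations(weekly_prices))  # [5]
```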

Data mining is performed in an OLAP style by combining the dimensions at varying levels of granularity.
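
Combining dimensions at different levels of granularity can be pictured as grouping the same fact rows by different sets of dimensions. The following is a small sketch with made-up sales rows and a hypothetical roll_up helper; it does not represent any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, sales)
facts = [
    (2023, "Q1", "North", 100),
    (2023, "Q1", "South", 150),
    (2023, "Q2", "North", 120),
    (2024, "Q1", "North", 90),
    (2024, "Q1", "South", 200),
]
DIMENSIONS = ("year", "quarter", "region")

def roll_up(rows, keep):
    """Sum sales grouped by the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]   # the last column holds the sales measure
    return dict(totals)

by_year = roll_up(facts, ["year"])
by_year_region = roll_up(facts, ["year", "region"])
print(by_year)                          # {(2023,): 370, (2024,): 290}
print(by_year_region[(2023, "North")])  # 220
```

Dropping a dimension from `keep` gives a coarser view (roll-up), and adding one gives a finer view (drill-down).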

#3) Transactional Data: Each record in transactional data captures a transaction. It has a transaction ID and a list of the items involved in the transaction.
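
A transactional data set can be represented as a mapping from transaction ID to item list. The sketch below uses hypothetical transactions and counts how often each pair of items is bought together, expressed as support (the fraction of transactions that contain the pair).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: transaction ID -> list of items.
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "butter", "milk"],
    "T300": ["butter", "jam"],
    "T400": ["bread", "milk", "jam"],
}

def pair_support(txns):
    """Return the support of every item pair that occurs at least once."""
    counts = Counter()
    for items in txns.values():
        # Sort so that each pair always has the same order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n / len(txns) for pair, n in counts.items()}

support = pair_support(transactions)
print(support[("bread", "milk")])  # 0.75
```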

#4) Other kinds of Data: Other data can include: time-related data, spatial data, hypertext data, and multimedia data.

What Techniques Are Used In Data Mining?

Data Mining is a highly application-driven domain. Many disciplines, such as statistics, machine learning, pattern recognition, information retrieval, and visualization, influence the development of data analysis methods.

Let’s discuss some of them here!!

Statistics

Statistics is the study of the collection, analysis, interpretation, and presentation of data, often by means of statistical models. For example, statistics can be used to model noise and missing data, and this model can then be applied to a large data set to identify the noisy and missing values in the data.
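
As a rough illustration, the sketch below fits a very simple noise model (the median and the median absolute deviation of the observed values) and then uses it to label each value as ok, noise, or missing. The readings, the cut-off factor k, and the function names are assumptions made for this example.

```python
from statistics import median

def fit_noise_model(values):
    """Estimate the centre and spread from the non-missing values."""
    observed = [v for v in values if v is not None]
    centre = median(observed)
    # Median absolute deviation: a spread measure that resists outliers.
    mad = median(abs(v - centre) for v in observed)
    return centre, mad

def label_values(values, centre, mad, k=5.0):
    """Label each value as 'missing', 'noise', or 'ok' using the model."""
    labels = []
    for v in values:
        if v is None:
            labels.append("missing")
        elif mad > 0 and abs(v - centre) > k * mad:
            labels.append("noise")
        else:
            labels.append("ok")
    return labels

# Hypothetical sensor readings with one gap and one implausible value.
readings = [20.1, 19.8, None, 20.3, 95.0, 20.0]
centre, mad = fit_noise_model(readings)
labels = label_values(readings, centre, mad)
print(labels)  # ['ok', 'ok', 'missing', 'ok', 'noise', 'ok']
```

The model could be fitted on a sample and then applied to a much larger data set by calling label_values with the same centre and mad.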

Machine Learning

Machine learning (ML) investigates how computers can learn, or improve their performance, based on data. A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data.

Machine Learning focuses on accuracy, whereas data mining also emphasizes the efficiency and scalability of mining methods on large data sets, complex data, etc.

Three commonly cited types of machine learning are:

Supervised Learning: The target values are known for the training data set, and the machine is trained according to these target values.
Unsupervised Learning: The target values are not known, and the machine discovers structure in the data by itself.
Semi-Supervised Learning: It uses both labeled and unlabeled data, combining the techniques of supervised and unsupervised learning.
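
The difference between the first two types can be shown on a tiny one-dimensional data set. In this sketch the supervised part is a nearest-centroid classifier and the unsupervised part is a simple two-cluster k-means; both were picked for illustration only, and the spending figures and labels are made up.

```python
from statistics import mean

def train_nearest_centroid(points, labels):
    """Supervised: use the known labels to compute one centroid per class."""
    groups = {}
    for x, y in zip(points, labels):
        groups.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in groups.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def two_means(points, iterations=10):
    """Unsupervised: split points into two clusters without any labels.

    Assumes the points contain at least two distinct values.
    """
    lo, hi = min(points), max(points)   # initial cluster centres
    for _ in range(iterations):
        left = [x for x in points if abs(x - lo) <= abs(x - hi)]
        right = [x for x in points if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(left), mean(right)
    return sorted(left), sorted(right)

# Hypothetical monthly spend of six customers.
spend = [10, 12, 11, 50, 52, 49]
labels = ["low", "low", "low", "high", "high", "high"]

model = train_nearest_centroid(spend, labels)
print(predict(model, 15))   # low
print(two_means(spend))     # ([10, 11, 12], [49, 50, 52])
```

The supervised model needs the labels list to learn from, while two_means recovers the same two groups from the numbers alone.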

Information Retrieval (IR)

It is the science of searching for documents or information in documents.

It rests on two assumptions:

Data that is to be searched is unstructured.
The queries are formed mainly by keywords.

By combining data analysis and IR, we can find the major topics in a collection of documents and also the major topics involved in each document.
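
Keyword search over unstructured text is often implemented with an inverted index, which maps each word to the documents that contain it. The sketch below is a bare-bones version with hypothetical documents; it only shows keyword retrieval and does not attempt topic discovery.

```python
from collections import defaultdict

# Hypothetical unstructured documents.
documents = {
    "doc1": "Data mining finds patterns in large data sets",
    "doc2": "Gold mining extracts gold from rocks",
    "doc3": "Machine learning finds patterns automatically",
}

def build_index(docs):
    """Map every lower-cased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(documents)
print(sorted(search(index, "mining")))          # ['doc1', 'doc2']
print(sorted(search(index, "finds patterns")))  # ['doc1', 'doc3']
```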

Major Issues In Data Analysis

Data Mining has a number of issues related to it as mentioned below:

Mining Methodology

As there are diverse applications, new mining tasks continue to emerge. These tasks can use the same database in different ways and require the development of new data mining techniques.
While searching for knowledge in large datasets, we need to explore multidimensional space. To find interesting patterns, various combinations of dimensions need to be applied.
Uncertain, noisy and incomplete data can sometimes lead to erroneous derivation.
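
To get a feel for why exploring a multidimensional space is costly, the sketch below lists every combination of dimensions that could be examined. The dimension names are hypothetical. With n dimensions there are 2^n - 1 non-empty combinations, so the search space grows quickly.

```python
from itertools import combinations

def dimension_subsets(dimensions):
    """Yield every non-empty combination of dimensions, smallest first."""
    for size in range(1, len(dimensions) + 1):
        yield from combinations(dimensions, size)

subsets = list(dimension_subsets(["age", "income", "region"]))
print(len(subsets))  # 7
print(subsets[3])    # ('age', 'income')
```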

User Interaction

The data analysis process should be highly interactive. Interactivity makes it easier for the user to guide the mining process.
The domain knowledge, background knowledge, constraints, etc., should all be incorporated in the data mining process.
The knowledge discovered by mining the data should be usable for humans. The system should adopt an expressive representation of knowledge, user-friendly visualization techniques, etc.

Efficiency And Scalability

Data mining algorithms should be efficient and scalable to effectively extract interesting information from a huge amount of data in the data repositories.
The wide distribution of data and the complexity of computation motivate the development of parallel and distributed data-intensive algorithms.
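
One common pattern behind parallel and distributed mining is to split the data into partitions, process each partition independently, and then merge the partial results. The sketch below counts item occurrences this way. The partitions and function names are hypothetical, and a thread pool is used only as a stand-in for real worker machines (threads in CPython do not speed up this kind of CPU-bound work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Map step: count the items within one partition of transactions."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))   # count each item once per transaction
    return counts

def distributed_count(partitions, workers=2):
    """Reduce step: merge the partial counts from all partitions."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_items, partitions):
            total += partial
    return total

# Hypothetical data, already split into two partitions.
partitions = [
    [["bread", "milk"], ["bread"]],
    [["milk", "jam"], ["bread", "milk"]],
]
totals = distributed_count(partitions)
print(totals["bread"], totals["milk"], totals["jam"])  # 3 3 1
```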

Diversity of Database Types

The construction of effective and efficient data analysis tools for diverse applications and a wide spectrum of data types, ranging from unstructured data, temporal data, hypertext, and multimedia data to software program code, remains a challenging and active area of research.

Social Impact

The disclosure and use of data, the potential violation of individual privacy, and the protection of rights are areas of concern that need to be addressed.

Conclusion

Data Mining helps in decision making and in the analysis of large amounts of data. Nowadays it is a widely used business technique. It allows automatic analysis of data and identifies popular trends and behavior.

Data Analysis can be combined with machine learning, statistics, artificial intelligence, etc., for advanced data analysis and behavior study.

Data Mining should be applied with various factors in mind, such as the cost of extracting information and patterns from databases (complex algorithms that require expert resources may need to be applied) and the type of information (historical data may not reflect the present situation, in which case the analysis will not be useful).

We hope this tutorial enriched your knowledge of the concept of Data Mining!!
