This tutorial explains how to perform Data Visualization, K-means Cluster Analysis, and Association Rule Mining using WEKA Explorer:
In the previous tutorial, we learned about the WEKA Dataset, Classifier, and the J48 Algorithm for Decision Trees.
As we have seen before, WEKA is an open-source data mining tool used by many researchers and students to perform many machine learning tasks. The users can also build their machine learning methods and perform experiments on sample datasets provided in the WEKA directory.
Data visualization in WEKA can be performed using sample datasets or user-made datasets in .arff or .csv format.
Association Rule Mining is performed here using the Apriori algorithm. It is one of the associators that WEKA provides for frequent pattern mining (FPGrowth is another).
There are many algorithms present in WEKA to perform Cluster Analysis, such as FarthestFirst, FilteredClusterer, and HierarchicalClusterer. Out of these, we will use SimpleKMeans, which is one of the simplest methods of clustering.
What You Will Learn:
- Association Rule Mining Using WEKA Explorer
- K-means Algorithm Using WEKA Explorer
- Implement Data Visualization Using WEKA
Association Rule Mining Using WEKA Explorer
Let us see how to implement Association Rule Mining using WEKA Explorer.
Association Rule Mining
The Apriori algorithm for association rule mining was proposed by Agrawal and Srikant in 1994. Association rule mining helps us find patterns in the data. It is a data mining process that finds features which occur together or features that are correlated.
Applications of association rules include Market Basket Analysis, which analyzes the items purchased together in a single basket, and Cross Marketing, where a business works with other businesses that increase the value of its products, such as a vehicle dealer and an oil company.
Association rules are mined after the frequent itemsets in a large dataset have been found. These itemsets are found using mining algorithms such as Apriori and FP-Growth. Frequent itemset mining and rule generation rely on the support and confidence measures.
Support And Confidence
Support measures the probability that two items are purchased together in a single transaction, such as bread and butter. Confidence is the conditional probability that a transaction containing one item also contains the other, for example how often a basket with a laptop also contains antivirus software.
Minimum support and minimum confidence thresholds are set to prune infrequent itemsets and weak rules, so that only the most frequently occurring itemsets and the strongest rules remain.
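The two measures can be sketched in a few lines of Python. The transactions below are made-up toy data, and the helper names `support` and `confidence` are chosen for this illustration only; they are not part of WEKA.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

s = support({"bread", "butter"}, transactions)
c = confidence({"bread"}, {"butter"}, transactions)
print(f"support(bread, butter) = {s:.2f}")
print(f"confidence(bread => butter) = {c:.2f}")
```

Here bread and butter appear together in 3 of the 5 baskets, and bread appears in 4, so the rule "bread => butter" holds in 3 of the 4 baskets that contain bread.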
Implementation Using WEKA Explorer
WEKA contains an implementation of the Apriori algorithm for learning association rules. Apriori works only with binary and categorical (nominal) attributes, so if the dataset contains any numerical values, convert them to nominal first.
Apriori finds all rules that meet the minimum support and minimum confidence thresholds.
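To make the idea concrete, here is a minimal level-wise Apriori sketch in Python. It is a simplified illustration, not WEKA's implementation: the data is a hypothetical 6-row table written as "attribute=value" items, and the function names `apriori` and `rules` are assumptions made for this example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L(1): frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for every rule above the threshold."""
    found = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    found.append((set(antecedent), set(itemset - antecedent), conf))
    return found

# Hypothetical nominal data, one set of "attribute=value" items per instance.
data = [
    {"bread=T", "butter=T", "beer=F"},
    {"bread=T", "butter=T", "beer=F"},
    {"bread=F", "butter=T", "beer=F"},
    {"bread=T", "butter=T", "beer=F"},
    {"bread=T", "butter=F", "beer=T"},
    {"bread=F", "butter=F", "beer=T"},
]
frequent = apriori(data, min_support=0.4)
found = rules(frequent, min_confidence=0.9)
for antecedent, consequent, conf in found:
    print(sorted(antecedent), "=>", sorted(consequent), f"conf={conf:.2f}")
```

The prune step is what keeps the search manageable: a candidate itemset is only counted if every smaller subset of it was already found to be frequent.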
Follow the steps below:
#1) Prepare the dataset in a spreadsheet such as Excel and save it as “apriori.csv”.
#2) Open WEKA Explorer and under Preprocess tab choose “apriori.csv” file.
#3) The file now gets loaded in the WEKA Explorer.
#4) Remove the Transaction field by checking the checkbox and clicking on Remove as shown in the image below. Now save the file as “aprioritest.arff”.
#5) Go to the Associate tab. The apriori rules can be mined from here.
#6) Click on Choose and select Apriori, then click on the text box next to the Choose button to set the support and confidence parameters. The various parameters that can be set here are:
- “lowerBoundMinSupport” and “upperBoundMinSupport”: this is the support interval in which the algorithm will work.
- delta is the step by which the minimum support is changed. In this case, the minimum support is lowered in steps of 0.05, starting from the upper bound of 1.0 and going no lower than 0.1.
- metricType can be “Confidence”, “Lift”, “Leverage” and “Conviction”. This tells us how the association rules are ranked. Generally, Confidence is chosen.
- numRules is the number of association rules to be mined. By default, it is set to 10.
- significanceLevel sets the significance level for an optional significance test on the rules, used with the confidence metric. The default value of -1.0 turns the test off.
#7) The text box next to the Choose button shows “Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1”, which summarizes the options set for the algorithm in the settings window.
#8) Click on the Start button. The association rules are generated in the right panel. This panel consists of 2 sections: the first shows the algorithm and the dataset chosen for the run, and the second shows the Apriori information.
Let us understand the run information in the right panel:
- Scheme used is Apriori.
- Instances and Attributes: It has 6 instances and 4 attributes.
- Minimum support and minimum confidence are 0.4 and 0.9 respectively. With 6 instances, a minimum support of 0.4 corresponds to 2 instances.
- Number of cycles performed for mining the association rules is 12.
- Three sets of large itemsets are generated, L(1), L(2), and L(3), with sizes 7, 11, and 5 respectively. These itemsets are not ranked.
- Rules found are ranked. The interpretation of these rules is as follows:
- Butter T 4 => Beer F 4: butter is true in 4 of the 6 instances, and beer is false in all 4 of them. This gives a strong association. The confidence is 4/4 = 1.0.
The association rules can be mined out using WEKA Explorer with Apriori Algorithm. This algorithm can be applied to all types of datasets available in the WEKA directory as well as other datasets made by the user. The support and confidence and other parameters can be set using the Setting window of the algorithm.
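The way the support bounds, delta, and numRules interact can be sketched as a loop that keeps lowering the minimum support until enough rules are found. This is a simplified model, not WEKA's code. It assumes the first threshold tried is the upper bound minus delta, which is an assumption chosen because it matches the 12 cycles and the final minimum support of 0.4 in the run information above. The function `lower_support_until` and the stand-in `count_rules` are hypothetical.

```python
def lower_support_until(num_rules, count_rules, upper=1.0, lower=0.1, delta=0.05):
    """Lower the minimum support step by step until `count_rules(min_support)`
    reports at least `num_rules` rules or the lower bound is reached.
    Returns (final_min_support, cycles)."""
    # Count in whole steps of delta to avoid floating-point drift.
    steps_upper = round(upper / delta)
    steps_lower = round(lower / delta)
    step = steps_upper - 1  # assumption: start one delta below the upper bound
    cycles = 0
    while True:
        min_support = step * delta
        cycles += 1
        if count_rules(min_support) >= num_rules or step <= steps_lower:
            return round(min_support, 10), cycles
        step -= 1

# Hypothetical stand-in for a real mining run: pretend that 10 rules
# only become available once the minimum support drops to 0.4.
def count_rules(min_support):
    return 10 if min_support <= 0.4 + 1e-9 else 3

final_support, cycles = lower_support_until(10, count_rules)
print(final_support, cycles)
```

In a real run, `count_rules` would be replaced by an actual frequent itemset and rule search at the given threshold.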
K-means Algorithm Using WEKA Explorer
Let us see how to implement the K-means algorithm for clustering using WEKA Explorer.
What Is Cluster Analysis
Clustering algorithms are unsupervised learning algorithms used to create groups of data with similar characteristics. They aggregate objects with similarities into groups and subgroups, thus partitioning the dataset. Cluster analysis is the process of partitioning a dataset into subsets. These subsets are called clusters, and the set of clusters is called a clustering.
Cluster Analysis is used in many applications such as image recognition, pattern recognition, web search, and security, and in business intelligence, for example to group customers with similar preferences.
What Is K-means Clustering
K-means clustering is one of the simplest clustering algorithms. In the K-means algorithm, the dataset is partitioned into K clusters. An objective function is used to measure the quality of the partitions, so that similar objects end up in one cluster and dissimilar objects in other clusters.
In this method, each cluster is represented by its centroid. The centroid is taken as the center of the cluster and is calculated as the mean value of the points within the cluster. The quality of a clustering is measured by the Euclidean distance between each point and the center of its cluster. This distance should be as small as possible.
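As a small illustration, the centroid and the sum of squared errors (SSE) of a clustering can be computed as below. The points are hypothetical 2D values and the function names are chosen for this sketch. A grouping that keeps nearby points together gives a much smaller SSE than one that mixes distant points.

```python
def centroid(points):
    """Mean of the points, computed dimension by dimension."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)
    return total

# Two ways of grouping the same four hypothetical points.
good = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
bad = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]

print("SSE of the compact grouping:", sse(good))
print("SSE of the mixed grouping:", sse(bad))
```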
How Does The K-means Clustering Algorithm Work
Step #1: Choose a value of K, where K is the number of clusters, and pick K initial cluster centers (for example, K randomly selected points).
Step #2: Go through each point and assign it to the cluster whose center is nearest to it. Once every point has been assigned, compute the centroid of each cluster.
Step #3: Go through every point in the dataset again and calculate the Euclidean distance between the point and the centroid of every cluster. If a point is in a cluster whose centroid is not the nearest one, reassign it to the nearest cluster. After doing this for all the points in the dataset, recalculate the centroid of each cluster.
Step #4: Repeat Step #3 until no point is reassigned between two consecutive iterations.
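The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not WEKA's SimpleKMeans: the initial centers are K points drawn at random with a fixed seed, the data is a hypothetical set of 2D points, and the names `kmeans`, `nearest`, `dist2`, and `mean` are made up for this example.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Centroid of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: dist2(p, centers[j]))

def kmeans(points, k, seed=10, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # Step 1: K initial centers
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 3: assign every point to its nearest center.
        new_assignment = [nearest(p, centers) for p in points]
        if new_assignment == assignment:      # Step 4: nothing changed, stop
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                       # keep the old center if a cluster is empty
                centers[j] = mean(members)
    return centers, assignment

# Hypothetical data: two loose groups of 2D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, assignment = kmeans(points, k=2)
print("centers:", centers)
print("assignment:", assignment)
```

Because the starting centers are random, a different seed can lead to a different final clustering, which is why a seed is usually fixed to make runs repeatable.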
K-means Clustering Implementation Using WEKA
The steps for implementation using WEKA are as follows:
#1) Open WEKA Explorer and click on Open File in the Preprocess tab. Choose dataset “vote.arff”.
#2) Go to the “Cluster” tab and click on the “Choose” button. Select the clustering method as “SimpleKMeans”.
#3) Click on the text box next to the Choose button to open the settings, and then set the following fields:
- Distance function as EuclideanDistance.
- The number of clusters as 6. With a larger number of clusters, the sum of squared errors will reduce.
- Seed as 10.
Click on OK and start the algorithm.
#4) Click on Start in the left panel. The algorithm displays the results on the white screen. Let us analyze the run information:
- Scheme, Relation, Instances, and Attributes describe the property of the dataset and the clustering method used. In this case, vote.arff dataset has 435 instances and 13 attributes.
- With K-means clustering, the number of iterations is 5.
- The within-cluster sum of squared errors is 1098.0. This error will reduce with an increase in the number of clusters.
- The final clusters with their centroids are represented in the form of a table. In our case, the cluster sizes shown in the table header are 168.0, 47.0, 37.0, 188.8.131.52 and 28.0.
- Clustered instances represent the number and percentage of total instances falling in the cluster.
#5) Choose “Classes to clusters evaluation” and click on Start.
The algorithm will assign class labels to the clusters. Cluster 0 represents republican and Cluster 3 represents democrat. The share of incorrectly clustered instances is 39.77%, which can be reduced by ignoring the unimportant attributes.
#6) To ignore the unimportant attributes, click on the “Ignore attributes” button and select the attributes to be removed.
#7) Use the “Visualize” tab to visualize the Clustering algorithm result. Go to the tab and click on any box. Move the Jitter to the max.
- The X-axis and Y-axis represent the attributes selected for the plot.
- The blue color represents class label democrat and the red color represents class label republican.
- Jitter spreads out overlapping points so that the clusters are easier to see.
- Click the box on the right-hand side of the window to change the x coordinate attribute and view clustering with respect to other attributes.
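The idea behind the classes to clusters evaluation in step #5 can be sketched as follows. This is a simplified version that labels each cluster with its majority class and counts the instances that disagree. WEKA's own evaluation may map classes to clusters differently, so the numbers it reports need not match this sketch. The cluster assignments and labels below are hypothetical.

```python
from collections import Counter

def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class.
    Returns (cluster -> class mapping, fraction of incorrectly clustered instances)."""
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}
    wrong = sum(1 for c, y in zip(cluster_ids, class_labels) if mapping[c] != y)
    return mapping, wrong / len(class_labels)

# Hypothetical cluster assignments and true class labels for 8 instances.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
class_labels = ["republican", "republican", "democrat", "democrat",
                "democrat", "democrat", "republican", "republican"]

mapping, error = classes_to_clusters(cluster_ids, class_labels)
print(mapping)
print(f"Incorrectly clustered instances: {error:.2%}")
```

The class attribute is not used while the clusters are built; it is only compared with the clusters afterwards.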
K-means clustering is a simple cluster analysis method. The number of clusters can be set using the settings window. The centroid of each cluster is calculated as the mean of all points within the cluster. With an increase in the number of clusters, the sum of squared errors is reduced. The objects within a cluster exhibit similar characteristics and properties. The clusters can then be compared with the class labels.
Implement Data Visualization Using WEKA
Data visualization is the method of representing data through graphs and plots so that the data can be understood clearly.
There are many ways to represent data. Some of them are as follows:
#1) Pixel Oriented Visualization: Here the color of each pixel represents the value of a dimension.
#2) Geometric Representation: The multidimensional datasets are represented in 2D, 3D, and 4D scatter plots.
#3) Icon Based Visualization: The data is represented using Chernoff’s faces and stick figures. Chernoff’s faces use the human mind’s ability to recognize facial characteristics and the differences between them. The stick figure technique uses a 5-piece stick figure (four limbs and a body) to represent multidimensional data.
#4) Hierarchical Data Visualization: The datasets are represented using treemaps. A treemap represents hierarchical data as a set of nested rectangles.
Data Visualization Using WEKA Explorer
Data Visualization using WEKA is done on the IRIS.arff dataset.
Steps involved are as follows:
#1) Go to the Preprocess tab and open IRIS.arff dataset.
#2) The dataset has 4 attributes and 1 class label. The attributes in this dataset are:
- sepallength: Type - numeric
- sepalwidth: Type - numeric
- petallength: Type - numeric
- petalwidth: Type - numeric
- class: Type - nominal
#3) To visualize the dataset, go to the Visualize tab. The tab shows the attributes plot matrix. The dataset attributes are marked on the x-axis and y-axis while the instances are plotted. The box with x-axis attribute and y-axis attribute can be enlarged.
#4) Click on the box of the plot to enlarge it. For example, x: petallength and y: petalwidth. The class labels are represented in different colors.
- Class label Iris-setosa: blue
- Class label Iris-versicolor: red
- Class label Iris-virginica: green
These colors can be changed. To change a color, click on the class label at the bottom and a color window will appear.
#5) Click on the instance represented by ‘x’ in the plot. It will give the instance details. For example:
- Instance number: 91
- sepallength: 5.5
- sepalwidth: 2.6
- petallength: 4.4
- petalwidth: 1.2
- Class: Iris-versicolor
Some of the points in the plot appear darker than other points. These points represent 2 or more instances with the same class label and the same value of attributes plotted on the graph such as petalwidth and petallength.
The figure below shows a point that carries the information of 2 instances.
#6) The X and Y-axis attributes can be changed from the right panel in Visualize graph. The user can view different plots.
#7) The Jitter is used to add randomness to the plot. Sometimes points overlap exactly. Jitter shifts each point by a small random amount, so that the instances hidden behind a darker spot become visible.
#8) To get a clearer view of the dataset and remove outliers, the user can select instances using the dropdown. Click on the “select instance” dropdown and choose “Rectangle”. With this, the user will be able to select points in the plot by drawing a rectangle.
#9) Click on “Submit”. Only the selected dataset points will be displayed and the other points will be excluded from the graph.
The figure below shows the points from the selected rectangular shape. The plot represents points with only 3 class labels. The user can click on “Save” to save the dataset or “Reset” to select another instance. The dataset will be saved in a separate .ARFF file.
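The three ideas used in these steps, overlapping points, jitter, and rectangle selection, can be sketched in Python. The (petallength, petalwidth) pairs below are illustrative values typed in by hand rather than read from the dataset file, and the function names `jitter` and `select_rectangle` are made up for this example.

```python
import random
from collections import Counter

def jitter(points, amount, seed=0):
    """Shift every point by a small random offset of at most `amount` per axis."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
            for x, y in points]

def select_rectangle(points, x_min, x_max, y_min, y_max):
    """Keep only the points that fall inside the given rectangle."""
    return [(x, y) for x, y in points
            if x_min <= x <= x_max and y_min <= y <= y_max]

# Illustrative (petallength, petalwidth) pairs; the first two overlap exactly.
points = [(1.4, 0.2), (1.4, 0.2), (4.4, 1.2), (5.1, 1.9)]

# Coordinates shared by more than one instance show up as darker spots.
overlaps = {p: n for p, n in Counter(points).items() if n > 1}
print("overlapping coordinates:", overlaps)

# With jitter, the overlapping instances are drawn at slightly different places.
jittered = jitter(points, amount=0.05)
print("jittered:", jittered)

# Rectangle selection keeps only the points inside the chosen area.
selected = select_rectangle(points, 4.0, 6.0, 1.0, 2.0)
print("selected:", selected)
```

Jitter only changes where the points are drawn. The underlying attribute values stay the same.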
Data visualization using WEKA is simplified with the help of the plot matrix. The user can view the data at any level of granularity. The attributes are placed on the X-axis and Y-axis and the instances are plotted against them. Points drawn in a darker color represent multiple instances.
WEKA is an efficient data mining tool to perform many data mining tasks as well as experiment with new methods over datasets. WEKA has been developed by the Department of Computer Science, the University of Waikato in New Zealand.
Today’s world is overwhelmed with data, from shopping in the supermarket to security cameras at home. Data mining takes this raw data and converts it into information that can be used to make predictions. WEKA, with the help of the Apriori algorithm, mines association rules in a dataset. Apriori is a frequent pattern mining algorithm that counts the number of occurrences of an itemset in the transactions.
Cluster Analysis is a technique to find clusters of data that share similar characteristics. WEKA provides many algorithms to perform cluster analysis, of which SimpleKMeans is widely used.
Data Visualization in WEKA can be performed on all datasets in the WEKA directory. The raw dataset can be viewed, and the results of other algorithms such as classification, clustering, and association can be visualized as well.