This tutorial explains WEKA datasets, classifiers, and the J48 algorithm for decision trees. It also describes the sample ARFF datasets that ship with WEKA.
In the previous tutorial, we learned about the WEKA machine learning tool, its features, and how to download, install, and use it.
WEKA is a library of machine learning algorithms to solve data mining problems on real data. WEKA also provides an environment to develop many machine learning algorithms. It has a set of tools for carrying out various data mining tasks such as data classification, data clustering, regression, attribute selection, frequent itemset mining, and so on.
All these tasks can be carried out on the sample .arff files available in the WEKA repository, or users can prepare their own data files. The sample .arff files are datasets containing historical data collected by researchers.
In this tutorial, we will look at some sample datasets in WEKA and build a decision tree with the J48 algorithm on the weather.nominal.arff dataset.
Exploring WEKA Datasets
The WEKA machine learning tool provides a directory of some sample datasets. These datasets can be directly loaded into WEKA for users to start developing models immediately.
The sample datasets are stored in the “C:\Program Files\Weka-3-8\data” folder of a WEKA installation on Windows. The datasets are in .arff format.
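To make the .arff layout concrete, here is a minimal Python sketch (it is not part of WEKA) that reads the header and the data section of a small ARFF text. The function name `parse_arff` and the inline sample are assumptions made for illustration. The sketch handles only simple dense files and ignores quoted values, sparse rows, and string or date attributes.

```python
def parse_arff(text):
    """Parse a simple dense ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []   # list of (name, type_spec) pairs, in declaration order
    rows = []         # one list of string values per data line
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Illustrative excerpt written in the style of weather.nominal.arff.
SAMPLE = """\
% Comment lines start with a percent sign
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,FALSE,no
overcast,FALSE,yes
rainy,TRUE,no
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print([name for name, _ in attributes])
print(len(rows), "instances")
```

The header lines (`@relation`, `@attribute`) describe the dataset, and every line after `@data` is one instance with comma-separated values in the same order as the attribute declarations.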
Sample WEKA Datasets
Some sample datasets present in WEKA are listed in the table below:
Let’s take a look at some of these:
contact-lenses.arff is a database for fitting contact lenses. It was donated by Benoit Julien in 1990.
Database: The database is complete and noise-free. It has 24 instances and 4 attributes.
Attributes: All four attributes are nominal. There are no missing attribute values. The four attributes are as follows:
#1) Age of the patient: This attribute can take the values young, pre-presbyopic, or presbyopic.
#2) Spectacle prescription: This attribute can take the values myope or hypermetrope.
#3) Astigmatic: This attribute can take the values no or yes.
#4) Tear production rate: The values can be reduced or normal.
Class: Three class labels are defined here. These are:
- the patient should be fitted with hard contact lenses.
- the patient should be fitted with soft contact lenses.
- the patient should not be fitted with contact lenses.
Class Distribution: The number of instances in each class is listed below:
|No.||Class Label||No. of Instances|
|1.||Hard contact lenses||4|
|2.||Soft contact lenses||5|
|3.||No contact lenses||15|
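The counts in the table can be turned into proportions, which shows how unbalanced the classes are. The short Python sketch below uses only the three counts from the table. The dictionary keys are shorthand labels chosen for illustration.

```python
# Class counts copied from the table above (24 instances in total).
counts = {"hard": 4, "soft": 5, "none": 15}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# The most frequent class; a classifier that always predicts it
# would be right for this share of the instances.
majority = max(counts, key=counts.get)
print("majority class:", majority)
```

A model trained on this dataset is only useful if it beats the accuracy obtained by always predicting the majority class.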
iris.arff is the Iris Plants database. It was created by R. A. Fisher and donated by Michael Marshall in 1988.
Database: This database is widely used for pattern recognition. The dataset contains 3 classes of 50 instances each, where each class represents a type of iris plant. One class is linearly separable from the other two, but the latter are not linearly separable from each other. The task is to predict which of the 3 iris species an observation belongs to, which makes this a multi-class classification dataset.
Attributes: It has 4 numeric, predictive attributes and the class. There are no missing attribute values.
The attributes are:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
The classes are:
- Iris Setosa
- Iris Versicolour
- Iris Virginica
Summary Statistics:
|Attribute||Min||Max||Mean||SD||Class Correlation|
|petal length||1.0||6.9||3.76||1.76||0.9490 (high!)|
|petal width||0.1||2.5||1.20||0.76||0.9565 (high!)|
Class Distribution: 33.3% for each of 3 classes
Some other Datasets:
diabetes.arff: This is the Pima Indians Diabetes database. It is used to predict whether a patient is likely to develop diabetes within the next 5 years. All patients in this dataset are females at least 21 years old of Pima Indian heritage. It has 768 instances and 8 numeric attributes plus a class. This is a binary classification dataset: the predicted output variable is nominal with two classes.
ionosphere.arff: This is a popular dataset for binary classification. Each instance describes the properties of radar returns from the atmosphere, and the task is to predict whether the ionosphere has some structure or not. It has 34 numeric attributes and a class.
The class attribute is “good” or “bad” and is predicted from the 34 attribute observations. The received signals are processed by an autocorrelation function that takes the time of a pulse and the pulse number as arguments.
Regression datasets: These can be downloaded from the WEKA webpage “Collections of datasets”. The collection has 37 regression problems obtained from different sources. Unpacking the downloaded file creates a numeric/ directory with the regression datasets in .arff format.
The popular datasets present in the directory are: Longley economic dataset (longley.arff), Boston house price dataset (housing.arff), and sleep in mammals data set (sleep.arff).
Let us now see how to identify real-valued and nominal attributes in the dataset using WEKA explorer.
What Are Real-valued And Nominal Attributes
Real-valued attributes are numeric attributes that contain only real values. They are measurable quantities. These attributes can be interval-scaled, such as temperature, or ratio-scaled, such as height or weight.
Nominal attributes represent names or some representation of things. There is no order in such attributes and they represent some category. For example, color.
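In an .arff file, the attribute type is visible in the declaration itself: a nominal attribute lists its allowed values in braces, while a numeric attribute uses a keyword such as `numeric` or `real`. The Python sketch below classifies a declaration on that basis. The helper name `attribute_kind` and the example declarations are assumptions made for illustration.

```python
def attribute_kind(type_spec):
    """Classify the type part of an ARFF @attribute declaration."""
    spec = type_spec.strip()
    if spec.startswith("{") and spec.endswith("}"):
        return "nominal"            # explicit list of category labels
    if spec.lower() in ("numeric", "real", "integer"):
        return "numeric"            # all three keywords denote numbers
    return "other"                  # for example string or date attributes


# Hypothetical declarations, split into (name, type) by hand.
declarations = [
    ("temperature", "real"),
    ("color", "{red, green, blue}"),
    ("duration", "numeric"),
    ("comment", "string"),
]

for name, type_spec in declarations:
    print(name, "->", attribute_kind(type_spec))
```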
Follow the steps listed below to use WEKA for identifying real-valued and nominal attributes in the dataset.
#1) Open WEKA and select “Explorer” under ‘Applications’.
#2) Select the “Preprocess” tab and click on “Open file”. The file chooser gives access to the WEKA sample files.
#3) Select the input file from the WEKA 3.8 folder stored on the local system. Choose the predefined file “credit-g.arff” and click on “Open”.
#4) The attribute list opens in the left panel. Statistics for the selected attribute are shown in the right panel along with a histogram.
Analysis of the dataset:
In the left panel the current relation shows:
- Relation name: german_credit, the name of the relation in the sample file.
- Instances: 1000, the number of data rows in the dataset.
- Attributes: 21 attributes in the dataset.
The panel below the current relation shows the names of the attributes.
In the right panel, the selected attribute statistics are displayed. Select the attribute “checking_status”.
- Name of the attribute
- Missing: Any missing values of the attribute in the dataset. 0% in this case.
- Distinct: The attribute has 4 distinct values.
- Type: The attribute is of the nominal type, that is, it takes category labels rather than numeric values.
- Count: The count column shows how many of the 1000 instances carry each distinct label.
- Histogram: The histogram is colored by the output class label. The class label in this dataset is either good or bad. There are 700 instances of good (marked in blue) and 300 instances of bad (marked in red).
- For the label <0, the instances for good and bad are almost the same in number.
- For the label 0<=X<200, there are more instances with decision good than with bad.
- Similarly, for the labels >=200 and “no checking”, instances with decision good outnumber those with bad.
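The figures that the Explorer reports for a nominal attribute (missing percentage, distinct labels, count per label, and the per-class breakdown behind the colored histogram) can be reproduced with a few lines of Python. This is a sketch on hypothetical rows, not on the real credit-g.arff data. It follows the ARFF convention that `?` marks a missing value.

```python
from collections import Counter

# Hypothetical (checking_status, class) pairs, invented for illustration.
rows = [
    ("<0", "good"), ("<0", "bad"), ("0<=X<200", "good"),
    ("no checking", "good"), ("no checking", "good"), (">=200", "good"),
    ("0<=X<200", "bad"), ("?", "good"),
]

values = [value for value, _ in rows]
present = [value for value in values if value != "?"]

missing_pct = 100 * (len(values) - len(present)) / len(values)
counts = Counter(present)                      # count per distinct label
distinct = len(counts)
# Per-class breakdown, the information shown by the colored histogram bars.
per_class = Counter((value, cls) for value, cls in rows if value != "?")

print(f"Missing: {missing_pct:.1f}%  Distinct: {distinct}")
for label, n in counts.items():
    good = per_class[(label, "good")]
    bad = per_class[(label, "bad")]
    print(f"{label}: count={n} good={good} bad={bad}")
```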
For the next attribute, “duration”, the right panel shows:
- Name: The name of the attribute.
- Type: The attribute is numeric.
- Missing: The attribute does not have any missing values.
- Distinct: It has 33 distinct values across the 1000 instances.
- Unique: It has 5 unique values, that is, values that occur in exactly one instance.
- Minimum: The minimum value of the attribute is 4.
- Maximum: The maximum value of the attribute is 72.
- Mean: The sum of all values divided by the number of instances.
- StdDev: The standard deviation of the attribute duration.
- Histogram: The histogram shows that at a duration of around 4 units, most instances belong to the good class. As the duration increases to 38 units, the number of instances with the good class label falls. The maximum duration of 72 units has only one instance, which is classified as bad.
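The difference between Distinct and Unique is easy to mix up, so here is a small Python sketch that computes the same kinds of summary figures for a numeric attribute. The list of durations is hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; the Explorer may use a slightly different definition.

```python
import statistics
from collections import Counter

# Hypothetical duration values, invented for illustration.
durations = [6, 12, 12, 24, 24, 24, 36, 48, 72]

frequency = Counter(durations)
distinct = len(frequency)                                 # different values present
unique = sum(1 for n in frequency.values() if n == 1)     # values seen exactly once

minimum = min(durations)
maximum = max(durations)
mean = statistics.mean(durations)
std_dev = statistics.stdev(durations)    # sample standard deviation (assumption)

print(f"Distinct: {distinct}  Unique: {unique}")
print(f"Min: {minimum}  Max: {maximum}")
print(f"Mean: {mean:.3f}  StdDev: {std_dev:.3f}")
```

Here 12 and 24 occur more than once, so they count as distinct values but not as unique ones.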
The class is the classification feature of the nominal type. It has two distinct values: good and bad. The good class label has 700 instances and the bad class label has 300 instances.
To visualize all the attributes of the dataset, click on “Visualize All”.
#5) To keep only the numeric attributes, use the Filter panel. Click on Choose -> weka -> filters -> unsupervised -> attribute -> RemoveType.
WEKA filters have many functionalities to transform the attribute values of the dataset to make it suitable for the algorithms. For example, the numeric transformation of attributes.
Filtering the nominal and real-valued attributes from the dataset is another example of using WEKA filters.
#6) Click on RemoveType in the filter field. An object editor window will open. Set attributeType to “Delete nominal attributes” and click on OK.
#7) Apply the filter. Only the numeric attributes will remain, together with the class attribute.
The class attribute is of the nominal type. Because it holds the output label, the filter does not delete it, so it is still shown with the numeric attributes.
The real-valued and nominal attributes in the dataset are thus identified. Visualization with the class label is seen in the form of histograms.
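The effect of the filter step can be sketched in Python as a column selection that drops nominal attributes but always keeps the class column. This is an illustration of the idea only, not WEKA code. The function name `keep_numeric`, the attribute subset, and the two sample rows are assumptions.

```python
def keep_numeric(attributes, rows, class_index):
    """Drop nominal attributes but always keep the class attribute."""
    keep = [
        i for i, (_, kind) in enumerate(attributes)
        if kind == "numeric" or i == class_index
    ]
    new_attributes = [attributes[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_attributes, new_rows


# Illustrative subset of attributes with two made-up rows.
attributes = [
    ("checking_status", "nominal"),
    ("duration", "numeric"),
    ("credit_amount", "numeric"),
    ("class", "nominal"),
]
rows = [
    ["<0", 6, 1169, "good"],
    ["0<=X<200", 48, 5951, "bad"],
]

new_attributes, new_rows = keep_numeric(attributes, rows, class_index=3)
print([name for name, _ in new_attributes])
print(new_rows)
```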
Weka Decision Tree Classification Algorithms
Now, we will see how to implement decision tree classification on weather.nominal.arff dataset using the J48 classifier.
It is a sample dataset present in the data directory of WEKA. This dataset predicts whether the weather is suitable for playing cricket. The dataset has 5 attributes and 14 instances. The class label “play” classifies the output as “yes” or “no”.
What Is Decision Tree
A decision tree is a classification technique built from three components: the root node, branches (edges or links), and leaf nodes. The root and the other internal nodes each represent a test condition on an attribute, the branches represent the possible outcomes of that test, and the leaf nodes contain the class label assigned to the instances that reach them. The root node is the starting point of the tree, also called the top of the tree.
J48 is the WEKA implementation of C4.5 (an extension of ID3), an algorithm for generating a decision tree. It is also known as a statistical classifier. For decision tree classification, we need a dataset.
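ID3 chooses the attribute to test at each node by information gain, and C4.5 refines this with the gain ratio. The Python sketch below computes both measures for the four predictive attributes of the weather.nominal data. It is a simplified illustration, not the J48 implementation: it leaves out pruning, missing-value handling, and the extra heuristics that C4.5 applies when comparing gain ratios. The function names are chosen for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
# The 14 instances of the weather.nominal dataset, one per line.
DATA = [row.split(",") for row in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, col, class_col=-1):
    """Reduction in class entropy obtained by splitting on column col."""
    base = entropy([r[class_col] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[class_col])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def gain_ratio(rows, col, class_col=-1):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy([r[col] for r in rows])
    return info_gain(rows, col, class_col) / split_info if split_info else 0.0


gains = {ATTRS[i]: info_gain(DATA, i) for i in range(4)}
ratios = {ATTRS[i]: gain_ratio(DATA, i) for i in range(4)}
best = max(ratios, key=ratios.get)

for name in gains:
    print(f"{name}: gain={gains[name]:.3f} ratio={ratios[name]:.3f}")
print("best attribute for the root:", best)
```

The attribute with the highest score becomes the test at the root, and the same procedure is repeated on each subset of instances further down the tree.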
#1) Open WEKA explorer.
#2) Under the “Preprocess” tab, click on “Open file” and select the weather.nominal.arff file.
#3) Go to the “Classify” tab for classifying the unclassified data. Click on the “Choose” button and select “trees -> J48”. Let us also have a quick look at the other groups offered by the Choose button:
- bayes: Probabilistic classifiers based on Bayes’ theorem, such as NaiveBayes.
- meta: Meta-learners that wrap or combine other classifiers, such as Bagging and AdaBoostM1.
- functions: Function-based models, such as Logistic (logistic regression) and SMO.
- lazy: Instance-based learners that defer the work until prediction time, such as IBk.
- rules: Rule learners, such as JRip and OneR.
- trees: Decision tree learners, such as J48 and RandomForest.
#4) Click on the Start button. The classifier output will be shown in the right-hand panel. It shows the run information as:
- Scheme: The classification algorithm used.
- Instances: Number of data rows in the dataset.
- Attributes: The dataset has 5 attributes.
- Number of leaves and size of the tree: These describe the decision tree.
- Time taken to build model: The time needed to build the classifier.
- The J48 pruned tree, printed in text form with the attributes tested and the number of instances at each leaf.
#5) To visualize the tree, right-click on the entry in the result list and select “Visualize tree”.
The output is in the form of a decision tree. The main attribute is “outlook”.
If the outlook is sunny, then the tree further analyzes the humidity. If humidity is high, then the class label play = “no”; if humidity is normal, play = “yes”.
If the outlook is overcast, the class label play is “yes”. The number of instances which obey this classification is 4.
If the outlook is rainy, further classification takes place on the attribute “windy”. If windy = true, then play = “no”. The number of instances which obey the classification for outlook = rainy and windy = true is 2.
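A decision tree is equivalent to a set of nested if-then rules. The Python sketch below writes the tree that J48 typically produces for weather.nominal as a plain function and checks it against the 14 training instances. The function name `predict` is chosen for illustration, and the rules are transcribed by hand rather than exported from WEKA.

```python
def predict(outlook, humidity, windy):
    """Hand-written rules mirroring the decision tree for weather.nominal."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    # Remaining case: outlook == "rainy"
    return "no" if windy == "TRUE" else "yes"


# The 14 instances: outlook, temperature, humidity, windy, play.
DATA = [row.split(",") for row in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]

correct = sum(
    predict(outlook, humidity, windy) == play
    for outlook, _temperature, humidity, windy, play in DATA
)
print(f"{correct} of {len(DATA)} training instances classified correctly")
```

Note that temperature never appears in the rules: the tree does not need it to separate the training instances. Accuracy on the training data is an optimistic estimate, so a test set or cross-validation gives a fairer picture.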
WEKA offers a wide range of sample datasets to apply machine learning algorithms. Users can perform machine learning tasks such as classification, regression, attribute selection, and association on these sample datasets, and can also learn the tool using them.
WEKA Explorer is used for performing several functions, starting from preprocessing. Preprocessing takes a .arff file as input, processes it, and gives an output that can be used by other computer programs. In WEKA, the output of preprocessing lists the attributes present in the dataset, which can be further used for statistical analysis and comparison with class labels.
WEKA also offers many classification algorithms for decision trees. J48 is one of the popular classification algorithms that outputs a decision tree. Using the Classify tab, the user can build and visualize the decision tree. If the decision tree is too large, it can be simplified by removing the attributes that are not required from the Preprocess tab and starting the classification process again.