WEKA Dataset, Classifier And J48 Algorithm For Decision Tree

This tutorial explains WEKA datasets, classifiers, and the J48 algorithm for decision trees. It also provides information about sample ARFF datasets for WEKA.

In the previous tutorial, we learned about the Weka Machine Learning tool, its features, and how to download, install, and use Weka Machine Learning software.

WEKA is a library of machine learning algorithms to solve data mining problems on real data. WEKA also provides an environment to develop many machine learning algorithms. It has a set of tools for carrying out various data mining tasks such as data classification, data clustering, regression, attribute selection, frequent itemset mining, and so on.

All these tasks can be carried out on the sample .arff files available in the WEKA repository, or users can prepare their own data files. The sample .arff files are datasets that contain historical data collected by researchers.

WEKA DataSets

In this tutorial, we will see some sample datasets in WEKA and will also build a decision tree using the weather.nominal.arff dataset.

Exploring WEKA Datasets

The WEKA machine learning tool provides a directory of some sample datasets. These datasets can be directly loaded into WEKA for users to start developing models immediately.

The WEKA datasets can be found in the data folder of the WEKA installation, for example “C:\Program Files\Weka-3-8\data”. The datasets are in .arff format.

Explore datasets

Sample WEKA Datasets

Some sample datasets present in WEKA are listed in the table below:

S.No.   Sample Dataset
1.      airline.arff
2.      breast-cancer.arff
3.      contact-lenses.arff
4.      cpu.arff
5.      cpu.with.vendor.arff
6.      credit-g.arff
7.      diabetes.arff
8.      glass.arff
9.      hypothyroid.arff
10.     ionosphere.arff
11.     iris.2D.arff
12.     iris.arff
13.     labor.arff
14.     ReutersCorn-train.arff
15.     ReutersCorn-test.arff
16.     ReutersGrain-train.arff
17.     ReutersGrain-test.arff
18.     segment-challenge.arff
19.     segment-test.arff
20.     soybean.arff
21.     supermarket.arff
22.     unbalanced.arff
23.     vote.arff
24.     weather.numeric.arff
25.     weather.nominal.arff

Let’s take a look at some of these:

contact-lenses.arff

The contact-lenses.arff dataset is a database for fitting contact lenses. It was donated by Benoit Julien in 1990.

ContactLenses dataset

Database: This database is complete. The examples used in this database are complete and noise-free. The database has 24 instances and 4 attributes.

Attributes: All four attributes are nominal. There are no missing attribute values. The four attributes are as follows:

#1) Age of the patient: The attribute age can take values:

  • young
  • pre-presbyopic
  • presbyopic

#2) Spectacle prescription: This attribute can take values:

  • myope
  • hypermetrope

#3) Astigmatic: This attribute can take values:

  • no
  • yes

#4) Tear production rate: The values can be

  • reduced
  • normal

Class: Three class labels are defined here. These are:

  • the patient should be fitted with hard contact lenses.
  • the patient should be fitted with soft contact lenses.
  • the patient should not be fitted with contact lenses.

Class Distribution: The number of instances in each class is listed below:

S.No.   Class Label            No. of Instances
1.      Hard contact lenses    4
2.      Soft contact lenses    5
3.      No contact lenses      15

iris.arff

The iris.arff dataset is the Iris Plants database. It was created by R.A. Fisher and donated by Michael Marshall in 1988.

iris.arff

Database: This database is used for pattern recognition. The data set contains 3 classes of 50 instances each. Each class represents a type of iris plant. One class is linearly separable from the other 2, but the latter are not linearly separable from each other. The task is to predict which of the 3 iris species an observation belongs to. This makes it a multi-class classification dataset.

Attributes: It has 4 numeric, predictive attributes, and the class. There are no missing attribute values.

The attributes are:

  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm
  • class:
    • Iris Setosa
    • Iris Versicolour
    • Iris Virginica

Summary Statistics:

               Min   Max   Mean   SD     Class Correlation
sepal length   4.3   7.9   5.84   0.83    0.7826
sepal width    2.0   4.4   3.05   0.43   -0.4194
petal length   1.0   6.9   3.76   1.76    0.9490 (high!)
petal width    0.1   2.5   1.20   0.76    0.9565 (high!)

Class Distribution: 33.3% for each of 3 classes

Some other Datasets:

diabetes.arff

This is the Pima Indians Diabetes database. It is used to predict whether a patient will develop diabetes within the next 5 years. The patients in this dataset are all females at least 21 years of age and of Pima Indian heritage. It has 768 instances and 8 numerical attributes plus a class. This is a binary classification dataset: the predicted output variable is nominal and comprises two classes.

ionosphere.arff

This is a popular dataset for binary classification. Each instance in this dataset describes the properties of radar returns from the atmosphere. It is used to predict whether the ionosphere has some structure or not. It has 34 numerical attributes and a class.

The class attribute is “good” or “bad”, and it is predicted from the 34 attribute values. The received signals are processed by an autocorrelation function that takes the time of a pulse and the pulse number as arguments.

Regression Datasets

The regression datasets can be downloaded from the WEKA webpage “Collections of datasets”. The collection contains 37 regression problems obtained from different sources. Extracting the downloaded file creates a numeric/ directory containing the regression datasets in .arff format.

The popular datasets present in the directory are: Longley economic dataset (longley.arff), Boston house price dataset (housing.arff), and sleep in mammals data set (sleep.arff).

Let us now see how to identify real-valued and nominal attributes in the dataset using WEKA explorer.

What Are Real-valued And Nominal Attributes

Real-valued attributes are numeric attributes containing only real values. These are measurable quantities. These attributes can be interval scaled, such as temperature, or ratio scaled, such as length or weight.

Nominal attributes represent names or some representation of things. There is no order in such attributes and they represent some category. For example, color.
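In an ARFF file the distinction is visible in the header: a nominal attribute is declared with a brace-enclosed list of its values, while a real-valued attribute is declared with a keyword such as numeric or real. The following is a minimal Python sketch of that check, assuming simple one-line declarations with unquoted attribute names. The function name attribute_kind and the three header lines are illustrative only, not taken from a particular WEKA file.

```python
def attribute_kind(declaration):
    """Classify one ARFF '@attribute' line as 'nominal', 'numeric', or 'other'."""
    # Split into three parts: '@attribute', the name, and the type specification.
    parts = declaration.strip().split(None, 2)
    if len(parts) < 3 or parts[0].lower() != "@attribute":
        raise ValueError("not an @attribute declaration: " + declaration)
    spec = parts[2].strip()
    if spec.startswith("{"):
        return "nominal"  # brace-enclosed list of category labels
    if spec.lower() in ("numeric", "real", "integer"):
        return "numeric"  # real-valued attribute
    return "other"  # for example string or date attributes


# Illustrative header lines (assumed, not copied from a specific file).
header = [
    "@attribute outlook {sunny, overcast, rainy}",
    "@attribute temperature real",
    "@attribute play {yes, no}",
]

kinds = {line.split()[1]: attribute_kind(line) for line in header}
print(kinds)
```

The WEKA Explorer reports the same information in the Type field of the selected attribute, as the steps below show.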

Follow the steps listed below to use WEKA for identifying real-valued and nominal attributes in the dataset.

#1) Open WEKA and select “Explorer” under ‘Applications’.

WEKA Explorer

#2) Select the “Preprocess” tab. Click on “Open file”. From the file browser you can access the WEKA sample files.

Select Pre-Process

#3) Select the input file from the WEKA 3.8 folder stored on the local system. Select the predefined file “credit-g.arff” and click on “Open”.

Select the predefined file “credit-g.arff”

#4) An attribute list will open on the left panel. Selected attribute statistics will be shown on the right panel along with the histogram.

Analysis of the dataset:

In the left panel the current relation shows:

  • Relation name: german_credit, the relation defined in the sample file.
  • Instances: 1000 data rows in the dataset.
  • Attributes: 21 attributes in the dataset.

The panel below current relation shows the name of attributes.

In the right panel, the selected attribute statistics are displayed. Select the attribute “checking_status”.

It shows:

  • Name of the attribute
  • Missing: The share of instances with a missing value for this attribute. It is 0% in this case.
  • Distinct: The attribute has 4 distinct values.
  • Type: The attribute is of the nominal type, that is, it takes category labels rather than numeric values.
  • Count: Among the 1000 instances, the count of each distinct label is shown in the count column.
  • Histogram: It shows how the class labels are distributed over the values of the attribute. The class label in this dataset is either good or bad. There are 700 instances of good (marked in blue) and 300 instances of bad (marked in red).
    • For the label < 0, the numbers of good and bad instances are almost the same.
    • For the label 0 <= X < 200, there are more good instances than bad ones.
    • Similarly, for the label >= 200 most instances are good, and the no checking label also has more good instances than bad ones.

select attribute

Next, select the attribute “duration”.

The right panel shows:

  • Name: This is the name of the attribute.
  • Type: The type of the attribute is numeric.
  • Missing value: The attribute does not have any missing values.
  • Distinct: It has 33 distinct values across the 1000 instances.
  • Unique: It has 5 unique values, that is, values that occur in only one instance.
  • Minimum value: The min value of the attribute is 4.
  • Maximum value: The max value of the attribute is 72.
  • Mean: The mean is the sum of all the values divided by the number of instances.
  • Standard deviation: The standard deviation of the attribute duration.
  • Histogram: The histogram shows that at a duration of around 4 units most instances belong to the good class. As the duration increases to 38 units, the number of good instances falls. At the maximum duration of 72 units there is only one instance, and it is classified as bad.
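The statistics in the list above can be reproduced by hand. The Python sketch below computes them for a small hypothetical list of durations (not the real credit-g values). It assumes that “Unique” counts the values occurring exactly once and that the standard deviation is the sample standard deviation (n - 1 denominator); both are assumptions about how WEKA defines these fields. The function name summarize is illustrative.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(values):
    """Return Explorer-style summary statistics for one numeric attribute."""
    counts = Counter(values)
    return {
        "distinct": len(counts),  # number of different values
        "unique": sum(1 for c in counts.values() if c == 1),  # values seen exactly once
        "min": min(values),
        "max": max(values),
        "mean": mean(values),  # sum of values / number of instances
        "stddev": stdev(values),  # sample standard deviation (assumed)
    }


# Hypothetical durations, chosen only to illustrate the calculation.
durations = [4, 6, 6, 12, 12, 12, 24, 72]

summary = summarize(durations)
for name, value in summary.items():
    print(name, value)
```

Here the values 6 and 12 are distinct but not unique, because each of them occurs more than once.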

attribute “duration”

histogram

The class is the classification feature of the nominal type. It has two distinct values: good and bad. The good class label has 700 instances and the bad class label has 300 instances.

Class label

To visualize all the attributes of the dataset, click on “Visualize All”.

Visualize all

#5) To find out only numeric attributes, click on the Choose button in the Filter panel. From there, select weka -> filters -> unsupervised -> attribute -> RemoveType.

WEKA filters have many functionalities to transform the attribute values of the dataset to make it suitable for the algorithms. For example, the numeric transformation of attributes.

Filtering the nominal and real-valued attributes from the dataset is another example of using WEKA filters.

WEKA filter

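#6) Click on RemoveType in the filter panel. An object editor window will open. Select attributeType “Delete numeric attributes” and click on OK. (As its name says, this option removes the numeric attributes; to keep only the numeric attributes, as described in the next step, also set invertSelection to True.)

Delete numeric attributes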

#7) Apply the filter. Only numeric attributes will be shown.

The class attribute is of the nominal type. It classifies the output and hence is not deleted by the filter. Thus it is seen alongside the numeric attributes.
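The effect of the filter step can be sketched in a few lines of Python: keep every numeric attribute and always retain the class attribute. This is only an illustration of the outcome, not WEKA's implementation. The function name keep_numeric is hypothetical, and the attribute list is an abbreviated subset assumed for the example.

```python
def keep_numeric(attributes, class_name):
    """Return the names of the numeric attributes plus the class attribute."""
    return [
        name
        for name, kind in attributes
        if kind == "numeric" or name == class_name
    ]


# Abbreviated, assumed subset of (name, type) pairs for illustration.
attributes = [
    ("checking_status", "nominal"),
    ("duration", "numeric"),
    ("credit_history", "nominal"),
    ("credit_amount", "numeric"),
    ("class", "nominal"),
]

remaining = keep_numeric(attributes, class_name="class")
print(remaining)
```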

Only numeric

Output:

The real-valued and nominal attributes in the dataset are identified. Visualization with the class label is seen in the form of histograms.

Weka Decision Tree Classification Algorithms

Now, we will see how to implement decision tree classification on weather.nominal.arff dataset using the J48 classifier.

weather.nominal.arff

It is a sample dataset present in the data directory of WEKA. This dataset predicts if the weather is suitable for playing cricket. The dataset has 5 attributes and 14 instances. The class label “play” classifies the output as “yes” or “no”.
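To make the file format concrete, the Python sketch below embeds the weather data as ARFF text and reads it with a very small parser. The file contents are reproduced from memory of the standard WEKA sample and should be checked against your own copy. The parser (parse_arff is a hypothetical name) handles only this simple case: no quoted values, no sparse rows, no missing-value handling.

```python
from collections import Counter

# Assumed contents of weather.nominal.arff; verify against your own copy.
ARFF_TEXT = """\
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""


def parse_arff(text):
    """Return (attribute names, data rows) from simple ARFF text."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            names.append(line.split()[1])
        elif line.lower().startswith("@data"):
            in_data = True
    return names, rows


names, rows = parse_arff(ARFF_TEXT)
class_counts = Counter(row[-1] for row in rows)  # the last column is the class
print(len(names), "attributes,", len(rows), "instances")
print(dict(class_counts))
```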

What Is Decision Tree

A decision tree is a classification technique built from three components: the root node, branches (edges or links), and leaf nodes. The root and the other internal nodes represent test conditions on attributes, each branch represents one possible outcome of a test, and each leaf node holds the class label assigned to the instances that reach it. The root node is the starting point of the tree, also called the top of the tree.

J48 Classifier

J48 is WEKA's implementation of the C4.5 algorithm (an extension of ID3) for generating a decision tree. C4.5 is also known as a statistical classifier. For decision tree classification, we need a dataset.
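C4.5 chooses the attribute to split on by comparing information-based scores. The Python sketch below computes the information gain and the gain ratio of each attribute for the 14 weather instances. It is a simplified illustration under stated assumptions: the data rows are reproduced from memory of the sample file, the names DATA and gain_and_ratio are hypothetical, and the additional heuristics and pruning that J48 applies are not shown.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Assumed weather data: (outlook, temperature, humidity, windy, play).
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_and_ratio(rows, index):
    """Return (information gain, gain ratio) for splitting on column `index`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[-1] for row in rows]) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([row[index] for row in rows])
    return gain, gain / split_info


scores = {name: gain_and_ratio(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best = max(scores, key=lambda name: scores[name][1])
for name, (gain, ratio) in scores.items():
    print(f"{name}: gain={gain:.3f} ratio={ratio:.3f}")
print("best split:", best)
```

On this data the attribute with the highest score is outlook, which is why it appears at the root of the tree built in the steps that follow.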

Steps include:

#1) Open WEKA explorer.

#2) Select the weather.nominal.arff file using “Open file” under the Preprocess tab.

Choose dataset

#3) Go to the “Classify” tab for classifying the unclassified data. Click on the “Choose” button. From this, select “trees -> J48”. Let us also have a quick look at the other groups of classifiers under the Choose button:

  • Bayes: Bayesian classifiers such as NaiveBayes.
  • Meta: Meta-learners that wrap other classifiers, such as Bagging and AdaBoostM1.
  • Functions: Function-based classifiers such as Logistic (logistic regression).
  • Lazy: Instance-based learners such as IBk and KStar.
  • Rules: Rule learners such as JRip and OneR.
  • Trees: Decision tree learners such as J48.

Classify tab

#4) Click on Start Button. The classifier output will be seen on the Right-hand panel. It shows the run information in the panel as:

  • Scheme: The classification algorithm used.
  • Instances: The number of data rows in the dataset.
  • Attributes: The dataset has 5 attributes.
  • Number of leaves and size of the tree: These describe the decision tree.
  • Time taken to build the model: The time needed to build the classifier.
  • The full J48 pruned tree, with the attributes tested and the number of instances at each leaf.

Classified output information

Visualize tree

#5) To visualize the tree, right-click on the result in the result list and select “Visualize tree”.

 Decision tree

Output:

The output is in the form of a decision tree. The main attribute, at the root, is “outlook”.

If the outlook is sunny, then the tree further analyzes the humidity. If humidity is high, the class label play = “no”; if humidity is normal, play = “yes”.

If the outlook is overcast, the class label play is “yes”. The number of instances which obey the classification is 4.

If the outlook is rainy, further classification takes place to analyze the attribute “windy”. If windy = true, then play = “no”. The number of instances which obey the classification for outlook = rainy and windy = true is 2.
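A decision tree of this kind can be read as a set of nested rules. The Python sketch below writes the tree out as a function and checks it against the 14 training instances. It assumes the standard tree reported for this dataset (sunny splits on humidity, rainy splits on windy) and data rows reproduced from memory of the sample file. The function name predict_play is hypothetical.

```python
def predict_play(outlook, humidity, windy):
    """Apply the decision tree rules for the weather data."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        # Sunny days are decided by humidity.
        return "yes" if humidity == "normal" else "no"
    if outlook == "rainy":
        # Rainy days are decided by wind.
        return "no" if windy == "TRUE" else "yes"
    raise ValueError("unknown outlook: " + outlook)


# Assumed weather data: (outlook, temperature, humidity, windy, play).
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

# Count how many training instances the rules classify correctly.
correct = sum(
    predict_play(outlook, humidity, windy) == play
    for outlook, _temperature, humidity, windy, play in DATA
)
print(correct, "of", len(DATA), "instances classified correctly")
```

The attribute temperature is not used by any rule, which reflects that the tree does not test it.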

Conclusion

WEKA offers a wide range of sample datasets to apply machine learning algorithms. The users can perform machine learning tasks such as classification, regression, attribute selection, association on these sample datasets, and can also learn the tool using them.

WEKA explorer is used for performing several functions, starting from preprocessing. Preprocessing takes input as a .arff file, processes the input, and gives an output that can be used by other computer programs. In WEKA the output of preprocessing gives the attributes present in the dataset which can be further used for statistical analysis and comparison with class labels.

WEKA also offers many classification algorithms that build decision trees. J48 is one of the popular ones. After running it from the Classify tab, the user can visualize the resulting decision tree. If the tree is too crowded, the attributes that are not required can be removed in the Preprocess tab and the classification process started again.