How to Use Weka Datasets: Free ARFFs and J48 Algorithm

This tutorial explains WEKA datasets, classifiers, and the J48 algorithm for decision trees. It also provides information about sample ARFF datasets for WEKA.

In the previous tutorial, we learned about the Weka Machine Learning tool, its features, and how to download, install, and use Weka Machine Learning software.

For those who are learning machine learning using WEKA, finding the appropriate data set can be the most difficult part. This article will present some of the most common data sets used in WEKA, such as Iris.arff, Weather, Glass, and Diabetes, as well as how to download and load them, and examples for the J48 algorithm.

For both beginners and students, this article will explain how to select the appropriate data set and use it with the help of WEKA.

Table of Contents:

What is WEKA?
What Are WEKA Datasets?
Best WEKA Sample Datasets for Beginners
How to Download and Install WEKA Datasets
- Understanding ARFF Files: Real-Valued vs Nominal Attributes
What Is a Decision Tree?
Decision Tree Algorithms in WEKA
- J48 Algorithm in WEKA (Step-by-Step Example)
Common WEKA Dataset Loading Errors and Their Solutions
Conclusion

What is WEKA?

WEKA (Waikato Environment for Knowledge Analysis) is a library of machine learning algorithms to solve data mining problems on real data. WEKA also provides an environment to develop many machine-learning algorithms. It has a set of tools for carrying out various data mining tasks such as data classification, data clustering, regression, attribute selection, frequent itemset mining, and so on.


In this tutorial, we will look at some sample datasets in WEKA and build a decision tree on the weather.nominal.arff dataset.

What Are WEKA Datasets?

WEKA datasets are sample data files used within the WEKA machine learning application to test and analyze data mining algorithms. Most WEKA datasets are saved in ARFF (Attribute-Relation File Format), which contains both the structure of the dataset and the actual data. Examples include iris.arff, weather.nominal.arff, and credit-g.arff. You can download these datasets from the WEKA sample dataset repository.

The WEKA machine learning tool provides a directory of some sample datasets. These datasets can be directly loaded into WEKA for users to start developing models immediately.

The sample datasets are stored in the data folder of the WEKA installation, for example “C:\Program Files\Weka-3-8\data”. The datasets are in .arff format.
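To make the ARFF layout concrete, here is a minimal Python sketch that reads the header and data sections of a small ARFF text. The embedded sample and the relation name `weather_excerpt` are made up for illustration, and the parser is an assumption-laden simplification: it does not handle quoted values, sparse rows, or date and string attributes the way WEKA's own loader does.

```python
# Minimal ARFF reader: a sketch for simple files only (no quoted values,
# no sparse rows, no date or string attributes).

SAMPLE_ARFF = """\
% Hypothetical excerpt in the style of the weather sample data
@relation weather_excerpt

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,FALSE,no
overcast,FALSE,yes
rainy,TRUE,no
"""


def parse_arff(text):
    """Return (relation, attributes, rows) from simple ARFF text."""
    relation = None
    attributes = []  # list of (name, type string or list of nominal values)
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):  # nominal: keep the allowed values
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, "-", len(attributes), "attributes,", len(rows), "instances")
```

The header (`@relation`, `@attribute`) carries the structure and everything after `@data` carries the instances, which is the split described above.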

Best WEKA Sample Datasets for Beginners

Some sample datasets present in WEKA are listed in the table below:

S.No.	Sample Datasets
1.	airline.arff
2.	breast-cancer.arff
3.	contact-lens.arff
4.	cpu.arff
5.	cpu.with-vendor.arff
6.	credit-g.arff
7.	diabetes.arff
8.	glass.arff
9.	hypothyroid.arff
10.	ionosphere.arff
11.	iris.2D.arff
12.	iris.arff
13.	labor.arff
14.	ReutersCorn-train.arff
15.	ReutersCorn-test.arff
16.	ReutersGrain-train.arff
17.	ReutersGrain-test.arff
18.	segment-challenge.arff
19.	segment-test.arff
20.	soybean.arff
21.	supermarket.arff
22.	unbalanced.arff
23.	vote.arff
24.	weather.numeric.arff
25.	weather.nominal.arff

Let’s take a look at some of these:

contact-lens.arff

The contact-lens.arff dataset is a database for fitting contact lenses. It was donated by Benoit Julien in 1990.

Database: This database is complete. The examples used in this database are complete and noise-free. The database has 24 instances and 4 attributes.

Attributes: All four attributes are nominal. There are no missing attribute values. The four attributes are as follows:

#1) Age of the patient: The attribute age can take values:

young
pre-presbyopic
presbyopic

#2) Spectacle prescription: This attribute can take values:

myope
hypermetrope

#3) Astigmatic: This attribute can take values:

no
yes

#4) Tear production rate: The values can be

reduced
normal

Class: Three class labels are defined here. These are:

the patient should be fitted with hard contact lenses.
the patient should be fitted with soft contact lenses.
the patient should not be fitted with contact lenses.

Class Distribution: The number of instances in each class is listed below:

	Class Label	No of Instances
1.	Hard contact lenses	4
2.	Soft contact lenses	5
3.	No contact lenses	15

iris.arff

The iris.arff dataset is the Iris Plants database. The data was collected by R.A. Fisher and donated in 1988 by Michael Marshall.

Database: This database is used for pattern recognition. The data set contains 3 classes of 50 instances each. Each class represents a type of iris plant. One class is linearly separable from the other 2, but the latter are not linearly separable from each other. The task is to predict which of the 3 iris species an observation belongs to, which makes this a multi-class classification dataset.

Attributes: It has 4 numeric, predictive attributes, and the class. There are no missing attribute values.

The attributes are:

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
- Iris Setosa
- Iris Versicolour
- Iris Virginica

Summary Statistics:

	Min	Max	Mean	SD	Class Correlation
sepal length	4.3	7.9	5.84	0.83	0.7826
sepal width	2.0	4.4	3.05	0.43	-0.4194
petal length	1.0	6.9	3.76	1.76	0.9490 (high!)
petal width	0.1	2.5	1.20	0.76	0.9565 (high!)

Class Distribution: 33.3% for each of 3 classes

Some other Datasets:

diabetes.arff

This dataset is the Pima Indians Diabetes database. It is used to predict whether a patient will develop diabetes within the next 5 years. The patients in this dataset are all females of at least 21 years of age of Pima Indian heritage. It has 768 instances and 8 numerical attributes plus a class. This is a binary classification dataset whose output variable is nominal with two classes.

ionosphere.arff

This is a popular dataset for binary classification. Each instance in this dataset describes the properties of radar returns from the atmosphere. It is used to predict whether the ionosphere has some structure or not. It has 34 numerical attributes and a class.

The class attribute is “good” or “bad” which is predicted based on 34 attribute observations. The received signals are processed by the autocorrelation function taking time pulse and pulse number as arguments.

Regression Datasets

The regression datasets can be downloaded from the WEKA webpage “Collections of datasets”. The collection has 37 regression problems obtained from different sources. Extracting the downloaded file creates a numeric/ directory with regression datasets in .arff format.

The popular datasets present in the directory are the Longley economic dataset (longley.arff), Boston house price dataset (housing.arff), and sleep in mammals data set (sleep.arff).

Let us now see how to identify real-valued and nominal attributes in the dataset using WEKA explorer.

How to Download and Install WEKA Datasets

Download and installation of the WEKA datasets is easy. You can use the sample datasets from the WEKA website and install them in the WEKA interface within a matter of minutes. For a comprehensive step-by-step guide, visit our page dedicated to downloading and installing WEKA datasets.

Understanding ARFF Files: Real-Valued vs Nominal Attributes

Real-valued attributes are numeric attributes containing only real values. These are measurable quantities. These attributes can be interval scaled, such as temperature, or ratio scaled, such as weight or duration.

Nominal attributes represent names or some representation of things. There is no order in such attributes and they represent some category. For example, color.
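In an ARFF header the distinction shows up in the `@attribute` declaration: numeric attributes are declared with a type keyword, while nominal attributes list their allowed values in braces. The following Python sketch classifies a declaration line accordingly. The two example declarations are hypothetical, and the function assumes attribute names without spaces or quotes.

```python
def attribute_kind(declaration):
    """Classify one '@attribute name type' line as numeric, nominal, or other."""
    parts = declaration.strip().split(None, 2)
    if len(parts) < 3 or parts[0].lower() != "@attribute":
        raise ValueError("not an @attribute declaration: " + declaration)
    spec = parts[2].strip()
    if spec.startswith("{"):
        return "nominal"   # allowed values are listed, e.g. {red, green}
    if spec.lower() in ("numeric", "real", "integer"):
        return "numeric"   # real-valued (or integer) measurements
    return "other"         # string, date, or relational attributes


# Hypothetical declarations, for illustration only.
print(attribute_kind("@attribute temperature numeric"))
print(attribute_kind("@attribute color {red, green, blue}"))
```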

Follow the steps listed below to use WEKA for identifying real-valued and nominal attributes in the dataset.

#1) Open WEKA and select “Explorer” under ‘Applications’.

#2) Select the “Preprocess” tab. Click on “Open file”. From here you can access the WEKA sample files.

#3) Browse to the data folder of the WEKA 3.8 installation on the local system. Select the predefined file “credit-g.arff” and click on “Open”.

#4) An attribute list will open on the left panel. Selected attribute statistics will be shown on the right panel along with the histogram.

Analysis of the dataset:

In the left panel the current relation shows:

Relation name: german_credit, the relation defined in the sample file.
Instances: 1000, the number of data rows in the dataset.
Attributes: 21 attributes in the dataset.

The panel below the current relation shows the name of attributes.

In the right panel, the selected attribute statistics are displayed. Select the attribute “checking_status”.

It shows:

Name of the attribute
Missing: Any missing values of the attribute in the dataset. 0% in this case.
Distinct: The attribute has 4 distinct values.
Type: The attribute is nominal, that is, it takes category labels rather than numeric values.
Count: For each of the 4 distinct labels, the count column shows how many of the 1000 instances carry that label.
Histogram: It shows how the output class label is distributed over each value of the attribute. The class label in this dataset is either good or bad. There are 700 instances of good (marked in blue) and 300 instances of bad (marked in red).
- For the label < 0, the instances for good or bad are almost the same in number.
- For the label 0 <= X < 200, the instances with decision good outnumber the instances with decision bad.
- Similarly, for the label >= 200, most instances are good, and the label “no checking” has the most instances with decision good.

Next, select the attribute “duration”.

The right panel shows:

Name: This is the Name of the attribute.
Type: The type of the attribute is numeric.
Missing value: The attribute does not have any missing value.
Distinct: It has 33 distinct values across the 1000 instances.
Unique: It has 5 unique values, that is, values that occur in exactly one instance.
Minimum value: The min value of the attribute is 4.
Maximum Value: The max value of the attribute is 72.
Mean: The sum of all the values divided by the number of instances.
Standard Deviation: The standard deviation of the attribute duration.
Histogram: At the low end of the range, starting at 4 units, most instances belong to the good class. As the duration increases towards 38 units, the number of instances with the good class label falls. The longest duration, 72 units, has only one instance, and it is classified as bad.
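The summary fields listed above can be reproduced by hand. The Python sketch below computes missing, distinct, unique, minimum, maximum, mean, and standard deviation for one numeric column. The `durations` list is made-up sample data, not taken from credit-g.arff, and the use of the sample standard deviation (`statistics.stdev`) is an assumption about how the figure is derived.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_numeric(values, missing_marker="?"):
    """Summarize one numeric column given as a list of strings."""
    present = [float(v) for v in values if v != missing_marker]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),                              # different values seen
        "unique": sum(1 for c in counts.values() if c == 1),  # seen exactly once
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "stddev": stdev(present),  # sample standard deviation (assumption)
    }


# Made-up durations, for illustration only.
durations = ["6", "12", "12", "24", "?", "48", "24", "12"]
summary = summarize_numeric(durations)
for field, value in summary.items():
    print(field, value)
```

Note the difference between the two counts: 12 and 24 are distinct values but not unique ones, because each occurs more than once.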

The class is the classification feature of the nominal type. It has two distinct values: good and bad. The good class label has 700 instances and the bad class label has 300 instances.

To visualize all the attributes of the dataset, click on “Visualize All”.

#5) To keep only the numeric attributes, go to the Filter panel and click on Choose -> weka -> filters -> unsupervised -> attribute -> RemoveType.

WEKA filters have many functionalities to transform the attribute values of the dataset to make it suitable for the algorithms. For example, the numeric transformation of attributes.

Filtering the nominal and real-valued attributes from the dataset is another example of using WEKA filters.

#6) Click on RemoveType in the filter panel. An object editor window will open. RemoveType deletes every attribute of the selected type, so to keep the numeric attributes select attributeType “Delete nominal attributes” (or select “Delete numeric attributes” and set invertSelection to True), then click on OK.

#7) Apply the filter. Only numeric attributes will be shown.

The class attribute is of the nominal type. It classifies the output and hence is not deleted. Thus it is shown along with the numeric attributes.
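The effect of this kind of type-based filtering can be sketched in a few lines of Python. This is not WEKA's implementation, only an illustration of the idea: drop every column of one type while always keeping the class column. The attribute names follow credit-g.arff, but the two data rows are made up.

```python
def remove_type(attributes, rows, kind_to_delete, class_index):
    """Drop every attribute of the given kind, but never the class attribute."""
    keep = [
        i for i, (_, kind) in enumerate(attributes)
        if kind != kind_to_delete or i == class_index
    ]
    new_attributes = [attributes[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_attributes, new_rows


attributes = [
    ("checking_status", "nominal"),
    ("duration", "numeric"),
    ("credit_amount", "numeric"),
    ("class", "nominal"),
]
# Illustrative rows only, not copied from the dataset.
rows = [
    ["<0", "6", "1169", "good"],
    ["no checking", "48", "5951", "bad"],
]

kept, filtered = remove_type(attributes, rows, "nominal", class_index=3)
print([name for name, _ in kept])
```

Here the nominal `checking_status` column is removed, while the nominal `class` column survives because it is the class attribute.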

Output:

The real-valued and nominal attributes in the dataset are identified. Visualization with the class label is seen in the form of histograms.

What Is a Decision Tree?

A decision tree is a classification technique that consists of three components: the root node, branches (edges or links), and leaf nodes. The root node, like every internal node, represents a test condition on an attribute; each branch represents one possible outcome of that test; and each leaf node contains the class label assigned to the instances that reach it. The root node is at the start of the tree, which is also called the top of the tree.

How Decision Trees Work

A decision tree starts with the whole dataset at the root node and splits it recursively on the feature that best separates the classes. Splitting continues until a stopping criterion is met, for example when all records in a subset belong to the same class.
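The recursive procedure can be sketched in Python as a plain ID3-style learner that picks the split with the highest information gain. This is a simplified illustration, not the J48 algorithm itself: there is no pruning and no handling of numeric attributes or missing values. The 14 rows are the weather data as commonly distributed with WEKA; check them against your own copy of weather.nominal.arff.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [line.split(",") for line in """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".split()]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting."""
    remainder = 0.0
    for value in set(r[column] for r in rows):
        subset = [r[-1] for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder


def build_tree(rows, columns):
    """Split recursively until a subset is pure or no attributes remain."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not columns:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    best = max(columns, key=lambda c: information_gain(rows, c))
    remaining = [c for c in columns if c != best]
    return {
        ATTRIBUTES[best]: {
            value: build_tree([r for r in rows if r[best] == value], remaining)
            for value in sorted(set(r[best] for r in rows))
        }
    }


tree = build_tree(ROWS, [0, 1, 2, 3])
print(tree)
```

On this data the first split is on outlook, because it gives the largest reduction in entropy of the four attributes.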

Advantages of Decision Trees

Simple and visualizable algorithm.
Minimal preprocessing required.
Can be applied to both numeric and categorical features.
Works for both classification and regression tasks.
Helps identify the most important features for predicting outcomes.

Limitations of Decision Trees

Susceptible to overfitting due to its high variance.
Sensitivity to small variations in the dataset.
Low performance on highly imbalanced datasets.
Often less accurate than ensemble methods such as Random Forest.

Decision Tree Algorithms in WEKA

Now, we will see how to implement decision tree classification on weather.nominal.arff dataset using the J48 classifier.

weather.nominal.arff

It is a sample dataset present in the data directory of WEKA. This dataset predicts if the weather is suitable for playing cricket. The dataset has 5 attributes and 14 instances. The class label “play” classifies the output as “yes” or “no”.

J48 Algorithm in WEKA (Step-by-Step Example)

J48 is WEKA's implementation of the C4.5 algorithm (an extension of ID3) for generating a decision tree. C4.5 is also known as a statistical classifier. For decision tree classification, we need a dataset.
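One way C4.5 extends ID3 is by ranking candidate splits with the gain ratio, which divides the information gain by the entropy of the split itself so that attributes with many values are not unfairly favoured. The Python sketch below computes it from class counts. It is a textbook illustration only; the actual J48 code applies further heuristics and pruning that are not shown. The counts are the yes/no tallies per branch for the 14-instance weather data.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratio(class_counts, branches):
    """branches holds one [yes, no] count pair per attribute value."""
    total = sum(class_counts)
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    gain = entropy(class_counts) - remainder
    split_info = entropy([sum(b) for b in branches])
    return gain / split_info if split_info > 0 else 0.0


CLASS_COUNTS = [9, 5]                 # 9 yes, 5 no overall
OUTLOOK = [[2, 3], [4, 0], [3, 2]]    # sunny, overcast, rainy
WINDY = [[3, 3], [6, 2]]              # TRUE, FALSE

print("outlook:", round(gain_ratio(CLASS_COUNTS, OUTLOOK), 3))
print("windy:  ", round(gain_ratio(CLASS_COUNTS, WINDY), 3))
```

Outlook scores higher than windy under this criterion as well, which is consistent with it appearing at the root of the tree.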

Steps include:

#1) Open WEKA explorer.

#2) Select the weather.nominal.arff file using “Open file” under the Preprocess tab.

#3) Go to the “Classify” tab to build a classifier on the data. Click on the “Choose” button. From this, select “trees -> J48”. Let us also have a quick look at the other groups under the Choose button:

bayes: Probabilistic classifiers such as NaiveBayes.
meta: Meta-learners that wrap other classifiers, such as Bagging and AdaBoostM1.
functions: Function-based models such as Logistic (logistic regression) and LinearRegression.
lazy: Instance-based learners such as IBk (k-nearest neighbours).
rules: Rule learners such as OneR and JRip.
trees: Decision tree learners such as J48 and RandomForest.

#4) Click on Start Button. The classifier output will be seen on the Right-hand panel. It shows the run information in the panel as:

Scheme: The classification algorithm used.
Instances: Number of data rows in the dataset.
Attributes: The dataset has 5 attributes.
Number of leaves and size of the tree: These describe the decision tree.
Time taken to build the model: The time needed to train the classifier.
J48 pruned tree: The full tree in text form, with the attribute tested at each node and the number of instances at each leaf.

#5) To visualize the tree, right-click on the entry in the Result list and select “Visualize tree”.

Output:

The output is in the form of a decision tree. The main attribute is “outlook”.

If the outlook is sunny, then the tree further analyzes the humidity. If humidity is high, the class label is play = “no”; if humidity is normal, play = “yes”.

If the outlook is overcast, the class label play is “yes”. The number of instances which obey the classification is 4.

If the outlook is rainy, further classification takes place to analyze the attribute “windy”. If windy = true, then play = “no”. The number of instances which obey the classification for outlook = rainy and windy = true is 2.
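The tree described above can be written down as a nested Python dictionary and used to classify a new day. This is only a hand-transcribed sketch of the tree's rules, not something exported from WEKA, and the `today` instance is a made-up example.

```python
# The weather tree, transcribed by hand as nested dictionaries.
# Inner dict: attribute name -> {attribute value -> subtree or class label}.
J48_WEATHER_TREE = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"windy": {"TRUE": "no", "FALSE": "yes"}},
    }
}


def classify(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree


# Hypothetical new instance.
today = {"outlook": "sunny", "temperature": "cool",
         "humidity": "normal", "windy": "FALSE"}
print(classify(J48_WEATHER_TREE, today))  # prints "yes"
```

Note that temperature is never consulted: the tree does not use that attribute at all.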

Common WEKA Dataset Loading Errors and Their Solutions

While importing a dataset into WEKA is typically a smooth process, problems can occur if the dataset is improperly formatted. Some of the problems you might run into and their respective solutions are listed below.

Incompatible ARFF Format: WEKA expects an ARFF file to follow a fixed layout: an @relation line, one @attribute line per column, and an @data line followed by the rows. A file that does not follow this layout may fail to import.

Solution: Make sure your ARFF file follows the proper format.

Unsupported File Extension: WEKA reads .arff and .csv files, along with a few other formats for which it ships loaders (for example XRFF and LibSVM). Files such as Excel workbooks (.xlsx) or arbitrary JSON documents cannot be opened directly and will result in an error.

Solution: Make sure the dataset is converted into one of the supported formats.
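As an illustration of such a conversion, the Python sketch below turns CSV text into ARFF text using only the standard library. It is a rough helper under stated assumptions, not a replacement for WEKA's own CSV import: a column is treated as numeric when every non-missing value parses as a number, otherwise as nominal, and names or values containing spaces or commas are not quoted. The function names and the sample CSV are made up.

```python
import csv
import io


def is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text with a header row into simple ARFF text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in rows if row[i] != "?"]
        if column and all(is_number(v) for v in column):
            kind = "numeric"
        else:  # nominal: list the values that actually occur
            kind = "{" + ",".join(sorted(set(column))) + "}"
        lines.append("@attribute %s %s" % (name, kind))
    lines += ["", "@data"] + [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


SAMPLE_CSV = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,?,yes\n"
arff_text = csv_to_arff(SAMPLE_CSV, "weather_demo")
print(arff_text)
```

For spreadsheet data, exporting the sheet to CSV first and then loading or converting that CSV is usually the simplest route.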

No Attribute Definition: For WEKA to understand the dataset, each column must have its own attribute definition.

Solution: Make sure that you’ve defined the data columns before @data.

Incompatible Data Types: Each data value must have a compatible data type. Using text where numbers are expected or wrong nominal categories can result in errors.

Solution: Double-check the attribute definitions in the ARFF file.

Incorrectly Represented Missing Values: Missing values need to be represented by a question mark (?). Not doing so, and instead leaving the values empty or representing them by any other text value can lead to problems.

Solution: Make sure missing values are marked with ?.

Incompatible Character Encoding: A dataset saved in incompatible character encoding may be displayed incorrectly.

Solution: Try saving your file in UTF-8.

Large Dataset Problems: Large datasets can take up a lot of time to be imported and use more memory.

Solution: You can allocate more Java heap memory to WEKA or import a small sample before importing the entire file.
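Several of the problems above can be caught before opening the file in WEKA. The Python sketch below checks a simple ARFF text for missing header keywords, rows with the wrong number of values, non-numeric text in numeric columns, unknown nominal values, and missing values that are left empty instead of being written as ?. It is a hypothetical pre-check for simple files only (no quoted values or sparse rows) and its messages are not WEKA's own error messages.

```python
NUMERIC_TYPES = ("numeric", "real", "integer")


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_arff(text):
    """Return a list of problems found in simple ARFF text (empty if none)."""
    problems = []
    attributes = []  # (name, "numeric" | set of nominal values | "other")
    seen_relation = False
    in_data = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank lines and comments are allowed
        if in_data:
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):
                problems.append("line %d: expected %d values, found %d"
                                % (number, len(attributes), len(values)))
                continue
            for value, (name, kind) in zip(values, attributes):
                if value == "?":
                    continue  # correctly marked missing value
                if kind == "numeric" and not is_number(value):
                    problems.append("line %d: %r is not numeric (%s)"
                                    % (number, value, name))
                elif isinstance(kind, set) and value not in kind:
                    problems.append("line %d: %r is not a declared value of %s"
                                    % (number, value, name))
            continue
        parts = line.split(None, 2)
        keyword = parts[0].lower()
        if keyword == "@relation":
            seen_relation = True
        elif keyword == "@attribute" and len(parts) == 3:
            spec = parts[2].strip()
            if spec.startswith("{"):
                kind = {v.strip() for v in spec.strip("{}").split(",")}
            elif spec.lower() in NUMERIC_TYPES:
                kind = "numeric"
            else:
                kind = "other"
            attributes.append((parts[1], kind))
        elif keyword == "@data":
            in_data = True
        else:
            problems.append("line %d: unrecognized header line" % number)
    if not seen_relation:
        problems.append("missing @relation")
    if not attributes:
        problems.append("no @attribute definitions before @data")
    if not in_data:
        problems.append("missing @data")
    return problems


BROKEN = """\
@relation demo
@attribute temperature numeric
@attribute play {yes, no}
@data
85,no
warm,yes
70,maybe
72
,yes
"""
for problem in check_arff(BROKEN):
    print(problem)
```

In the sample, lines 6 to 9 each trigger one message: text in a numeric column, an undeclared nominal value, a short row, and an empty field where a ? was needed.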

What is the J48 algorithm in Weka?

J48 is WEKA's open-source Java implementation of the C4.5 algorithm. It builds either pruned or unpruned decision trees as classifiers for a given dataset. To decide how to split the data instances into branches, it evaluates the information gain (normalized as the gain ratio) of each attribute. It is one of the most popular machine learning algorithms for classification problems.

Where can I find standard sample datasets already built into Weka?

Once you install Weka locally, a data directory is available inside the installation directory (for example, C:\Program Files\Weka-3-9\data; the folder name depends on the installed version). Well-known benchmark datasets, including iris.arff, weather.nominal.arff, and diabetes.arff, are available there for immediate loading.

How do I handle missing values in my Weka dataset?

Weka has a native representation for missing values in .arff files: the question mark (?). If your dataset has missing values, you can deal with them during preprocessing. In the Filter panel of the Preprocess tab, choose filters -> unsupervised -> attribute -> ReplaceMissingValues and apply it to the dataset before running an algorithm such as J48.
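The ReplaceMissingValues filter fills each missing value with the mean of the column for numeric attributes and with the most frequent value (the mode) for nominal attributes. The Python sketch below imitates that behaviour on a handful of made-up rows. It is an illustration of the idea rather than WEKA's code, and the `numeric_columns` parameter is a hypothetical way of telling the function which columns are numeric.

```python
from collections import Counter
from statistics import mean


def replace_missing(rows, numeric_columns, marker="?"):
    """Fill missing values with the column mean (numeric) or mode (nominal)."""
    fills = []
    for i, column in enumerate(zip(*rows)):
        present = [v for v in column if v != marker]
        if i in numeric_columns:
            fills.append(mean([float(v) for v in present]))
        else:
            fills.append(Counter(present).most_common(1)[0][0])
    return [
        [
            fills[i] if v == marker
            else (float(v) if i in numeric_columns else v)
            for i, v in enumerate(row)
        ]
        for row in rows
    ]


# Made-up rows: outlook (nominal), temperature (numeric), play (nominal).
rows = [
    ["sunny", "85", "no"],
    ["?", "80", "yes"],
    ["sunny", "?", "yes"],
    ["rainy", "75", "yes"],
]
filled = replace_missing(rows, numeric_columns={1})
for row in filled:
    print(row)
```

Here the missing outlook becomes sunny (the most frequent value) and the missing temperature becomes 80.0 (the mean of 85, 80, and 75).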

Conclusion

WEKA offers a wide range of sample datasets to apply machine learning algorithms. The users can perform machine learning tasks such as classification, regression, attribute selection, and association on these sample datasets, and can also learn the tool using them.

WEKA explorer is used for performing several functions, starting from preprocessing. Preprocessing takes input as a .arff file, processes the input, and gives an output that can be used by other computer programs. In WEKA the output of preprocessing gives the attributes present in the dataset which can be further used for statistical analysis and comparison with class labels.

WEKA also offers many classification algorithms, including several decision tree learners. J48 is one of the popular classification algorithms which outputs a decision tree. Using the Classify tab the user can visualize the decision tree. If the decision tree is too large, it can be simplified by removing attributes that are not required from the Preprocess tab and starting the classification process again.
