Car Evaluation Using Machine Learning
Paper Type: Free Essay | Subject: Computer Science | Wordcount: 5478 words | Published: 8th Feb 2020
1. ABSTRACT.
Cars are an essential part of our daily lives. Many kinds of car are produced by different manufacturers, so buyers have a decision to make.
When an individual considers buying a car, numerous aspects can influence the choice of which kind of car he or she is interested in. The decision a buyer or driver makes generally depends on the price, the safety, and how luxurious and how spacious the car is.
The Car Evaluation database is structured information that anyone looking at a car's features will find useful in decision making. The data set is labelled according to the specifications of PRICE, COMFORT and SAFETY, and can be accessed at https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
The objective of this report is to support that decision making: to relate variables such as the buying price to the other attributes, and to separate acceptable cars from unacceptable ones according to the target value.
2. INTRODUCTION.
Understanding how to make a decision when buying a car is essential for everybody, particularly first-time buyers or anyone inexperienced in how the car business works. Generally we need a car as a means of transportation, but as we add fun into the choice we tend to forget factors that we should not underestimate.
Classifying a good car, distinguishing a better-than-average one from a terrible one, is normally done manually, with the help of a car sales representative who guides our purchase, or from the opinions of family and friends who have past experience of vehicle troubles. It would be better to have a tool that can check a car's features and tell us whether it is a good or a bad choice; with such a tool there would be far fewer worries in purchasing a car.
At present it is usually the car sales representative who persuades us to buy a particular car or not. Whether we realise it consciously or not, we often ignore the factors that would serve our finances, comfort and safety in the long run.
In this assignment we process the data, explore the relationships between the attributes, and model the data with two classification models, K-nearest neighbours and decision trees, comparing the best set of parameters for each and their performance on the car evaluation data set.
2.1 Dataset Attributes
The data set, accessed from the UCI repository, is a collection of observations of the specified attributes of a car; it was donated by Marko Bohanec in 1997.
The Car Evaluation data set has the following concept structure:
CAR - car acceptability
  PRICE - overall price
    buying - buying price
    maint - price of the maintenance
  TECH - technical characteristics
    COMFORT - comfort
      doors - number of doors
      persons - capacity in terms of persons to carry
      lug_boot - the size of the luggage boot
    safety - estimated safety of the car
2.1. A. Do we need all the Variables?
Getting rid of unnecessary variables is a good initial step when dealing with any data set, since dropping attributes reduces complexity and can make computation on the data set quicker. Whether we should discard an attribute depends on the size of the data set and the goal of our investigation, but in general it is useful to drop variables that would only distract from the aim of the assignment.
The class of a car depends directly on six input attributes, 'buying', 'maint', 'doors', 'persons', 'lug_boot' and 'safety', plus the target attribute 'classes'.
The data set contains 1727 instances, and the possible values of each attribute are listed below.
INPUT attributes

| Attribute | Values |
| buying | vhigh, high, med, low |
| maint | vhigh, high, med, low |
| doors | 2, 3, 4, 5more |
| persons | 2, 4, more |
| lug_boot | small, med, big |
| safety | low, med, high |

Table 1: Attribute values
Missing attribute values: the Car Evaluation data set does not contain any missing values.
2.1. B. TARGET ATTRIBUTE - classes
Data analysis is done on this data set to identify patterns, along with the range of each attribute and its percentage (frequency).
| classes | Number of observations per class | Percentage |
| unacc | 1209 | 70.023% |
| acc | 384 | 22.222% |
| good | 69 | 3.993% |
| vgood | 65 | 3.762% |

Table 2: Class distribution
The target variable classes indicates whether each car is unacc, acc, good or vgood; predicting this class is our goal.
Attribute characteristics: categorical
Associated task: classification, to acquire knowledge from the data set
Based on the distribution in the table, most instances fall in the unacc class, which means the data is skewed; classification is therefore the right task for analysing this distribution.
3. Methodology
3.1 Data collection
The Car Evaluation data set was selected from the UCI Machine Learning repository for this assignment. It contains 1727 instances and 6 attributes. We import the necessary pandas modules to read the car evaluation data set from our system drive.
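The loading step can be sketched as follows; the column names follow the attribute list above, while the inline two-row sample is a hypothetical stand-in for the real data file downloaded from the UCI page:

```python
# Sketch of reading the car evaluation data with pandas.  The raw UCI file
# has no header row, so attribute names are supplied explicitly; the two-row
# inline sample below is a hypothetical stand-in for the real file.
import io
import pandas as pd

columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "classes"]
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,4,4,big,high,vgood\n"
)

df = pd.read_csv(sample, names=columns)
print(df.shape)  # (2, 7)
```

Reading the full file from disk works the same way, with the file path in place of the `StringIO` sample.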
3.2 Data preprocessing
The data set from the UCI repository has to be cleaned to a standard quality before the model analysis proceeds. Data sets often contain missing values and extreme values called outliers; these can affect our tests and can even cause a model to fail. It is better to remove all outliers and to fill missing values with nearby values. Our data set has no missing values and no outliers of any sort.
buying 0
maint 0
doors 0
persons 0
lug_boot 0
safety 0
classes 0
dtype: int64
As the output above shows, detecting missing values is an easy task; deciding how to handle them is harder. Missing values in a categorical data set are less troubling because we can treat them as NA, whereas missing values in numerical variables would complicate our analysis. Data cleaning is done before we start.
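The per-column count shown above comes from a check like the following sketch; the small frame is a hypothetical stand-in for the full data set:

```python
# Hypothetical stand-in frame for the full data set, used to show the
# per-column missing-value count that produces output like the above.
import pandas as pd

df = pd.DataFrame({
    "buying": ["vhigh", "low"],
    "safety": ["low", "high"],
    "classes": ["unacc", "vgood"],
})

missing = df.isnull().sum()  # zero for every column, as in the car data
print(missing)
```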
Before starting the analysis it is a good idea to check the dimensions of our data set, along with the shape and description of the variables.
Exploring the attributes and variables
The initial step in exploratory data analysis is reading the data information and then exploring the attributes. It is essential to get a sense of how many variables and cases there are, their datatypes, and the possible ranges of values the attributes can take.
Transforming the variables (data transformation)
When we first load the data set, some variables may be encoded with datatypes that do not fit our analysis. For example, the classes variable (the target), which indicates unacceptable, acceptable, good and very good, should take only values such as 1, 2, 3 and 4.
Most of the variables are encoded as the object type; in this analysis every variable is categorical and stored as a string. For further processing we need to change the string type to an integer type, since the models require integer inputs, and we converted each variable by assigning a specified number to each category (encoding).
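The encoding step can be sketched as below; the specific integer codes are assumptions, since any consistent mapping from category to integer works for these models:

```python
# Ordinal encoding sketch; the integer codes are assumptions -- any
# consistent category-to-integer mapping works for these models.
import pandas as pd

df = pd.DataFrame({"buying": ["vhigh", "high", "med", "low"]})

buying_map = {"vhigh": 4, "high": 3, "med": 2, "low": 1}
df["buying"] = df["buying"].map(buying_map)
print(df["buying"].tolist())  # [4, 3, 2, 1]
```

The same pattern applies to the other five attributes and to the target column.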
3.3 Data exploration
Data exploration is a technique, similar to data analysis, in which the data and its characteristics are summarised through visual exploration.
The data exploration includes following
i. Univariate
Exploration and analysis of each variable
ii. Bivariate
Exploration and analysis of pairs of variables and their relationships
iii. Multivariate
Exploration of multiple variables in the data set
Here we check each feature against the class distribution.
The graph above gives the counts (unique values in the column) versus the classes.
From the graph, almost 70% of the cars are in the class unacceptable (unacc), which means the distribution is skewed.
Of the total 1727 instances of cars in the data set, 1209 (70%) were unacceptable, 384 (22%) were acceptable, 69 (3.9%) were good and 65 (3.7%) were very good. From the graph we can conclude that more than half of the cars evaluated were not acceptable.
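The per-class counts and percentages behind these figures can be computed as sketched below; the tiny series is a hypothetical stand-in for the "classes" column of the full data set:

```python
# Per-class frequency sketch; the tiny series is a hypothetical stand-in
# for the "classes" column of the full data set.
import pandas as pd

classes = pd.Series(["unacc", "unacc", "unacc", "acc", "good", "vgood"])

counts = classes.value_counts()
percentages = classes.value_counts(normalize=True) * 100
print(counts["unacc"], percentages["unacc"])  # 3 50.0
```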
The buying histogram shows that the class distributions tend to be uniform, while a very high or high buying cost will probably cause a car to be unaccepted.
A very high or high maintenance price will likewise probably cause a car to be unaccepted.
The distribution of each class over doors also tends to be uniform, although 2 doors pushes a car towards the unaccepted class.
In the persons distribution, a capacity of 2 persons makes a car unaccepted.
For luggage boot space, a small boot pushes the car towards the unaccepted class.
The safety distribution over the classes appears normal, and low safety most likely causes a car to be unaccepted.
Since a normal distribution is natural in real-world cases, we can conclude that safety is the most important feature in our model analysis.
3.4 Splitting the data set and randomising
Training and testing
X - data frame containing the input data
Y - output data / the result to be predicted
In this assignment we divided the data set into a training set and a testing set; the three splits used are shown in Table ab.
| Training % | Testing % |
| 50% | 50% |
| 60% | 40% |
| 80% | 20% |

Table ab: Training and testing split
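The three splits can be produced as sketched below, using scikit-learn's train_test_split; the toy arrays and the random_state value are assumptions:

```python
# Sketch of the three train/test splits with scikit-learn; the toy arrays
# and the random_state value are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # hypothetical encoded features
y = np.arange(50) % 4              # hypothetical class labels

for test_size in (0.5, 0.4, 0.2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    print(len(X_train), len(X_test))  # 25 25, then 30 20, then 40 10
```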
3.5 Data modelling (classification)
The experiment is carried out using two classifier models, K-nearest neighbours and decision trees. The aim is to determine which classifier best suits our data set in terms of classifying the training and testing sets, and in the predictions made by the model obtained during training. The detailed procedure of the experiment is below.
K-nearest neighbours
K-NN is a classifier that finds the classes of the k nearest neighbours (based on a distance metric, the shortest distance between samples, here Euclidean), takes the majority class among them and assigns that class to the test pattern. Here we compare the three splits of the data.
The KNN model is a learning technique in which a particular instance is mapped against many labels; here we pre-specify the labels to train our model.
Its parameters are n_neighbors and the power parameter, p = 1 or 2.
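A minimal sketch of fitting a KNN classifier with the parameters named above; n_neighbors=7 and p=2 (Euclidean distance) mirror the best settings reported later, while the one-dimensional toy data is an assumption:

```python
# Minimal KNN sketch; n_neighbors=7 and p=2 (Euclidean distance) mirror the
# best parameters reported later, while the 1-D toy data is an assumption.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.linspace(-2, 2, 41).reshape(-1, 1)  # points on a line
y = (X[:, 0] > 0).astype(int)              # class 1 right of zero

knn = KNeighborsClassifier(n_neighbors=7, p=2)
knn.fit(X, y)
print(knn.predict([[1.5]]))  # all 7 nearest neighbours are class 1 -> [1]
```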
Decision trees
A decision tree is a model that uses a tree-like graph of conditions (decisions) and their possible consequences. It is one approach to expressing an algorithm that contains only conditional control statements.
It follows a flowchart-like structure in which each internal node tests a condition on an attribute, each branch represents an outcome of the condition, and each leaf node represents a class label. The top-down path from the root to a leaf represents a classification rule.
Root node:
This node represents the total population (all instances) and is further broken down into sub-nodes based on the conditions.
Decision node
When a sub-node is divided into further sub-nodes, it is called a decision node.
Leaf node
A node that cannot be split further into sub-nodes.
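A minimal sketch of fitting a decision tree; max_depth=6 mirrors the setting used in the experiments below, while the toy data is an assumption:

```python
# Minimal decision-tree sketch; max_depth=6 mirrors the setting used in the
# experiments, while the toy data is an assumption.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]] * 5)
y = np.array([1, 1, 0, 0] * 5)  # the class follows the first feature

tree = DecisionTreeClassifier(max_depth=6, random_state=0)
tree.fit(X, y)
print(tree.predict([[1, 0], [0, 1]]))  # [1 0]
```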
3.6 Accuracy test
Accuracy: the number of correct classifications divided by the total number of classifications.
Train accuracy: the accuracy of a model on the samples it was constructed on.
Test accuracy: the accuracy of a model on samples it has not seen.
The accuracy is tested on each data split from Table ab, and a report is printed to compare which model suits best.
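The accuracy and error-rate computation can be sketched as below; accuracy_score is standard scikit-learn, while the toy label vectors are assumptions:

```python
# Accuracy and error-rate sketch with scikit-learn; the toy label vectors
# are assumptions.
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 1]  # one of six predictions is wrong

acc = accuracy_score(y_true, y_pred)
print(acc)      # 5 correct out of 6
print(1 - acc)  # classification error rate
```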
Results
The presentation of the results is based on the following model analysis.
a. Classification
KNN
Confusion matrix
Below is the confusion matrix of the 80-20 split of the data set for our KNN model.
Confusion Matrix:
[[231 2 1 0]
[ 14 73 0 0]
[ 1 1 10 0]
[ 0 2 1 10]]
KNN fits the test set well enough: only 22 instances are misclassified.
To understand the model better we also used further measurements: precision, recall and F1 score.
The measurement results for the 80:20 split of the KNN model are:
precision: 0.9270637898686679
recall: 0.8572060123784262
f1 score: 0.8875617588932807
The F1 score is a combination of precision and recall, so we use the F1 score to measure our model's performance.
In this analysis the 80:20 split for KNN performs well enough.
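The three measurements can be computed as sketched below; macro averaging over the classes is an assumption, consistent with the single figures reported above, and the toy label vectors are hypothetical:

```python
# Precision / recall / F1 sketch; macro averaging over the classes is an
# assumption, and the toy label vectors are hypothetical.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # hypothetical predictions

p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(round(p, 3), round(r, 3), round(f1, 3))
```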
The accuracy achieved for the different data set splits is presented in Tables c and D.
KNN

| Splitting percentage (training % / testing %) | 80-20 |
| n_neighbors | 7 |
| Power variable (p) | 2 |
| Testing accuracy | 93.64161849710982% |
| Classification error rate | 6.358381502890175% |
| Confusion matrix | [[231 2 1 0] [ 14 73 0 0] [ 1 1 10 0] [ 0 2 1 10]] |
| Precision | 0.9270637898686679 |
| Recall | 0.8572060123784262 |
| F1 score | 0.8875617588932807 |

Table c: Classification results for KNN
Decision tree

| Splitting percentage (training % / testing %) | 80-20 |
| max_depth | 6 |
| Testing accuracy | 93.64161849710982% |
| Classification error rate | 6.358381502890175% |
| Confusion matrix | [[224 9 1 0] [ 1 78 7 1] [ 0 0 11 1] [ 0 2 0 11]] |
| Precision | 0.8242653161281193 |
| Recall | 0.9041592985558503 |
| F1 score | 0.8545574400650303 |

Table D: Classification results for the decision tree
Discussions
- The comparison across the data splits shows that K-nearest neighbour and decision tree have exactly the same accuracy on the 80:20 split, 93.64%.
- The F1 score is a combination of precision and recall, so we use it to measure model performance.
- Here the KNN classifier has a higher F1 score than the decision tree.
- To distinguish the two classifiers and their performance, their results are compared in Table x and Table y under classification; it is observed that the KNN classifier suits our data set best.
- It is also seen that the decision tree has a lower F1 score, even though KNN and the decision tree have the same accuracy.
Conclusion
The comparative analysis of the classifiers used in this assignment shows that K-nearest neighbour and decision tree have the same performance in terms of accuracy.
However, in terms of F1 score, K-nearest neighbour is the best compared with the decision tree.
Result table
K-nearest neighbour (KNN)

| Splitting percentage (training % / testing %) | 50-50 | 60-40 | 80-20 |
| n_neighbors | 5 | 5 | 7 |
| Power variable (p) | 2 | 1 | 2 |
| Testing accuracy | 91.20370370370371% | 92.61939218523878% | 93.64161849710982% |
| Classification error rate | 8.79629629629629% | 7.38060781476122% | 6.358381502890175% |
| Confusion matrix | [[577 6 3 0] [ 46 167 3 3] [ 4 3 21 1] [ 1 2 4 23]] | [[459 4 2 0] [ 28 148 3 0] [ 3 5 15 0] [ 1 5 0 18]] | [[231 2 1 0] [ 14 73 0 0] [ 1 1 10 0] [ 0 2 1 10]] |
| Precision | 0.8465658156996925 | 0.8996017827059918 | 0.9270637898686679 |
| Recall | 0.809500828387994 | 0.8040215824237817 | 0.8572060123784262 |
| F1 score | 0.8247259934493818 | 0.8457758780971122 | 0.8875617588932807 |

Table x: Accuracy results for K-nearest neighbour
Result table
Decision trees (max_depth=6)

| Splitting percentage (training % / testing %) | 50-50 | 60-40 | 80-20 |
| Testing accuracy | 92.24537037037037% | 93.19826338639653% | 93.64161849710982% |
| Classification error rate | 7.754629629629628% | 6.80173661360347% | 6.358381502890175% |
| Confusion matrix | [[568 16 2 0] [ 16 179 21 3] [ 0 0 24 5] [ 0 4 0 26]] | [[449 14 2 0] [ 2 156 18 3] [ 0 0 19 4] [ 0 4 0 20]] | [[224 9 1 0] [ 1 78 7 1] [ 0 0 11 1] [ 0 2 0 11]] |
| Precision | 0.7868611018471237 | 0.7800093405644288 | 0.8242653161281193 |
| Recall | 0.8702219370468116 | 0.8741300168982008 | 0.9041592985558503 |
| F1 score | 0.8178696121130331 | 0.815354746873236 | 0.8545574400650303 |

Table y: Accuracy results for decision trees
With the 80-20 data split, both models, KNN and the decision tree, have the same accuracy; but accuracy is not a fair criterion for an unbalanced classification, so we check the F1 score instead. KNN with the 80-20 split has the highest, 0.88, compared with the decision tree's F1 score of 0.85.
Conclusion
The KNN model has the highest accuracy with the 80-20 split compared with the decision trees, and it is the most suitable model for our data set, with parameters n_neighbors = 7 and power variable p = 2.
We were able to reach a testing accuracy of 93.64161849710982%.
Decision tree (DT)
The decision tree with the 80:20 split also performs very well.
All the attributes play a vital role for customers in assessing whether a car is in the accepted or unaccepted class.
Safety and person capacity are the main factors placing a car in the unacceptable class.
The number of doors plays no important part in deciding the class of the car.
References
- https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
- https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
- https://www.dataquest.io/blog/sci-kit-learn-tutorial/
- https://towardsdatascience.com/machine-learning-general-process-8f1b510bd8af