In the previous post, Python Machine Learning Example (KNN), we used a movie catalog dataset whose categories were already encoded as 0s and 1s. In this tutorial, let's pick a dataset with raw values, encode them as 0s and 1s ourselves (one-hot encoding), and see if we can get any interesting insights.
Data Source
Let’s use the dataset example from Kaggle.com. I picked the car dataset for this example.
Dataset Sample
https://www.kaggle.com/jingbinxu/sample-of-car-data/version/1
Data Clean Up and Prep
First, we need to clean up the data and prepare it so that we can apply KNN to it.
Once the dataset is downloaded, the first thing to do is generate a new dataset with 0s and 1s for the selected attributes.
The dataset has several attributes, but I will use only the following ones.
- fuel_type (2)
- aspiration (2)
- num_of_doors (3)
- body_style (5)
- drive_wheels (3)
- engine_location (2)
The number next to each attribute is the number of distinct values it takes. For example, fuel_type has two values (gas and diesel), aspiration has std and turbo, and so on.
Then, we will convert these variations to 0s and 1s. pd.get_dummies() checks the variations in each attribute and automatically creates a new column for each variation.
For example, it will create the new columns fuel_type_gas and fuel_type_diesel for the fuel_type attribute. If an entity has the value gas in the original fuel_type attribute, it gets 1 in fuel_type_gas and 0 in fuel_type_diesel. The same applies to the rest of the attributes, which generates a total of 17 columns.
The table will look something like below:
entity | fuel_type_gas | fuel_type_diesel | aspiration_std | aspiration_turbo |
entity 1 | 1 | 0 | 1 | 0 |
entity 2 | 0 | 1 | 0 | 1 |
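As a quick illustration of the encoding described above, here is a minimal sketch on a hypothetical two-row sample (the real dataset has more rows and attributes). The dtype=int argument is an assumption added here: recent pandas versions return True/False columns by default, and dtype=int makes the 0/1 values explicit.

```python
import pandas as pd

# Hypothetical two-row sample; the values mirror the categories named above.
cars = pd.DataFrame({
    "fuel_type": ["gas", "diesel"],
    "aspiration": ["std", "turbo"],
})

# One new column per distinct value, named <attribute>_<value>.
# dtype=int is an assumption: it forces 0/1 integers instead of booleans.
encoded = pd.get_dummies(cars, dtype=int)
print(encoded)
```

Note that pandas orders the new columns alphabetically within each attribute, so fuel_type_diesel comes before fuel_type_gas.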
Libraries used in this article
We will be using the following libraries. If any of them are missing on your system, install them with pip install. The operator module is part of the Python standard library and needs no installation.
- operator
- pandas
- numpy
- scipy
pip install the libraries
pandas
pip install pandas
numpy
The numpy library should already be installed as a dependency of pandas. If you don't have numpy, install it as follows:
pip install numpy
scipy
pip install scipy
Load Dataset
Load the dataset that will be used in this article. It is a CSV file with raw data, and it becomes a pandas dataframe once it is loaded.
The sample data file can be downloaded here
import pandas as pd
pd_df = pd.read_csv('data/car_dataset.csv')
Create Dataframes
We will create two pandas dataframes. One dataframe holds the entity name and its ID. The other holds the attributes. After the attribute values are converted to 0s and 1s, these two dataframes will be merged into one dataframe.
First Dataframe (pd_df0)
pd_df0 is a dataframe for the entity name and its ID. It extracts all rows for columns 0 and 2 from the loaded CSV file.
pd_df0 = pd_df.iloc[:, [0, 2]]
Second Dataframe (pd_df1)
pd_df1 will contain the attributes. It extracts columns 3, 4, 5, 6, 7 and 8.
pd_df1 = pd_df.iloc[:, [3, 4, 5, 6, 7, 8]]
Once these are extracted, we use pd.get_dummies(pd_df1) to convert the attribute values to 0s and 1s.
pd_df1 = pd.get_dummies(pd_df1)
Merged Dataframe (pd_df2)
Now the two dataframes are merged into one dataframe using pd.concat().
pd_df2 = pd.concat([pd_df0, pd_df1], axis=1, sort=False)
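To see the split, encode, and merge steps together, here is a minimal sketch on a hypothetical three-row frame. The column names and positions are assumptions for illustration only and do not match the real CSV layout; dtype=int is likewise an assumption to keep the encoded values as 0/1 integers.

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "model": ["audi 1", "bmw 2", "subaru 3"],
    "fuel_type": ["gas", "diesel", "gas"],
    "body_style": ["sedan", "wagon", "sedan"],
})

# Identifier columns (ID and name) selected by position.
ids = toy.iloc[:, [0, 1]]

# Attribute columns selected by position, then one-hot encoded.
attrs = pd.get_dummies(toy.iloc[:, [2, 3]], dtype=int)

# axis=1 places the frames side by side, aligned on the row index.
merged = pd.concat([ids, attrs], axis=1)
print(merged)
```

Because both frames come from the same source and keep the same row index, the side-by-side concatenation lines each entity up with its own encoded attributes.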
Convert dataframe to Array
After we have the data we want to work on, the dataframe needs to be converted to an array format.
df_array = pd_df2.to_numpy()
Convert Array to set of Dictionary
The array now needs to be converted into a dictionary keyed by car ID.
import numpy as np

carDict = {}
for d in df_array:
    carID = int(d[0])
    name = d[1]
    attributes = d[2:]
    attributes = map(int, attributes)
    carDict[carID] = (name, np.array(list(attributes)))
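The conversion above can be tried on a small example. This is a sketch on a hypothetical merged frame with two encoded columns; the names merged and car_dict are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame: ID, name, then the encoded attribute columns.
merged = pd.DataFrame({
    "id": [1, 2],
    "model": ["audi 1", "bmw 2"],
    "fuel_type_diesel": [0, 1],
    "fuel_type_gas": [1, 0],
})

car_dict = {}
for row in merged.to_numpy():
    car_id = int(row[0])
    # Everything after the first two columns is the 0/1 attribute vector.
    car_dict[car_id] = (row[1], np.array(row[2:], dtype=int))

print(car_dict[1])
```

Each dictionary value is a (name, attribute vector) pair, which is the shape the distance function expects.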
At this point, the data is in a form we can use to apply the KNN algorithm.
Applying KNN Algorithm
At this point, we can leverage the KNN algorithm we used in Python Machine Learning Example (KNN).
Prepare the getNeighbors() and ComputeDistance() functions.
getNeighbors()
import operator

def getNeighbors(carID, K):
    distances = []
    for car in carDict:
        if (car != carID):
            dist = ComputeDistance(carDict[carID], carDict[car])
            distances.append((car, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append((distances[x][0], distances[x][1]))
    return neighbors
ComputeDistance()
from scipy import spatial

def ComputeDistance(a, b):
    dataA = a[1]
    dataB = b[1]
    AttributeDistance = spatial.distance.cosine(dataA, dataB)
    return AttributeDistance
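The two functions can be exercised end to end on a toy dictionary. The sketch below is self-contained and makes a few assumptions: the vectors are hypothetical one-hot rows over a reduced set of columns, "example 99" is an invented entry, and cosine_distance is a small numpy helper that computes the same quantity as the cosine distance used above, 1 - (u . v) / (|u| |v|). It also slices the sorted list instead of indexing it, so asking for more neighbors than there are other cars returns a shorter list rather than raising an IndexError.

```python
import numpy as np

# Hypothetical one-hot vectors over the columns
# [diesel, gas, std, turbo, four, sedan, 4wd, fwd, front].
car_dict = {
    4: ("audi 4", np.array([0, 1, 1, 0, 1, 1, 0, 1, 1])),
    5: ("audi 5", np.array([0, 1, 1, 0, 1, 1, 1, 0, 1])),
    99: ("example 99", np.array([1, 0, 0, 1, 1, 1, 0, 1, 1])),
    145: ("subaru 145", np.array([0, 1, 1, 0, 1, 1, 1, 0, 1])),
}

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def get_neighbors(car_id, k):
    """Return up to k (id, distance) pairs, nearest first."""
    target = car_dict[car_id][1]
    distances = [
        (other_id, cosine_distance(target, vector))
        for other_id, (_, vector) in car_dict.items()
        if other_id != car_id
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

for other_id, dist in get_neighbors(5, 2):
    print(other_id, car_dict[other_id][0], dist)
```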
Comparison
Now, let’s take a look at how a selected entity compares with the similar entities chosen by the KNN algorithm.
Let’s use entity ID 5, which has the following attributes.
model | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location |
audi 5 | gas | std | four | sedan | 4wd | front |
The algorithm returned the following entities. Each row shows the ID, the model, and the cosine distance from audi 5.
145 | subaru 145 | 0.0
4 | audi 4 | 0.16666666666666674
7 | audi 7 | 0.16666666666666674
12 | bmw 12 | 0.16666666666666674
14 | bmw 14 | 0.16666666666666674
15 | bmw 15 | 0.16666666666666674
16 | bmw 16 | 0.16666666666666674
18 | bmw 18 | 0.16666666666666674
21 | chevrolet 21 | 0.16666666666666674
26 | dodge 26 | 0.16666666666666674
Entity ID 145, subaru 145, has a distance of 0. Its attributes are shown in the table below.
model | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location |
subaru 145 | gas | std | four | sedan | 4wd | front |
This means it is an exact match. The second nearest entity is audi 4, which has the following attributes.
model | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location |
audi 4 | gas | std | four | sedan | fwd | front |
Everything is the same except drive_wheels.
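The distance of roughly 0.1667 follows directly from the encoding. Each car has exactly six columns set to 1 (one per attribute), so for two one-hot vectors the dot product is the number of shared values and each norm is the square root of 6. A minimal sketch of that arithmetic, using sets of active column names as an illustrative stand-in for the full 17-column vectors:

```python
import math

# Columns set to 1 for each car (names follow the get_dummies pattern).
audi_5 = {"fuel_type_gas", "aspiration_std", "num_of_doors_four",
          "body_style_sedan", "drive_wheels_4wd", "engine_location_front"}
audi_4 = {"fuel_type_gas", "aspiration_std", "num_of_doors_four",
          "body_style_sedan", "drive_wheels_fwd", "engine_location_front"}

# For 0/1 vectors: dot product = shared columns, norm = sqrt(active columns).
shared = len(audi_5 & audi_4)
distance = 1 - shared / math.sqrt(len(audi_5) * len(audi_4))

print(shared, round(distance, 4))  # 5 0.1667
```

With five of six values in common, the distance is 1 - 5/6, which is why every car that differs from audi 5 in exactly one attribute shows the same value in the result list.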
As you can see from the result, the KNN algorithm is useful for identifying similar entities based on the given attributes.