Python Machine Learning KNN Example from CSV data

In the previous post, Python Machine Learning Example (KNN), we used movie catalog data whose categories were already encoded as 0s and 1s. In this tutorial, let’s pick a dataset with raw values, encode them ourselves, and see if we can get any interesting insights.

Data Source

Let’s use the dataset example from Kaggle.com. I picked the car dataset for this example.

Dataset Sample

https://www.kaggle.com/jingbinxu/sample-of-car-data/version/1

Data Clean Up and Prep

First, we need to clean up the data and prepare it so that we can apply KNN.

Once the dataset is downloaded, the first thing we need to do is generate a new dataset with 0s and 1s for the selected attributes.

There are several attributes in the dataset, but I will only be using the following ones.

fuel_type (2)
aspiration (2)
num_of_doors (3)
body_style (5)
drive_wheels (3)
engine_location (2)

The number next to each attribute is the number of distinct values it takes. For example, fuel_type has two values (gas and diesel), aspiration has two (std and turbo), and so on.

Then, we will convert these variations to 0s and 1s.

pd.get_dummies() checks which distinct values each attribute has and automatically creates one new column per value, named after the attribute and the value.

For example, for the fuel_type attribute it creates the columns fuel_type_gas and fuel_type_diesel. If an entity has the value gas in the original fuel_type attribute, it gets 1 in fuel_type_gas and 0 in fuel_type_diesel. The same is done for the rest of the attributes, which generates a total of 17 columns (2 + 2 + 3 + 5 + 3 + 2).

The table will look something like this:

	fuel_type_gas	fuel_type_diesel	aspiration_std	aspiration_turbo
entity 1	1	0	1	0
entity 2	0	1	0	1
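
As a minimal sketch of what this encoding does, the snippet below runs pd.get_dummies() on a two-row toy dataframe (the values are made up for illustration and are not taken from the Kaggle file). The dtype=int argument is an addition here: recent pandas versions return True/False columns by default, and dtype=int asks for 0s and 1s instead.

```python
import pandas as pd

# Toy rows for illustration only (not from the actual dataset)
toy = pd.DataFrame({
    "fuel_type": ["gas", "diesel"],
    "aspiration": ["std", "turbo"],
})

# One new 0/1 column per distinct value of each attribute
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Within each attribute, the new columns come out in sorted order of the values, so fuel_type_diesel appears before fuel_type_gas.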

Libraries Used in This Article

We will be using the following libraries. If you do not have them on your system, use pip install to install them. (operator is part of the Python standard library, so it does not need to be installed.)

operator
pandas
numpy
scipy

pip install the libraries

pandas

pip install pandas

numpy

numpy should already be installed as a dependency when you install pandas. If you don’t have numpy, install it with pip as follows:

pip install numpy

scipy

pip install scipy

Load Dataset

Load the dataset that will be used in this article. It is a CSV file with raw data, and it is converted to a pandas dataframe when it is loaded.

The sample data file can be downloaded here

import pandas as pd
pd_df = pd.read_csv('data/car_dataset.csv')

Create Dataframes

We will create two pandas dataframes. One is for the entity ID and name. The other is for the attributes. After the attribute values are converted to 0s and 1s, the two dataframes will be merged into one.

First Dataframe (pd_df0)

pd_df0 is a dataframe for the entity ID and its name. It extracts all rows of columns 0 and 2 from the loaded CSV file.

pd_df0 = pd_df.iloc[:, [0, 2]]

Second Dataframe (pd_df1)

pd_df1 contains the attributes. It extracts columns 3, 4, 5, 6, 7 and 8.

pd_df1 = pd_df.iloc[:, [3, 4, 5, 6, 7, 8]]

Once these are extracted, we use pd.get_dummies(pd_df1) to convert the attribute values to 0s and 1s.

pd_df1 = pd.get_dummies(pd_df1)
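
One way to sanity-check the number of generated columns is to compare the width of the encoded dataframe with the sum of distinct values per attribute. The sketch below does this on a small made-up dataframe; on the real data the same check would be expected to give 17, assuming the counts listed earlier.

```python
import pandas as pd

# Made-up attribute rows for illustration only
attrs = pd.DataFrame({
    "fuel_type": ["gas", "gas", "diesel"],
    "num_of_doors": ["two", "four", "four"],
    "drive_wheels": ["fwd", "rwd", "4wd"],
})

encoded = pd.get_dummies(attrs, dtype=int)

# Each attribute contributes one column per distinct value
expected_columns = int(attrs.nunique().sum())
print(expected_columns, encoded.shape[1])
```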

Merged Dataframe (pd_df2)

Now the two dataframes are merged into one dataframe using pd.concat().

pd_df2 = pd.concat([pd_df0, pd_df1], axis=1, sort=False)
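
pd.concat() with axis=1 places the dataframes side by side and lines rows up by index label, which works here because both pieces were sliced from the same dataframe. A minimal sketch with made-up rows and hypothetical column names (car_id, make, name and fuel_type are assumptions, not the Kaggle column names):

```python
import pandas as pd

# Made-up rows with hypothetical column names
raw = pd.DataFrame({
    "car_id": [1, 2],
    "make": ["audi", "bmw"],
    "name": ["audi 1", "bmw 2"],
    "fuel_type": ["gas", "diesel"],
})

ids_and_names = raw.iloc[:, [0, 2]]                       # ID and name columns
attributes = pd.get_dummies(raw.iloc[:, [3]], dtype=int)  # encoded attributes

# Both pieces share raw's index, so rows stay matched up
merged = pd.concat([ids_and_names, attributes], axis=1)
print(merged)
```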

Convert dataframe to Array

After we have the data we want to work on, the dataframe needs to be converted to an array format.

df_array = pd_df2.to_numpy()

Convert Array to a Dictionary

The converted array now needs to be turned into a dictionary that maps each car ID to its name and attribute vector.

import numpy as np

carDict = {}

for d in df_array:
    carID = int(d[0])             # column 0: entity ID
    name = d[1]                   # column 1: entity name
    attributes = map(int, d[2:])  # remaining columns: 0/1 attribute flags
    carDict[carID] = (name, np.array(list(attributes)))

At this point, the data is in a form we can use to apply the KNN algorithm.
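
To make the resulting structure concrete, here is a small sketch that runs the same conversion on a hand-written array standing in for df_array (the two rows are made up, and the names toy_array and toy_dict are hypothetical). Each dictionary entry maps a car ID to a (name, attribute vector) tuple.

```python
import numpy as np

# Stand-in for df_array: [ID, name, attribute flags...] per row (made-up rows)
toy_array = np.array([
    [1, "audi 1", 1, 0, 1, 0],
    [2, "bmw 2", 0, 1, 0, 1],
], dtype=object)

toy_dict = {}
for row in toy_array:
    car_id = int(row[0])
    name = row[1]
    # Convert the remaining cells to a numeric vector for distance computations
    toy_dict[car_id] = (name, np.array([int(v) for v in row[2:]]))

print(toy_dict[1])
```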

Applying KNN Algorithm

At this point, we can leverage the KNN algorithm we used in Python Machine Learning Example (KNN).

Prepare `getNeighbors()` and `ComputeDistance()`

getNeighbors()

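import operator

def getNeighbors(carID, K):
    # Compute the distance from the selected car to every other car
    distances = []
    for car in carDict:
        if (car != carID):
            dist = ComputeDistance(carDict[carID], carDict[car])
            distances.append((car, dist))
    # Sort by distance, nearest first
    distances.sort(key=operator.itemgetter(1))

    # Keep the K nearest (car ID, distance) pairs
    neighbors = []
    for x in range(K):
        neighbors.append((distances[x][0], distances[x][1]))
    return neighbors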

ComputeDistance()

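from scipy import spatial

def ComputeDistance(a, b):
    # Each entry is (name, attribute vector); compare the vectors only
    dataA = a[1]
    dataB = b[1]

    # Cosine distance: 0 means the attribute vectors point the same way
    AttributeDistance = spatial.distance.cosine(dataA, dataB)

    return AttributeDistance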
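
The two functions above read carDict as a global. The sketch below is a self-contained variant on made-up data that passes the dictionary in as an argument (the names get_neighbors and toy_cars are hypothetical) and slices the sorted list, so asking for more neighbors than exist returns what is available instead of raising an IndexError.

```python
import operator

import numpy as np
from scipy import spatial

# Made-up cars: ID -> (name, one-hot attribute vector)
toy_cars = {
    1: ("car 1", np.array([1, 0, 1, 0, 1, 0])),
    2: ("car 2", np.array([1, 0, 1, 0, 1, 0])),  # same attributes as car 1
    3: ("car 3", np.array([1, 0, 1, 0, 0, 1])),  # one attribute differs
    4: ("car 4", np.array([0, 1, 0, 1, 0, 1])),  # all attributes differ
}

def get_neighbors(cars, car_id, k):
    """Return up to k (ID, distance) pairs, nearest first."""
    target = cars[car_id][1]
    distances = [
        (other_id, spatial.distance.cosine(target, vector))
        for other_id, (_, vector) in cars.items()
        if other_id != car_id
    ]
    distances.sort(key=operator.itemgetter(1))
    return distances[:k]

for other_id, dist in get_neighbors(toy_cars, 1, 2):
    print(other_id, toy_cars[other_id][0], dist)
```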

Comparison

Now, let’s take a look at how the selected entity compares with the similar entities selected by the KNN algorithm.

Let’s say we use entity ID 5. It has the following attributes.

model	fuel_type	aspiration	num_of_doors	body_style	drive_wheels	engine_location
audi 5	gas	std	four	sedan	4wd	front

The algorithm returned the following entities. Each line shows the entity ID, the entity name, and its distance from entity 5.

145 | subaru  145 | 0.0
4 | audi  4 | 0.16666666666666674
7 | audi  7 | 0.16666666666666674
12 | bmw  12 | 0.16666666666666674
14 | bmw  14 | 0.16666666666666674
15 | bmw  15 | 0.16666666666666674
16 | bmw  16 | 0.16666666666666674
18 | bmw  18 | 0.16666666666666674
21 | chevrolet  21 | 0.16666666666666674
26 | dodge  26 | 0.16666666666666674

Entity ID 145, subaru 145, has a distance of 0. Its attributes are shown in the table below.

model	fuel_type	aspiration	num_of_doors	body_style	drive_wheels	engine_location
subaru 145	gas	std	four	sedan	4wd	front

This means it was an exact match. The second nearest was audi 4, whose attributes are shown below.

model	fuel_type	aspiration	num_of_doors	body_style	drive_wheels	engine_location
audi 4	gas	std	four	sedan	fwd	front

Everything is the same except drive_wheels.
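
The value 0.1666... follows from how cosine distance behaves on these 0/1 vectors. Assuming no missing values, each car has exactly one 1 per attribute, so six 1s in total. Two cars that agree on five of the six attributes share five 1s, giving a cosine similarity of 5/6 and a distance of 1 - 5/6 = 1/6. The sketch below checks this with shortened made-up vectors (two columns per attribute rather than the real 17 columns), computing the distance by hand with NumPy:

```python
import numpy as np

# Six attributes, two columns each (made-up layout); exactly one 1 per attribute
car_a = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
car_b = np.array([1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0])  # differs in the fifth attribute

# Cosine distance = 1 - (a . b) / (|a| * |b|)
similarity = car_a @ car_b / (np.linalg.norm(car_a) * np.linalg.norm(car_b))
distance = 1 - similarity
print(distance)
```

The same reasoning suggests that every other neighbor listed with this distance differs from entity 5 in exactly one attribute.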

As you can see from the result, the KNN algorithm is pretty useful for identifying similar entities based on the given attributes.
