Python Machine Learning KNN Example from CSV data

In the previous post, Python Machine Learning Example (KNN), we used movie catalog data whose categories were already encoded as 0s and 1s. In this tutorial, let’s pick a dataset with raw values, encode them ourselves, and see if we can get any interesting insights.

Data Source

Let’s use the dataset example from Kaggle.com. I picked the car dataset for this example.

Dataset Sample

https://www.kaggle.com/jingbinxu/sample-of-car-data/version/1

Data Clean Up and Prep

First, we need to clean up the data and prepare it so that we can apply KNN.

Once the dataset is downloaded, the first thing we need to do is generate a new dataset with 0s and 1s for the selected attributes.

There are several attributes in the dataset, but I will only be using the following ones.

  • fuel_type (2)
  • aspiration (2)
  • num_of_doors (3)
  • body_style (5)
  • drive_wheels (3)
  • engine_location (2)

The number next to each attribute is the number of variations (distinct values) in that attribute. For example, fuel_type has two variations (gas and diesel), aspiration has two (std and turbo), and so on.
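One way to check counts like these is DataFrame.nunique(). The following is a minimal sketch on a made-up three-row frame; the rows are illustrative and are not taken from the Kaggle file.

```python
import pandas as pd

# Toy rows for illustration only; the real values come from the CSV file.
toy = pd.DataFrame({
    "fuel_type": ["gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "std"],
    "engine_location": ["front", "front", "rear"],
})

# nunique() counts the distinct values (variations) in each column.
variation_counts = toy.nunique()
print(variation_counts)
```

Running the same call on the six selected columns of the real dataset should show how many variations each attribute has.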

Then, we will convert these variations to 0s and 1s.

pd.get_dummies() checks the variations in each attribute and automatically creates a new column for each variation.

For example, for the fuel_type attribute it creates the new columns fuel_type_gas and fuel_type_diesel. If an entity has the value gas in the original fuel_type attribute, it gets 1 in fuel_type_gas and 0 in fuel_type_diesel. The same is applied to the rest of the attributes, which generates a total of 17 columns.

The table will look something like below:

entity | fuel_type_gas | fuel_type_diesel | aspiration_std | aspiration_turbo
entity 1 | 1 | 0 | 1 | 0
entity 2 | 0 | 1 | 0 | 1
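To see the conversion on its own, here is a small sketch with two made-up entities that mirror the table above. The frame name toy is an assumption for illustration. Note that pandas orders the new columns alphabetically within each attribute, so fuel_type_diesel comes before fuel_type_gas, and that pandas 2.0 and later returns True/False unless dtype=int is passed.

```python
import pandas as pd

# Two made-up entities, for illustration only.
toy = pd.DataFrame({
    "fuel_type": ["gas", "diesel"],
    "aspiration": ["std", "turbo"],
})

# dtype=int forces 0/1 output instead of True/False.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Each original column is replaced by one column per variation, and every row has exactly one 1 per original attribute.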

Libraries used in this article

We will be using the following libraries. If you do not have them on your system, use pip install to install them.

  • operator (part of the Python standard library, so no installation is needed)
  • pandas
  • numpy
  • scipy

pip install the libraries

pandas

pip install pandas

numpy

The numpy library should be installed along with pandas as one of its dependencies. If you don’t have numpy, pip install it as follows:

pip install numpy

scipy

pip install scipy

Load Dataset

Load the dataset that will be used in this article. It is a CSV file with raw data, and it becomes a pandas dataframe once it is loaded.

The sample data file can be downloaded here

import pandas as pd

pd_df = pd.read_csv('data/car_dataset.csv')

Create Dataframes

We will be creating two pandas dataframes. One dataframe is for the entity name and its ID. The other dataframe is for the attributes. After the attribute values are converted to 0s and 1s, these two dataframes will be merged into one dataframe.

First Dataframe (pd_df0)

pd_df0 is a dataframe for the entity name and its ID. It extracts all rows of columns 0 and 2 from the loaded CSV file.

pd_df0 = pd_df.iloc[:, [0, 2]]

Second Dataframe (pd_df1)

pd_df1 will contain the attributes. It extracts columns 3, 4, 5, 6, 7 and 8.

pd_df1 = pd_df.iloc[:, [3, 4, 5, 6, 7, 8]]

Once these are extracted, we will use pd.get_dummies(pd_df1) to convert the attribute values to 0s and 1s.

pd_df1 = pd.get_dummies(pd_df1)

Merged Dataframe (pd_df2)

Now the two dataframes will be merged into one dataframe using pd.concat().

pd_df2 = pd.concat([pd_df0, pd_df1], axis=1, sort=False)
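As a quick illustration of what axis=1 does, the sketch below joins two made-up frames side by side. The names names, flags and merged are assumptions for this example only. pd.concat() matches rows by index label, which works here because both frames share the same default index, just as pd_df0 and pd_df1 share the index of pd_df.

```python
import pandas as pd

# Made-up frames standing in for pd_df0 (ID and name) and pd_df1 (0/1 flags).
names = pd.DataFrame({"id": [1, 2], "model": ["car a", "car b"]})
flags = pd.DataFrame({"fuel_type_gas": [1, 0], "fuel_type_diesel": [0, 1]})

# axis=1 places the frames side by side, matching rows by index label.
merged = pd.concat([names, flags], axis=1)
print(merged)
```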

Convert dataframe to Array

After we have the data we want to work on, the dataframe needs to be converted to an array format.

df_array = pd_df2.to_numpy()

Convert Array to a Dictionary

The array now needs to be converted into a dictionary keyed by entity ID.

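import numpy as np

carDict = {}

for d in df_array:
    carID = int(d[0])   # column 0: entity ID
    name = d[1]         # column 1: entity name
    # Remaining columns: the 0/1 attribute flags, stored as an integer array.
    attributes = np.array([int(v) for v in d[2:]])
    carDict[carID] = (name, attributes)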
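To make the resulting structure concrete, here is a small sketch that runs the same loop on a made-up array and then looks up one entry. The names toy_array and toy_dict and the rows themselves are hypothetical.

```python
import numpy as np

# Hypothetical rows in the same layout as df_array: [id, name, flag, flag, ...].
toy_array = np.array([
    [1, "car a", 1, 0, 1, 0],
    [2, "car b", 0, 1, 0, 1],
], dtype=object)

toy_dict = {}
for row in toy_array:
    car_id = int(row[0])
    toy_dict[car_id] = (row[1], np.array([int(v) for v in row[2:]]))

# Each entry is a (name, attribute_array) tuple keyed by the entity ID.
name, attributes = toy_dict[2]
print(name, attributes)
```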

Now we have the data in a form we can use to apply the KNN algorithm.

Applying KNN Algorithm

At this point, we can leverage the KNN algorithm we used in Python Machine Learning Example (KNN).

Prepare getNeighbors() and ComputeDistance()

getNeighbors()

import operator

def getNeighbors(carID, K):
    # Compute the distance from the selected car to every other car.
    distances = []
    for car in carDict:
        if car != carID:
            dist = ComputeDistance(carDict[carID], carDict[car])
            distances.append((car, dist))
    # Sort by distance, nearest first.
    distances.sort(key=operator.itemgetter(1))

    # Return the K nearest (carID, distance) pairs.
    # Slicing also avoids an IndexError when K exceeds the number of cars.
    return distances[:K]

ComputeDistance()

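from scipy import spatial

def ComputeDistance(a, b):
    # Each argument is a (name, attribute_array) tuple from carDict.
    dataA = a[1]
    dataB = b[1]

    # Cosine distance: 0 means the two attribute vectors are identical in direction.
    AttributeDistance = spatial.distance.cosine(dataA, dataB)

    return AttributeDistance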
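To try the two functions without the CSV file, the sketch below runs the same logic on four made-up cars. The names toy_cars, compute_distance and get_neighbors are hypothetical stand-ins for carDict, ComputeDistance() and getNeighbors(), and the vectors are invented for illustration.

```python
import operator

import numpy as np
from scipy import spatial

# Made-up one-hot vectors (three attributes, two variations each).
toy_cars = {
    1: ("car a", np.array([1, 0, 1, 0, 1, 0])),
    2: ("car b", np.array([1, 0, 1, 0, 1, 0])),  # same attributes as car a
    3: ("car c", np.array([1, 0, 1, 0, 0, 1])),  # differs in one attribute
    4: ("car d", np.array([0, 1, 0, 1, 0, 1])),  # differs in every attribute
}

def compute_distance(a, b):
    # Cosine distance between the two attribute vectors.
    return spatial.distance.cosine(a[1], b[1])

def get_neighbors(car_id, k):
    # Distance from the selected car to every other car, nearest first.
    distances = [
        (other, compute_distance(toy_cars[car_id], toy_cars[other]))
        for other in toy_cars
        if other != car_id
    ]
    distances.sort(key=operator.itemgetter(1))
    return distances[:k]

neighbors = get_neighbors(1, 2)
for other, dist in neighbors:
    print(other, toy_cars[other][0], round(dist, 4))
```

With these toy vectors, car b should come back first at a distance of about 0, followed by car c at about 0.3333, while car d is the farthest and is left out when K is 2.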

Comparison

Now, let’s take a look at how the selected entity compares with the similar entities selected by the KNN algorithm.

Let’s say we use the entity ID 5. It has the following attributes.

model | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location
audi 5 | gas | std | four | sedan | 4wd | front

The algorithm returned the following entities.

145 | subaru  145 | 0.0
4 | audi  4 | 0.16666666666666674
7 | audi  7 | 0.16666666666666674
12 | bmw  12 | 0.16666666666666674
14 | bmw  14 | 0.16666666666666674
15 | bmw  15 | 0.16666666666666674
16 | bmw  16 | 0.16666666666666674
18 | bmw  18 | 0.16666666666666674
21 | chevrolet  21 | 0.16666666666666674
26 | dodge  26 | 0.16666666666666674

Entity ID 145, subaru 145, has a distance of 0. Its attributes are shown in the table below.

model | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location
subaru 145 | gas | std | four | sedan | 4wd | front

This means it is an exact match on all six attributes. The second nearest one is audi 4, which has the following attributes.

model | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location
audi 4 | gas | std | four | sedan | fwd | front

Everything is the same except drive_wheels.
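The distance of about 0.1667 follows directly from the encoding. Assuming each car's vector holds only the 17 one-hot columns, it contains exactly six 1s, one per attribute. Two cars that share five of the six attributes have a dot product of 5 and a norm of √6 each, so the cosine distance is 1 - 5/6 = 1/6. The sketch below checks this with hypothetical vectors; the column positions are made up and only the number of shared 1s matters.

```python
import numpy as np

# Hypothetical 17-column one-hot vectors with six 1s each (one per attribute).
car_a = np.zeros(17, dtype=int)
car_b = np.zeros(17, dtype=int)
car_a[[0, 2, 4, 7, 12, 15]] = 1
car_b[[0, 2, 4, 7, 13, 15]] = 1  # same as car_a except for one attribute

shared = int(car_a @ car_b)  # number of attributes in common
norms = np.linalg.norm(car_a) * np.linalg.norm(car_b)  # sqrt(6) * sqrt(6)
distance = 1 - shared / norms
print(shared, distance)
```

By the same reasoning, each further attribute that differs adds another 1/6 to the distance, which is why so many cars in the result list tie at the same value.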

As you can see from the result, the KNN algorithm is useful for identifying similar entities based on the given attributes.
