Python Dask Example on Large Data Set
Python is a very useful language to process data. Using Pandas library can easily manipulate the data such as sorting the data, getting the top 5, etc. However, if you are dealing with large data sets (bigger than your machine memory) you will hit the roadblock on out of memory problem. Since Pandas is processing the data in-memory, it can only process the data smaller than the memory of the machine you are using to process the data.
In the scenario of processing the data larger than the machine memory, you need to load the data in a chunk that the data is smaller than your machine memory size. You can do this manually or you can use the Python library such as Dask library to do the chunking under the hood for you.
What is Dask?
Dask is a flexible library for parallel computing in Python. It is similar to Pandas it manipulates data but it can handle larger than the machine memory.
Example Use Case of using Dask
In this example, it will return the first n rows by columns in descending order. In this particular example, it will sort by ‘NUM_VALUE’ column. Below is the sample code. I will explain how things work in this code.
# using dask to handle the big dataset | |
import dask.dataframe as dd | |
# use numpy to conver the dask dataframe object to array object | |
import numpy as np | |
# load the dataset file | |
file_name = 'sample_data_set.csv' | |
#row_count to output | |
row_count = int(10) | |
# read the csv file and convert it to dask dataframe | |
df = dd.read_csv(file_name, error_bad_lines=False) | |
# using nlargest to return the first n rows ordered by columns in descending order. | |
# in this case, use 'NUM_VALUE' column as a column to order by | |
# reference: http://docs.dask.org/en/latest/dataframe-api.html?highlight=nlargest#dask.dataframe.DataFrame.nlargest | |
df2 = df.nlargest(row_count, 'NUM_VALUE') | |
# get the UIDs of n rows | |
l = df2['UID'].values | |
# convert the extracted UIDs to array object | |
n = np.array(l) | |
# output the UIDs | |
for x in n: | |
print(x) |
If you don’t have Dask library in your system, you need to install that library first. In this example, we only need to use dataframe functionality thus, you can just install the Dask library as below:
pip3 install dask[dataframe]
If you want to install the whole Dask library, you can install by using [complete]
pip3 install dask[complete]
Please refer to official document for how to install Dask library in your system.
Now let’s dig into the sample code.
L2: import dask.dataframe as dd
this will declare that we need dask.dataframe in this code and we will be using this as dd (shorthand).
L5: import numpy as np
Line 5 will import numpy library. Since in this example, it will convert the Dask dataframe into array object once sorting is done.
L8: file_name = 'sample_data_set.csv'
Assign the file you want to load and process. Use this sample script to generate a huge amount of data.
L11: row_count = int(10)
Here, you can assign how many rows you want to pick once it sorted out.
L14: df = dd.read_csv(file_name, error_bad_lines=False)
This line will read the csv file. error_bad_lines is set to False in this example. This means, it will drop that particular line if it is a “bad line”
L19: df2 = df.nlargest(row_count, 'NUM_VALUE')
It will be sorting the dataset in descending order by NUM_VALUE column with assigned number of rows set in “row_count”.
L22: l = df2['UID'].values
The dataframe has “UID” column. It will get the value in the “UID” column.
L25: n = np.array(l)
Using numpy’s np.array(), it will convert the Dask dataframe object to Python’s array object. And in the the for loop, you can print out the array object if you want to verify the values.
You want to learn more about Data Science using Python? Check out some books at Amazon.
- How to convert MD (markdown) file to PDF using Pandoc on macOS Ventura 13
- How to make MD (markdown) document
- How to Install Docker Desktop on mac M1 chip (Apple chip) macOS 12 Monterey
- How to install MySQL Workbench on macOS 12 Monterey mac M1 (2021)
- How to install MySQL Community Server on macOS 12 Monterey (2021)