Dask Example for Data Science in Python

Python Dask Example on Large Data Set

Python is a very useful language to process data. Using Pandas library can easily manipulate the data such as sorting the data, getting the top 5, etc. However, if you are dealing with large data sets (bigger than your machine memory) you will hit the roadblock on out of memory problem. Since Pandas is processing the data in-memory, it can only process the data smaller than the memory of the machine you are using to process the data.

In the scenario of processing the data larger than the machine memory, you need to load the data in a chunk that the data is smaller than your machine memory size. You can do this manually or you can use the Python library such as Dask library to do the chunking under the hood for you.

What is Dask?

Dask is a flexible library for parallel computing in Python. It is similar to Pandas it manipulates data but it can handle larger than the machine memory.

Example Use Case of using Dask

In this example, it will return the first n rows by columns in descending order. In this particular example, it will sort by ‘NUM_VALUE’ column. Below is the sample code. I will explain how things work in this code.

	# using dask to handle the big dataset
	import dask.dataframe as dd

	# use numpy to conver the dask dataframe object to array object
	import numpy as np

	# load the dataset file
	file_name = 'sample_data_set.csv'

	#row_count to output
	row_count = int(10)

	# read the csv file and convert it to dask dataframe
	df = dd.read_csv(file_name, error_bad_lines=False)

	# using nlargest to return the first n rows ordered by columns in descending order.
	# in this case, use 'NUM_VALUE' column as a column to order by
	# reference: http://docs.dask.org/en/latest/dataframe-api.html?highlight=nlargest#dask.dataframe.DataFrame.nlargest
	df2 = df.nlargest(row_count, 'NUM_VALUE')

	# get the UIDs of n rows
	l = df2['UID'].values

	# convert the extracted UIDs to array object
	n = np.array(l)

	# output the UIDs
	for x in n:
	print(x)

view raw python-dask-example.py hosted with ❤ by GitHub

If you don’t have Dask library in your system, you need to install that library first. In this example, we only need to use dataframe functionality thus, you can just install the Dask library as below:

pip3 install dask[dataframe]

If you want to install the whole Dask library, you can install by using [complete]

pip3 install dask[complete]

Please refer to official document for how to install Dask library in your system.

Now let’s dig into the sample code.

L2: import dask.dataframe as dd

this will declare that we need dask.dataframe in this code and we will be using this as dd (shorthand).

L5: import numpy as np

Line 5 will import numpy library. Since in this example, it will convert the Dask dataframe into array object once sorting is done.

L8: file_name = 'sample_data_set.csv'

Assign the file you want to load and process. Use this sample script to generate a huge amount of data.

L11: row_count = int(10)

Here, you can assign how many rows you want to pick once it sorted out.

L14: df = dd.read_csv(file_name,  error_bad_lines=False)

This line will read the csv file. error_bad_lines is set to False in this example. This means, it will drop that particular line if it is a “bad line”

L19: df2 = df.nlargest(row_count, 'NUM_VALUE')

It will be sorting the dataset in descending order by NUM_VALUE column with assigned number of rows set in “row_count”.

L22: l = df2['UID'].values

The dataframe has “UID” column. It will get the value in the “UID” column.

L25: n = np.array(l)

Using numpy’s np.array(), it will convert the Dask dataframe object to Python’s array object. And in the the for loop, you can print out the array object if you want to verify the values.

You want to learn more about Data Science using Python? Check out some books at Amazon.