Introduction to Python Machine Learning and Dealing with NaN Values in a Data Set using Pandas

Rahul Kotecha

Published in

Python in Plain English

5 min read · Jan 27, 2022

What is Machine Learning?

There are multiple ways to describe the concept of machine learning:

Machine learning is a branch of computer science that gives computers the ability to learn without being explicitly programmed.
Machine learning is a process of extracting patterns or structures and making predictions using data.
Machine learning is the semi-automated extraction of knowledge from data. "Knowledge from data" means that the solution to a problem lies within the data. "Semi-automated" means that interpreting the data still requires some human effort. In machine learning, we provide the input and define the desired output, and we receive a generated program that produces that output.

Before getting into the details of how machine learning works, we need to understand how to deal with "NaN" ("Not a Number") values in a data set.

Rules for dealing with NaN observations in a dataset

There are certain rules which need to be kept in mind while dealing with NaN values and they are as follows:

Rule 1: If NaN values make up less than 2% of the observations in a column → use the .dropna() function to drop the affected rows.

Rule 2: If NaN values make up between 3% and 40% of the observations in a column → use the .fillna() function to fill the NaN observations.

Rule 3: If NaN values make up more than 40% of the observations in a column → use the .drop() function to delete the entire column.
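To apply these rules we first need the share of NaN values in each column. The sketch below is one possible way to compute it and map it to the suggested function. The helper name `suggest_nan_action` and the demo data are hypothetical, and since the rules say nothing about the range between 2% and 3%, the sketch assumes those columns are filled as well.

```python
import numpy as np
import pandas as pd


def suggest_nan_action(df: pd.DataFrame) -> pd.DataFrame:
    """Return the NaN percentage per column and the action the rules suggest."""
    # isna().mean() is the fraction of missing values in each column
    nan_pct = df.isna().mean() * 100

    def pick(pct: float) -> str:
        if pct == 0:
            return "nothing to do"
        if pct < 2:
            return "dropna (drop rows)"
        if pct <= 40:  # assumption: the 2%-3% gap is treated like Rule 2
            return "fillna (fill values)"
        return "drop (drop column)"

    return pd.DataFrame({"nan_pct": nan_pct, "action": nan_pct.map(pick)})


# Hypothetical demo data: 100 rows with 1%, 10%, 50% and 0% NaN values
demo = pd.DataFrame({
    "a": [np.nan] * 1 + [1.0] * 99,
    "b": [np.nan] * 10 + [2.0] * 90,
    "c": [np.nan] * 50 + [3.0] * 50,
    "d": [4.0] * 100,
})

report = suggest_nan_action(demo)
print(report)
```

The thresholds are rules of thumb, so the report is a starting point for a decision rather than a replacement for looking at the data.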

For any program we undertake, we import certain critical packages into our program, namely NumPy, Pandas, Matplotlib and Seaborn:

**Importing Numpy, Pandas, Matplotlib and Seaborn**

Remembering these three key rules is not enough. We also need to check certain other details before applying any of these functions. We will look at examples to better understand the implementation of each rule.

1. NaN observations less than 2% of total observations → .dropna() function: The dropna() function is used to drop the rows that contain NaN values from the DataFrame.

Default syntax → df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Here axis stays at 0 (rows) rather than 1 (columns), because our aim is to drop only rows with this function. The default how="any" drops a row if any cell in that row is NaN. Setting how="all" without defining a subset drops a row only if all of its cells are NaN. Setting how="all" together with a subset, given as a list of column names, drops a row if all of its cells in the subset columns are NaN. Finally, inplace=True applies the change to the DataFrame itself instead of returning a modified copy.

**DataFrame to be used for the example**

**DataFrame information about dtype, sum of NaN values**
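The original example data is not reproduced here, so the following is a minimal sketch with a small, made-up DataFrame (the column names and values are assumptions). It shows one way to inspect the dtypes and count the NaN values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the article's original DataFrame is not reproduced here
df = pd.DataFrame({
    "name": ["Asha", "Ben", None, "Dev", "Eli"],
    "age": [25.0, np.nan, np.nan, np.nan, 33.0],
    "salary": [50000.0, 62000.0, np.nan, np.nan, 58000.0],
})

df.info()                          # dtypes and non-null counts per column
nan_counts = df.isna().sum()       # number of NaN values per column
nan_pct = df.isna().mean() * 100   # the same, as a percentage of rows

print(nan_counts)
print(nan_pct)
```

With only five rows the percentages are far above the thresholds in the rules, so this data set is meant for showing the function calls, not for applying the rules themselves.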

**Using dropna() → adjusted syntax how="all"**

**Using dropna() → adjusted syntax how="all" and subset defined**
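As a rough illustration of these dropna() variants, the sketch below uses a small, made-up DataFrame (an assumption, not the original data) and compares the default how="any" with how="all", with and without a subset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is entirely NaN, row 3 is NaN in age and salary
df = pd.DataFrame({
    "name": ["Asha", "Ben", None, "Dev", "Eli"],
    "age": [25.0, np.nan, np.nan, np.nan, 33.0],
    "salary": [50000.0, 62000.0, np.nan, np.nan, 58000.0],
})

# Default how="any": drop every row that has at least one NaN
any_dropped = df.dropna()

# how="all": drop a row only if every cell in it is NaN
all_dropped = df.dropna(how="all")

# how="all" with a subset: drop a row only if all listed columns are NaN
subset_dropped = df.dropna(how="all", subset=["age", "salary"])

print(any_dropped.index.tolist())     # [0, 4]
print(all_dropped.index.tolist())     # [0, 1, 3, 4]
print(subset_dropped.index.tolist())  # [0, 1, 4]
```

None of these calls changes df, because inplace is left at its default of False. Each call returns a new DataFrame.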

2. NaN observations between 3%-40% of total observations → .fillna() function: Using the .fillna() function we can fill the NaN values in a data set. It is a suitable solution for a data set whose NaN observations lie anywhere between 3% and 40% of the total number of observations. We can fill with the mean or median value, chosen after checking the distribution, or use one of its four method options: backfill, bfill, ffill and pad. The default syntax of .fillna() is:

data.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Example of using the .fillna() function to fill NaN values with the mean or median value, after checking the distribution of the observations.

**Using .describe() to get additional details about Mean and Median**

**Storing Median value of columns in variable x and y**
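The two steps above could look roughly like the following sketch. The data is made up, and the salary column deliberately contains one very large value so that the mean and the median differ noticeably. The variable names x and y follow the caption above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN per column and a skewed salary column
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 41.0, 33.0],
    "salary": [50000.0, 62000.0, np.nan, 250000.0, 58000.0],
})

# Compare the mean with the 50% row (the median) to judge the skew
print(df.describe())

# Store the median of each column in x and y
x = df["age"].median()
y = df["salary"].median()

# Fill each column's NaN values with that column's median
filled = df.fillna({"age": x, "salary": y})
print(filled)
```

Here the salary mean is pulled upward by the single large value, so the median is the more representative fill value. For a roughly symmetric column the mean would serve equally well.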

Example of using the .fillna() function to fill NaN values with its methods → backfill, bfill, ffill and pad.

**Using backfill method to fill NaN values**

**Using bfill method to fill NaN values**

**Using ffill method to fill NaN values**
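The sketch below illustrates backward and forward filling on a made-up column. One caveat: recent pandas releases deprecate the method argument of fillna(), so this sketch assumes a current pandas version and calls DataFrame.bfill() and DataFrame.ffill() directly, which fill in the same way as method="bfill" and method="ffill".

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaN values at the start, middle and end
df = pd.DataFrame({"reading": [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan]})

# Backward fill: each NaN takes the next valid value below it
back = df.bfill()

# Forward fill: each NaN takes the last valid value above it
fwd = df.ffill()

print(back["reading"].tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
print(fwd["reading"].tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
```

Note that a backward fill cannot fill a trailing NaN and a forward fill cannot fill a leading NaN, because there is no value to copy from.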

We can observe that the output of the backfill and bfill methods is the same, and so is the output of the ffill and pad methods of the fillna() function. The question now arises of which method to use and in which circumstance. This depends on the underlying problem we are trying to solve, or on the perspective we are trying to bring into the spotlight.

3. NaN observations greater than 40% of total observations → .drop() function: The drop() function deletes the entire column from the DataFrame, and inplace=True applies the change to the DataFrame itself. An example of using the .drop() function is as follows:
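The original example is not reproduced here, so this is a minimal sketch with made-up data. The column names and the way the sparse columns are selected are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the bonus column is 80% NaN
df = pd.DataFrame({
    "age": [25.0, 31.0, 41.0, 33.0, 29.0],
    "bonus": [np.nan, np.nan, 1000.0, np.nan, np.nan],
})

# Find the columns whose share of NaN values exceeds 40%
nan_share = df.isna().mean()
too_sparse = nan_share[nan_share > 0.40].index.tolist()

# Delete those columns; inplace=True modifies df itself
df.drop(columns=too_sparse, inplace=True)

print(df.columns.tolist())  # ['age']
```

Passing columns=... is equivalent to passing the labels with axis=1. Without inplace=True, drop() returns a new DataFrame and leaves df unchanged.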