Introduction to Python Machine Learning and Dealing with NaN Values in a Data Set using Pandas

Rahul Kotecha
Python in Plain English
5 min read · Jan 27, 2022


Photo by Pietro Jeng on Unsplash

What is Machine Learning?

There are multiple ways to describe the concept of machine learning:

  1. Machine learning is a branch of computer science that gives computers the ability to learn without being explicitly programmed.
  2. Machine learning is the process of extracting patterns or structures from data and using them to make predictions.
  3. Machine learning is the semi-automated extraction of knowledge from data. “Knowledge from data” means that the solution to a problem lies within the data. It is “semi-automated” because interpreting the data still requires some human effort. In machine learning, we provide the input and define the desired output, and we receive a generated program that produces that output.

Before getting into the details of how machine learning works, we need to understand how to deal with “NaN” (“Not a Number”) values in a data set.

Rules for dealing with NaN observations in a dataset

Three rules are worth keeping in mind when dealing with NaN values:

Rule 1: If the number of NaN values in a column is less than 2% of the observations → use the .dropna() function to drop the affected rows.

Rule 2: If the number of NaN values in a column is between 3% and 40% of the observations → use the .fillna() function to fill them.

Rule 3: If the number of NaN values in a column is greater than 40% of the observations → use the .drop() function to delete the entire column.
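
The three rules can be turned into a quick check on the share of NaN values per column. The following is only a minimal sketch: the function name nan_strategy and the example data are hypothetical, and because the rules leave the 2% to 3% range and the exact 40% boundary unspecified, the sketch assumes that everything from 2% up to and including 40% falls under Rule 2.

```python
import numpy as np
import pandas as pd


def nan_strategy(df: pd.DataFrame) -> pd.DataFrame:
    """Suggest a NaN-handling rule for each column (hypothetical helper)."""
    # Percentage of NaN observations in each column.
    nan_pct = df.isna().mean() * 100

    def pick(pct: float) -> str:
        if pct == 0:
            return "nothing to do"
        if pct < 2:
            return "dropna (drop rows)"
        # Assumption: 2% up to and including 40% is treated as the fill range.
        if pct <= 40:
            return "fillna (fill values)"
        return "drop (drop column)"

    return pd.DataFrame({"nan_pct": nan_pct, "suggestion": nan_pct.map(pick)})


# Hypothetical data: 100 rows with 1, 20 and 60 NaN values per column.
example = pd.DataFrame({
    "a": [np.nan] * 1 + [1.0] * 99,
    "b": [np.nan] * 20 + [1.0] * 80,
    "c": [np.nan] * 60 + [1.0] * 40,
})
report = nan_strategy(example)
print(report)
```

The report only suggests a rule for each column. The checks described in the rest of the article still apply before any rows or columns are actually removed.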

For any program we write, we import a few essential packages: NumPy, Pandas, Matplotlib and Seaborn.

Importing Numpy, Pandas, Matplotlib and Seaborn
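
The original import statement is not reproduced here, so the following is a sketch that assumes the aliases most commonly used in the community (np, pd, plt and sns):

```python
import numpy as np                # numerical arrays and the np.nan marker
import pandas as pd               # DataFrames and the NaN-handling functions
import matplotlib.pyplot as plt   # general plotting
import seaborn as sns             # statistical plots, e.g. for checking distributions
```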

Remembering these three key rules is not enough; we also need to check certain other details before applying any of these functions. The examples below show how each rule is implemented.

  1. NaN values less than 2% → .dropna() function: The dropna() function drops the rows (or columns) of a DataFrame that contain NaN values.

Default syntax → df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Here axis is 0 (rows) rather than 1 (columns), because our aim is to drop only rows. The default, how="any", drops a row if any cell in it is NaN. Setting how="all" without a subset drops a row only if every cell in it is NaN. Setting how="all" together with a subset (a list of column names) drops a row only if all of the cells in the subset columns are NaN. Finally, inplace=True applies the change to the DataFrame itself instead of returning a modified copy.

Creating DataFrame
DataFrame to be used for the example
DataFrame information about dtype, sum of NaN values
Using dropna() → default syntax
Using dropna() → adjusted syntax how="all"
Using dropna() → adjusted syntax how="all" and subset defined
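
The steps captioned above can be sketched as follows. The DataFrame is hypothetical (the article's original data is not available), and so are its column names. Each call passes only the arguments it changes, because recent pandas versions reject setting how and thresh explicitly in the same call.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for the article's example data.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", np.nan, "Dev", np.nan],
    "Maths": [82.0, np.nan, np.nan, 91.0, np.nan],
    "Science": [75.0, 64.0, np.nan, np.nan, 58.0],
})

# dtype information and the number of NaN values per column.
df.info()
nan_counts = df.isna().sum()
print(nan_counts)

# Default behaviour (how="any"): drop a row if any of its cells is NaN.
dropped_any = df.dropna()

# how="all": drop a row only if every cell in it is NaN.
dropped_all = df.dropna(how="all")

# how="all" with a subset: drop a row only if all cells in the
# listed columns are NaN, whatever the other columns contain.
dropped_subset = df.dropna(how="all", subset=["Name", "Maths"])

print(dropped_any, dropped_all, dropped_subset, sep="\n\n")
```

None of these calls uses inplace=True, so df itself is left unchanged and the three results can be compared side by side.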

2. NaN observations between 3% and 40% of total observations → .fillna() function: The .fillna() function fills the NaN values in a data set. It suits columns where NaN observations make up between 3% and 40% of the total number of observations. NaN values can be filled with the mean or median, chosen after checking the distribution, or by one of the function's four methods: backfill, bfill, ffill and pad. The default syntax of .fillna() is:

data.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Example: using the .fillna() function to fill NaN values with the mean or median, chosen after checking the distribution of observations.

Creating DataFrame
DataFrame to be used as an example
DataFrame information about dtype, sum of NaN values
Using .describe() to get additional details about Mean and Median
Storing Median value of columns in variable x and y
Output with no NaN values
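
A minimal sketch of the steps captioned above, using a hypothetical DataFrame with made-up column names. It fills with the median, as in the captions; which statistic to use is a judgment call after inspecting the distribution, since the median is less sensitive to outliers than the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for the article's example data.
df = pd.DataFrame({
    "Maths": [82.0, np.nan, 67.0, 91.0, np.nan, 74.0],
    "Science": [75.0, 64.0, np.nan, 88.0, 58.0, 70.0],
})

# dtype information and the number of NaN values per column.
df.info()
print(df.isna().sum())

# describe() reports the mean and, in the "50%" row, the median.
print(df.describe())

# Store the median of each column in x and y.
x = df["Maths"].median()
y = df["Science"].median()

# Fill each column's NaN values with that column's own median.
filled = df.fillna({"Maths": x, "Science": y})
print(filled)
```

Passing a dictionary to fillna() lets each column receive its own fill value in a single call.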

Example: using the .fillna() function to fill NaN values with its methods → backfill, bfill, ffill and pad.

DataFrame used as Example
Using backfill method to fill NaN values
Using bfill method to fill NaN values
Using ffill method to fill NaN values
Using pad method to fill NaN values
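
The four methods reduce to two behaviours: backfill and bfill copy the next valid value backward, while ffill and pad carry the last valid value forward. Recent pandas releases deprecate the method argument of fillna(), so this sketch assumes the dedicated df.bfill() and df.ffill() functions instead. The DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for the article's example data.
df = pd.DataFrame({
    "Maths": [82.0, np.nan, 67.0, np.nan],
    "Science": [np.nan, 64.0, np.nan, 88.0],
})

# Backward fill (what method="backfill" / "bfill" does):
# each NaN takes the next valid value below it.
back_filled = df.bfill()

# Forward fill (what method="ffill" / "pad" does):
# each NaN takes the last valid value above it.
forward_filled = df.ffill()

print(back_filled, forward_filled, sep="\n\n")
```

A NaN in the last row survives a backward fill and a NaN in the first row survives a forward fill, because there is no neighbouring value to copy. It is worth re-checking df.isna().sum() afterwards.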

We can observe that the backfill and bfill methods produce the same output, as do the ffill and pad methods. Which method to use, and when, depends on the underlying problem we are trying to solve or the perspective we want to highlight.

3. NaN observations greater than 40% of total observations → .drop() function: The .drop() function deletes the entire column from the DataFrame, and inplace=True applies the change to the DataFrame itself instead of returning a modified copy. An example of using the .drop() function follows:

Creating DataFrame
DataFrame to be used as an Example
DataFrame information about dtype, sum of NaN values
Using drop() function to drop column Machine Learning
Output after dropping column Machine Learning
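
A minimal sketch of the steps captioned above. The column name Machine Learning comes from the captions, while the other columns and all of the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame; only the "Machine Learning" column name
# is taken from the article.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dev", "Eli"],
    "Python": [82.0, 74.0, np.nan, 91.0, 68.0],
    "Machine Learning": [np.nan, np.nan, 77.0, np.nan, np.nan],
})

# dtype information and the share of NaN values per column.
df.info()
nan_pct = df.isna().mean() * 100
print(nan_pct)

# More than 40% of "Machine Learning" is NaN, so drop the whole column.
# inplace=True modifies df directly instead of returning a copy.
df.drop(columns=["Machine Learning"], inplace=True)
print(df)
```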


Student at Stevens Institute of Technology- Masters of Science in Information Systems with a concentration in Business Intelligence and Analytics