Meet Pandas-Profiling: A Python Library for Data Analysis

How to use pandas-profiling for data analysis

R. Gupta
Python in Plain English


In the last article, we looked at some pandas methods that help with analyzing data. If you have not read it, you may want to do so before proceeding, although it is not mandatory. Here we will discuss another Python library that works on top of pandas: pandas-profiling. You can install it in your Python environment by running pip install pandas-profiling in the terminal.


The pandas-profiling library provides a method that generates an analysis report for a given data frame. The report can also be saved as an HTML or a JSON file. It gives a descriptive analysis of any dataset loaded into a pandas data frame, which saves you from writing a lot of code: within minutes, you get an analysis report for the whole dataset.

Run these commands in a Google Colab notebook, or in a Jupyter notebook on your local system, to install the libraries:

!pip install pandas
!pip install pandas-profiling
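To confirm that both installs worked, a quick check like the one below can help. This is only a sketch: installed_version is a hypothetical helper name, and it relies on importlib.metadata from the standard library (Python 3.8 or newer).

```python
from importlib import metadata


def installed_version(package: str):
    """Hypothetical helper: return the installed version string, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints each package name with its version, or None if it is not installed.
for name in ("pandas", "pandas-profiling"):
    print(name, installed_version(name))
```

If None is printed for a package, the install did not succeed in the environment the notebook is using.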

Once the libraries are installed, we need to load a dataset on which to generate the analysis report.

Here we will use the House Prices - Advanced Regression Techniques dataset, which you can download from Kaggle, and see how to analyze it with pandas-profiling. The train.csv file has 81 columns in total, including the target column SalePrice, which leaves 80 input columns. The task is to predict SalePrice from these inputs. First, we download the train.csv file from Kaggle, upload it to Google Colab, and load it into a pandas data frame using the code below:

from google.colab import files
import pandas as pd

# this one-line call lets you upload a file from your local system to Google Colab
uploaded = files.upload()

# train.csv is the name of the file that has been uploaded from the local system
df = pd.read_csv("train.csv")
df.describe()
Output of df.describe()
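To see what df.describe() leaves out, here is a small sketch on a made-up data frame (the column names and values are invented for illustration and are not taken from the House Prices data). By default, describe() summarizes only the numeric columns, and missing values have to be counted separately.

```python
import pandas as pd

# Toy data frame; names and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "area": [850.0, 1200.0, None, 1500.0],
        "rooms": [2, 3, 3, 4],
        "city": ["Pune", "Delhi", "Pune", None],
    }
)

# Default describe(): numeric columns only, so "city" is skipped.
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" also covers the text column (count, unique, top, freq).
full_summary = toy.describe(include="all")
print(full_summary)

# Missing values are not listed by describe(); count them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Getting a complete picture this way already takes three separate calls, which is the kind of manual work a profiling report bundles into one place.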

Note: Although the df.describe() method of pandas gives a descriptive analysis of the features, it is not as convenient as the ProfileReport of pandas-profiling. Now we will run pandas_profiling.ProfileReport on the data frame loaded above. It may take a couple of minutes when run on a CPU.

from pandas_profiling import ProfileReport
reportGenerated = ProfileReport(df)
reportGenerated
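If the full report is too slow on a wide table, one possible workaround is to profile a random sample of rows first. The sketch below is an assumption on my part rather than a required step: sample_for_profiling is a hypothetical helper, the cap of 500 rows is arbitrary, and the demo data frame is a stand-in for your own df. The minimal=True argument of ProfileReport switches off the more expensive parts of the report. The import is wrapped in a try block so that the snippet still runs where the library is not installed.

```python
import pandas as pd


def sample_for_profiling(frame: pd.DataFrame, max_rows: int = 500) -> pd.DataFrame:
    """Hypothetical helper: return at most max_rows randomly chosen rows."""
    if len(frame) <= max_rows:
        return frame
    # A fixed random_state keeps the sample reproducible between runs.
    return frame.sample(n=max_rows, random_state=0)


# Stand-in data so the snippet is self-contained; use your own df instead.
demo = pd.DataFrame({"x": range(2000), "y": [i % 7 for i in range(2000)]})
small = sample_for_profiling(demo)
print(small.shape)

try:
    from pandas_profiling import ProfileReport

    # minimal=True turns off the costlier computations in the report.
    quick_report = ProfileReport(small, title="Quick look", minimal=True)
except ImportError:
    quick_report = None  # library not installed in this environment
```

A report built on a sample is only an approximation, so it is worth running the full report once the quick look seems reasonable.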

You can also save this report in HTML or JSON format using the commands below:

reportGenerated.to_file("Analysis.html")
reportGenerated.to_file("Analysis.json")
The report is saved as Analysis.html and Analysis.json.

From Colab, you can download these files and open them on your local system. There is also another way to view a profiling report inside a notebook, to_widgets(), but at the time of writing Google Colab does not support it. You can try to_widgets() in a Jupyter notebook on your local system.

Here is a GIF image of how the generated report looks. You can open the saved HTML file in your browser by double-clicking on it.

Generated Report

In this file, the overview lists the number of variables, the number of observations, duplicate rows, and missing values. After that, a detailed analysis is shown for each feature in the dataset: how many distinct values it has, how many of its values are missing and what percentage that is, quantile statistics (minimum, Q1, median, Q3, maximum, range, interquartile range), and descriptive statistics (mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness). You can also click on the “toggle details” button to see more details.
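To make these statistics concrete, the sketch below recomputes a handful of them with plain pandas for a single numeric column. The values are made up, and the formulas are the usual textbook definitions (for example, coefficient of variation as standard deviation divided by mean). The report's own implementation may differ in details, such as how the median absolute deviation is scaled.

```python
import pandas as pd

# Made-up values with one missing entry, for illustration only.
s = pd.Series([10.0, 20.0, 20.0, 30.0, 40.0, None], name="price")

# Quartiles; pandas skips missing values and interpolates linearly by default.
q1, median, q3 = s.quantile([0.25, 0.5, 0.75])

stats = {
    "distinct": s.nunique(),  # missing values are not counted
    "missing": int(s.isna().sum()),
    "missing_pct": s.isna().mean() * 100,
    "minimum": s.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "maximum": s.max(),
    "range": s.max() - s.min(),
    "IQR": q3 - q1,
    "mean": s.mean(),
    "mode": s.mode().iloc[0],
    "std": s.std(),  # sample standard deviation (ddof=1)
    "sum": s.sum(),
    "MAD": (s - median).abs().median(),  # unscaled median absolute deviation
    "CV": s.std() / s.mean(),
    "kurtosis": s.kurt(),
    "skewness": s.skew(),
}

for name, value in stats.items():
    print(f"{name:>12}: {value}")
```

Doing this for all 81 columns by hand would be tedious, which is exactly the work the report takes over.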

It is a really handy tool for getting a descriptive analysis of any dataset.

Some other variations of the above function are shown here:

# You can also give the report a title.
reportGenerated = ProfileReport(df, title="Pandas Profiling Report")

# You can also specify the number of bins for the histograms.
profile = df.profile_report(
    title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")

You can read more about the pandas-profiling project in its GitHub repository, where all the recent developments in the package can be seen.

I hope you enjoyed reading this article. If you liked it, you can give it a clap and follow me on Medium. Thanks for your valuable time. Stay tuned for upcoming articles.

