Data Exploration with Pandas and Matplotlib

Exploring an Insurance dataset by using Pandas for dataframes and Matplotlib to produce graphs for the analysis

Angel Mariano
Python in Plain English


TOOLS

Pandas: A fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python.

In this article, we will be exploring an Insurance dataset by using Pandas for dataframes and Matplotlib to produce graphs for the analysis.

The data that we will be analyzing contains several variables such as age, sex, children, smoker, region, and charges. Through this analysis, we can determine the relationship between several of these factors and insurance charges.

We can check some basic information about the data using Pandas:

Here, I first imported the libraries, pandas and matplotlib, and then read the insurance file.
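As a minimal sketch of this step: the file name `insurance.csv` and the helper `load_insurance` are assumptions made for illustration, not names taken from the original notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt  # used later for the graphs


def load_insurance(path="insurance.csv"):
    """Read the insurance CSV into a DataFrame (the file name is an assumption)."""
    return pd.read_csv(path)
```

With the file in the working directory, `df = load_insurance()` would give the dataframe used in the rest of the analysis.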

To check the first and last rows of the data:
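A small sketch of this check with `head` and `tail`. The rows below are made-up placeholders so the snippet runs on its own; in practice `df` would be the loaded insurance data.

```python
import pandas as pd

# Made-up placeholder rows, only so the snippet is self-contained.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 61, 27],
    "charges": [1500.0, 4200.0, 7800.0, 9900.0, 13000.0, 2100.0],
})

first_rows = df.head(3)  # first 3 rows (the default is 5)
last_rows = df.tail(3)   # last 3 rows (the default is 5)
print(first_rows)
print(last_rows)
```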

To check other information such as the data types, we can use .info(), and use .describe() for summary statistics:
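For illustration, here is how the two calls behave on a tiny frame. The values are made-up placeholders, and the column names follow the variables listed earlier.

```python
import pandas as pd

# Made-up placeholder rows, only so the snippet is self-contained.
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "smoker": ["yes", "no", "no", "yes"],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0],
})

df.info()                # prints column names, non-null counts, and dtypes
summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns only)
print(summary)
```

By default, `describe` skips text columns such as `smoker`, which is why a separate count of the categorical values is useful.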

It is also important to check the counts for each variable:
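One way to read this step is a `value_counts` call per categorical column. The sketch below assumes the columns `sex`, `smoker`, and `region` hold text labels; the rows are made-up placeholders.

```python
import pandas as pd

# Made-up placeholder rows, only so the snippet is self-contained.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "region": ["southeast", "southwest", "southeast", "northeast", "northwest"],
})

# Count how often each category appears in each column.
counts = {col: df[col].value_counts() for col in ["sex", "smoker", "region"]}
for col, col_counts in counts.items():
    print(col_counts, "\n")
```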

Before proceeding with the analysis, I checked whether there are any null values. Using seaborn, we can see from the graph below that there are no missing values in the data.
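A possible sketch of the null check. The helper names are hypothetical, and the heatmap is one common way to draw such a graph with seaborn; it may not match the original figure exactly.

```python
import pandas as pd
import matplotlib.pyplot as plt


def null_counts(df):
    """Number of missing values per column."""
    return df.isnull().sum()


def plot_null_heatmap(df):
    """Heatmap of missing cells; a single uniform colour means nothing is missing."""
    import seaborn as sns  # imported here so null_counts works without seaborn

    fig, ax = plt.subplots()
    sns.heatmap(df.isnull(), cbar=False, yticklabels=False, ax=ax)
    ax.set_title("Missing values")
    return ax
```

If `null_counts(df)` returns zero for every column, the heatmap is redundant but still makes a quick visual confirmation.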

Exploratory Data Analysis

From the distribution graph above, we can see that most of the charges are concentrated at around 10,000 and below.
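A distribution graph like the one discussed here could be drawn with a plain Matplotlib histogram. This is a sketch with hypothetical helper names; the 10,000 threshold comes from the observation above.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_charges_distribution(df, bins=30):
    """Histogram of the charges column."""
    fig, ax = plt.subplots()
    ax.hist(df["charges"], bins=bins)
    ax.set_xlabel("charges")
    ax.set_ylabel("count")
    ax.set_title("Distribution of charges")
    return ax


def share_at_or_below(df, limit=10_000):
    """Fraction of rows whose charges are at or below the limit."""
    return (df["charges"] <= limit).mean()
```

Calling `plt.show()` after `plot_charges_distribution(df)` displays the figure, and `share_at_or_below(df)` puts a number on "mostly".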

Here, I analyzed each variable that could have an effect on the charges:

By Age:

From the graph, we can see that the charges increase with age. We can also see that most of the people who take out insurance are in their 20s.
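The two observations about age can be checked with a scatter plot of charges against age and a histogram of ages. The sketch below uses hypothetical helper names and assumes the columns are called `age` and `charges`.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_age(df):
    """Left: charges against age. Right: how many people fall in each age range."""
    fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
    ax_scatter.scatter(df["age"], df["charges"], s=10, alpha=0.5)
    ax_scatter.set_xlabel("age")
    ax_scatter.set_ylabel("charges")
    ax_hist.hist(df["age"], bins=20)
    ax_hist.set_xlabel("age")
    ax_hist.set_ylabel("count")
    return ax_scatter, ax_hist


def age_charges_correlation(df):
    """Pearson correlation between age and charges."""
    return df["age"].corr(df["charges"])
```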

By Sex:

We can see that the data has approximately equal counts of females and males. However, on average, males are charged slightly more than females.
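Both the counts and the average charge per group can come from a single `groupby`. This is a sketch; `charges_by_sex` is a hypothetical helper name.

```python
import pandas as pd


def charges_by_sex(df):
    """Row count and mean charges for each value of the sex column."""
    return df.groupby("sex")["charges"].agg(["count", "mean"])
```

The resulting table can be turned into a bar chart with `charges_by_sex(df)["mean"].plot(kind="bar")`.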

By Smokers:

Smokers tend to be charged much more than non-smokers. Smokers are usually charged over 30,000, while charges below 15,000 generally belong to non-smokers.
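A box plot per group is one way to compare the two distributions. The sketch below assumes a `smoker` column holding the group labels, and the helper names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_charges_by_smoker(df):
    """Box plot of charges, one box per value of the smoker column."""
    fig, ax = plt.subplots()
    df.boxplot(column="charges", by="smoker", ax=ax)
    ax.set_ylabel("charges")
    return ax


def charges_summary_by_smoker(df):
    """Count, mean, quartiles, and extremes of charges for each smoker group."""
    return df.groupby("smoker")["charges"].describe()
```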

By BMI:

BMI is used as an indicator of an individual's health risk; however, this data does not show BMI having much of an effect on the charges.
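To put a rough number on how weak or strong the relationship is, a scatter plot can be paired with the Pearson correlation. This sketch assumes the column is named `bmi`; the helper names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_bmi(df):
    """Scatter plot of charges against BMI."""
    fig, ax = plt.subplots()
    ax.scatter(df["bmi"], df["charges"], s=10, alpha=0.5)
    ax.set_xlabel("bmi")
    ax.set_ylabel("charges")
    return ax


def bmi_charges_correlation(df):
    """Pearson correlation between BMI and charges (values near 0 mean little linear effect)."""
    return df["bmi"].corr(df["charges"])
```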

By Region:

We can see from the graph that individuals from the Southeast have a wider range of charges, while people from the Southwest, Northwest, and Northeast are mostly charged less than those in the Southeast.
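The spread per region can be summarized numerically as well as plotted. The following sketch assumes a `region` column; the helper names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt


def charges_range_by_region(df):
    """Min, median, max, and range of charges per region, widest range first."""
    stats = df.groupby("region")["charges"].agg(["min", "median", "max"])
    stats["range"] = stats["max"] - stats["min"]
    return stats.sort_values("range", ascending=False)


def plot_charges_by_region(df):
    """Box plot of charges, one box per region."""
    fig, ax = plt.subplots()
    df.boxplot(column="charges", by="region", ax=ax)
    ax.set_ylabel("charges")
    return ax
```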

Conclusion

Smoking is the biggest factor affecting insurance charges among the variables we have analyzed. Being a smoker can increase the charges by approximately 25,000 regardless of age. After that, the charges increase with age and BMI (higher health risk).
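One way to check the "regardless of age" claim is to compare mean charges for smokers and non-smokers within age bands. The sketch below is an assumption-laden illustration: the band edges are arbitrary, the helper name is hypothetical, and it assumes the `smoker` column uses the labels "yes" and "no".

```python
import pandas as pd


def smoker_gap_by_age_band(df, bins=(17, 30, 45, 65)):
    """Mean charges per age band and smoker group, plus the smoker minus non-smoker gap."""
    bands = pd.cut(df["age"], bins=list(bins))  # e.g. (17, 30], (30, 45], (45, 65]
    means = (
        df.groupby([bands, "smoker"], observed=True)["charges"]
        .mean()
        .unstack("smoker")
    )
    means["gap"] = means["yes"] - means["no"]  # assumes "yes"/"no" labels
    return means
```

If the `gap` column stays roughly constant across the bands, that supports reading the smoker effect as a fixed increase on top of the age trend.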
