Data visualization with Python and JavaScript

Diving into exploratory data analysis with Python, a JavaScript library for data visualization, and Jupyter

Veronika Rovnik
Python in Plain English


Any data science or data analytics project can be generally described with the following steps:

  1. Acquiring a business understanding & defining the goal of a project
  2. Getting data
  3. Preprocessing and exploring data
  4. Improving data, e.g., by feature engineering
  5. Visualizing data
  6. Building a model
  7. Deploying the model
  8. Scoring its performance

This time, I would like to bring your attention to the data cleaning and exploration phase, since it’s a step whose value is hard to measure, but whose impact is difficult to overestimate. Insights gained during this stage can affect all further work.

There are multiple ways to start exploratory data analysis:

  1. Load and preprocess the data: clean it of unnecessary artifacts and deal with missing values. Make your dataset comfortable to work with.
  2. Visualize as much data as possible using different kinds of plots & a pivot table.

Purpose

In this tutorial, I would like to show how to prepare your data with Python and explore it using a JavaScript library for data visualization. To get the most value out of exploration, I recommend using interactive visualizations since they make exploring your data faster and more comfortable.

Hence, we will present data in an interactive pivot table and pivot charts.

Hopefully, this approach will help you streamline the data analysis and visualization process in Jupyter Notebook.

Set up your environment

Run your Jupyter Notebook and let’s start. If Jupyter is not installed on your machine, choose a way to install it.
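
For instance, one common way to get the classic Jupyter Notebook is via pip (shown here just as one option; JupyterLab works equally well):

pip install notebook
jupyter notebook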

Get your data

Choosing the dataset to work with is the number one step.

If your data is already cleaned and ready to be visualized, jump to the Visualization section.

For demonstration purposes, I’ve chosen the Bike Sharing Demand dataset. It’s provided as data for a Kaggle competition.

Imports for this tutorial

Classically, we will use the “pandas” library to read data into a dataframe.

Additionally, we will need the json and IPython.display modules. The former will help us serialize and deserialize data, and the latter will render HTML in the cells.

Here’s the full code sample with imports we need:

from IPython.display import HTML
import json
import pandas as pd

Read data

df = pd.read_csv('train.csv')

Clean & preprocess data

Before starting data visualization, it’s a good practice to see what’s going on in the data.

df.head()  # preview the first rows of the dataset

df.info()  # check column types and non-null counts

First, we should check the percentage of missing values.

missing_percentage = df.isnull().sum() * 100 / len(df)  # share of NaNs per column, in percent

There are a lot of strategies to follow when dealing with missing data. Let me mention the main ones:

  1. Dropping missing values. This approach only makes sense when you need to quickly remove all NaNs from the data.
  2. Replacing NaNs with values. This is called imputation. A common decision is to replace missing values with zeros or with a mean value (see the sketch after this list).
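
Here is a minimal sketch of both strategies in pandas; note that “some_column” is a hypothetical column name used purely for illustration:

# Strategy 1: drop every row that contains at least one NaN
df_without_nans = df.dropna()

# Strategy 2: impute a numeric column with its mean value
df['some_column'] = df['some_column'].fillna(df['some_column'].mean())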

Luckily, we don’t have any missing values in the dataset. But if your data does, I suggest you look into a quick guide with the pros and cons of different imputation techniques.

Manage features data types

Let’s convert the type of the “datetime” column from object to datetime:

df['datetime'] = pd.to_datetime(df['datetime'])

Now we are able to engineer new features based on this column, for example:

  • a day of the week
  • a month
  • an hour

df['weekday'] = df['datetime'].dt.dayofweek
df['hour'] = df['datetime'].dt.hour
df['month'] = df['datetime'].dt.month

These features can be used further to figure out trends in rent.

Next, let’s convert string types to categorical:

categories = ['season', 'workingday', 'weekday', 'hour', 'month', 'weather', 'holiday']
for category in categories:
    df[category] = df[category].astype('category')

Read more about when to use the categorical data type here.

Now, let’s make the values of categorical features more meaningful by replacing numbers with their categorical equivalents:

df['season'] = df['season'].replace([1, 2, 3, 4], ['spring', 'summer', 'fall', 'winter'])
df['holiday'] = df['holiday'].replace([0, 1], ['No', 'Yes'])

By doing so, it will be easier for us to interpret data visualization later on. We won’t need to look up the meaning of a category each time we need it.

Visualize data with a pivot table and charts

Now that you’ve cleaned the data, let’s visualize it.

The data visualization type depends on the question you are asking.

In this tutorial, we’ll be using:

  • a pivot table for tabular data visualization
  • a bar chart

Prepare data for the pivot table

Before loading data to the pivot table, convert the dataframe to an array of JSON objects. For this, use the dataframe’s to_json() method.

The records orientation is needed to make sure the data is aligned according to the format the pivot table requires.
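
With this orientation, each dataframe row becomes a single JSON object, roughly like this (the values below are purely illustrative):

[
    {"season": "spring", "weekday": 5, "count": 16},
    {"season": "spring", "weekday": 5, "count": 40}
]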

json_data = df.to_json(orient="records")

Create a pivot table

Next, define a pivot table object and feed it with the data. Note that the data has to be deserialized using the loads() function that decodes JSON:

pivot_table = {
    "container": "#pivot-container",
    "componentFolder": "https://cdn.flexmonster.com/",
    "toolbar": True,
    "report": {
        "dataSource": {
            "type": "json",
            "data": json.loads(json_data)
        },
        "slice": {
            "rows": [{"uniqueName": "weekday"}],
            "columns": [{"uniqueName": "[Measures]"}],
            "measures": [{
                "uniqueName": "count",
                "aggregation": "median"
            }],
            "sorting": {
                "column": {
                    "type": "desc",
                    "tuple": [],
                    "measure": {
                        "uniqueName": "count",
                        "aggregation": "median"
                    }
                }
            }
        }
    }
}

In the above pivot table initialization, we specified a simple report that consists of a data source and a slice (the set of fields visible on the grid); options, formats, and more can be added as well. We also specified a container where the pivot table should be rendered. The container will be defined a bit later.

Plus, here we can add a mapping object to prettify the field captions or set their data types. Using this object eliminates the need to modify the data source.
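
For instance, a sketch of such a mapping could be added to the dataSource section of the report. The caption below is just an illustrative choice, and the exact structure is my reading of Flexmonster’s mapping object:

"dataSource": {
    "type": "json",
    "data": json.loads(json_data),
    "mapping": {
        "count": {
            "caption": "Total Rentals",  # hypothetical caption, for illustration
            "type": "number"
        }
    }
}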

Next, convert the pivot table object to a JSON-formatted string to be able to pass it for rendering in the HTML layout:

pivot_json_object = json.dumps(pivot_table)

Define a dashboard layout

Define a function that renders the pivot table in the cell:

In this function, we call HTML() from the IPython.display module, which renders the layout enclosed in a multi-line string.
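
A minimal sketch of such a function might look like the following; the CDN file names (flexmonster.js, flexmonster.min.css) are assumptions based on the componentFolder URL used above:

def render_pivot_table(pivot_json):
    # Load the Flexmonster script and styles, create the container
    # the configuration expects ("#pivot-container"), and instantiate
    # the component with our JSON-encoded pivot table object.
    layout = '''
    <link href="https://cdn.flexmonster.com/flexmonster.min.css" rel="stylesheet"/>
    <script src="https://cdn.flexmonster.com/flexmonster.js"></script>
    <div id="pivot-container"></div>
    <script>
        new Flexmonster({config});
    </script>
    '''.format(config=pivot_json)
    return HTML(layout)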

Next, let’s call this function and pass to it the pivot table previously encoded into JSON:

render_pivot_table(pivot_json_object)

Likewise, you can create and render as many data visualization components as you need, for example, interactive pivot charts that visualize aggregated data.
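
As a sketch of this idea, the same configuration can be reused with an options object that switches the component to charts mode. The viewType and chart settings below are my assumptions about Flexmonster’s report options, used here for illustration:

import copy

# Reuse the same data source and slice; a distinct container id per
# component would be cleaner, but we keep the sketch minimal here.
pivot_charts = copy.deepcopy(pivot_table)
pivot_charts["report"]["options"] = {
    "viewType": "charts",       # switch from the grid to charts mode
    "chart": {"type": "bar"}    # assumed chart type, for illustration only
}
render_pivot_table(json.dumps(pivot_charts))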

What’s next

Now that you’ve embedded the pivot table into Jupyter, it’s time to start exploring your data:

  • drag and drop fields to rows, columns, and measures of the pivot table
  • set Excel-like filtering
  • highlight important values with conditional formatting

At any moment, you can save your results to a JSON or PDF/Excel/HTML report.

Examples

Here is how you can try identifying trends in bike usage depending on the day of the week.

You can also figure out whether any weather conditions affect the number of rentals by registered and unregistered users.

To dig deeper into the data, drill through aggregated values by double-clicking a cell and see the raw records they are composed of.

Or simply switch to the pivot charts mode and give your data an even more comprehensible look.

Bringing it all together

By completing this tutorial, you learned a new way to interactively explore your multi-dimensional data in Jupyter Notebook using Python and a JavaScript library for data visualization. I hope this approach will make your exploration process more insightful than before.
