Intro to Data Structuring, .set_index() & Seaborn Lineplots

An overview of data types and how they impact the structure of a dataset.

Kat Hernandez
Python in Plain English


Cleaning out my garden the other week, I was reminded of a concept I learned in school. My daffodils were choking each other for space in the front corner of my garden, so I dug them out and replanted them in a grid. As I was dividing the bulbs, I thought of a split-plot-design research paper I wrote in grad school, and of how that research gave me a comprehensive understanding of data structure.

Photo by Ben Collins on Unsplash

First, let me provide an overview of data types and how they impact the structure of a dataset. This will provide an understanding of why these data types are important for data visualization.

Staying true to the gardening theme, here we have a snapshot of a simulated dataset of two types of flowers: roses and daffodils. On the first of each month, each flower, identified by its Flower Key, has its petal length measured and its color recorded as an RGB color code.
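A dataset of this shape could be simulated with pandas along the following lines. This is only a sketch: the column names, the flower keys, the January-to-June date range, and every measurement are made up for illustration and are not the values from my table.

```python
import pandas as pd

# Assumption: six monthly measurements taken on the first of each month.
dates = pd.date_range("2021-01-01", periods=6, freq="MS")  # "MS" = month start

# Hypothetical flower keys mapped to their flower type.
flowers_by_key = {"R1": "rose", "R2": "rose", "D1": "daffodil", "D2": "daffodil"}

# Made-up starting petal lengths and RGB color codes.
start_length = {"R1": 3.0, "R2": 3.4, "D1": 2.0, "D2": 2.2}
color_code = {"rose": "255,0,0", "daffodil": "255,255,0"}

rows = []
for key, flower_type in flowers_by_key.items():
    for month_number, date in enumerate(dates):
        rows.append(
            {
                "flower_key": key,          # data key (identifier)
                "flower_type": flower_type, # binary category
                "date": date,               # chronological column
                # Data value: grows by a fixed, invented amount each month.
                "petal_length": round(start_length[key] + 0.25 * month_number, 2),
                "color_code": color_code[flower_type],  # data value
            }
        )

flower = pd.DataFrame(rows)
print(flower.head())
```

Each row is one measurement of one flower on one date, which is the layout the rest of this post assumes.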

For the purposes of this dataset, the Petal Length and Color Code columns are the data values. They are considered data values because they are the smallest units of data in the dataset. The Flower Keys are data keys. Keys are identifiers: they do not provide value outside of linking data, and they are often synonymous with index values. Keys can be unique to their own dataset, also known as local keys, or they can be universal keys. A local key would only identify my flowers in this dataset. To make my Flower Keys universal, two conditions must be met:

  1. The Flower Key is still unique when combined with other datasets.
  2. When looking up my flowers in another dataset, the Flower Key can be used to find additional information about that flower.
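The two conditions above can be checked roughly like this in pandas. The three small DataFrames, their column names, and their contents are hypothetical and exist only to make the sketch runnable.

```python
import pandas as pd

# Hypothetical datasets: my flowers, someone else's flowers,
# and a separate table holding extra information per flower key.
my_flowers = pd.DataFrame(
    {"flower_key": ["R1", "D1"], "flower_type": ["rose", "daffodil"]}
)
other_flowers = pd.DataFrame(
    {"flower_key": ["T1", "L1"], "flower_type": ["tulip", "lily"]}
)
planting_info = pd.DataFrame(
    {"flower_key": ["R1", "D1", "T1"], "bed": ["front", "front", "back"]}
)

# Condition 1: keys are still unique after combining the datasets.
combined = pd.concat([my_flowers, other_flowers], ignore_index=True)
keys_stay_unique = combined["flower_key"].is_unique

# Condition 2: the key finds additional information in another dataset.
# validate="one_to_one" makes pandas raise if a key repeats on either side.
enriched = my_flowers.merge(
    planting_info, on="flower_key", how="left", validate="one_to_one"
)
every_flower_found = enriched["bed"].notna().all()

print(keys_stay_unique, every_flower_found)
print(enriched)
```

If either check fails, the key is at best a local key for this one dataset.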

For more information on getting started with creating unique keys, check out my post on unique IDs here.

My Flower Type column is binary; only two types of flowers exist in this data. Because it is limited to two possible values, binary data works well for side-by-side comparison in a plot, with a legend distinguishing the two groups. Binary data is also useful for hierarchical indexing, that is, grouping and sorting data for further analysis.

The Date column is ordinal data, which follows a chronological order. For plotting purposes, ordinal data makes for a great x-axis to see how the data fluctuates over time. It is often combined with other columns in a hierarchical index. Let’s take a look at the data with a simple hierarchical index applied:

By using Flower Type as the primary index, we now have a grouped view of both flower types. With Date as the secondary index, the flowers' respective Petal Lengths and Color Codes can be sorted chronologically in visualizations.
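A two-level index like the one described here can be built with `.set_index()`. The sketch below assumes column names of my own choosing (`flower_type`, `date`, `petal_length`) and a handful of invented rows.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample rows.
flower = pd.DataFrame(
    {
        "flower_key": ["R1", "D1", "R1", "D1"],
        "flower_type": ["rose", "daffodil", "rose", "daffodil"],
        "date": pd.to_datetime(
            ["2021-02-01", "2021-02-01", "2021-01-01", "2021-01-01"]
        ),
        "petal_length": [3.2, 2.3, 3.0, 2.0],
    }
)

# Flower type becomes the primary index level, date the secondary level.
# sort_index() orders rows by flower type, then chronologically.
indexed = flower.set_index(["flower_type", "date"]).sort_index()
print(indexed)

# Selecting on the primary level returns all rows for one flower type.
roses = indexed.loc["rose"]

# Aggregating over the primary level gives one value per flower type.
mean_by_type = indexed.groupby(level="flower_type")["petal_length"].mean()
print(mean_by_type)
```

Note that `.set_index()` on its own only relabels the rows; any aggregation, such as the mean above, is a separate step.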

We now understand what types of data we have in our dataset and have indexed the data accordingly. Based on that information, we can now quickly create a visualization that tells us the most comprehensive data story. See below:

The above plot gives us a comparison of the two flower types and how their petal lengths varied over the course of the 6-month study. The lines on the plot follow the mean Petal Length for each flower type in each month. The shaded areas show the uncertainty around each monthly mean (by default, Seaborn draws a 95% confidence interval), so a wider band indicates that the measurements varied more. As an example, the Petal Length for the roses varied much more at the 6/1/2021 measurement than at the 4/1/2021 measurement.
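To see the numbers behind a plot like this, the monthly mean and spread can be computed directly. The measurements below are invented so that the roses vary more in June than in April; they are not the values from my dataset, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical measurements: three flowers per type on each of two dates.
flower = pd.DataFrame(
    {
        "flower_type": ["rose"] * 6 + ["daffodil"] * 6,
        "date": pd.to_datetime(
            ["2021-04-01"] * 3 + ["2021-06-01"] * 3
            + ["2021-04-01"] * 3 + ["2021-06-01"] * 3
        ),
        "petal_length": [
            3.9, 4.0, 4.1,   # roses in April: tightly clustered
            3.5, 4.5, 5.5,   # roses in June: widely spread
            2.4, 2.5, 2.6,   # daffodils in April
            2.8, 3.0, 3.2,   # daffodils in June
        ],
    }
)

# One row per (flower type, date): the mean is what the line follows,
# and the standard deviation is one simple measure of the spread.
summary = flower.groupby(["flower_type", "date"])["petal_length"].agg(
    ["mean", "std"]
)
print(summary)
```

The standard deviation is used here only as an easy-to-read stand-in for spread; it is not the same quantity as the confidence band that Seaborn draws.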

The code to generate the visual calls on a few packages. Seaborn is a data visualization package for Python that provides a wealth of tools. It’s the main engine behind this line plot. We will explore the capabilities of Seaborn further in later posts. Seaborn runs with Matplotlib in the background, hence the two lines referencing Matplotlib. The remainder of the code to generate the Seaborn plot is relatively straightforward. In one line of code, we call on the dataset named “flower” and use the ordinal x value of “date” to show the fluctuation of Petal Length over time. For y, we visualize the Petal Length. Finally, we use the binary Flower Type data to show the comparison of the two flowers in the legend.

By identifying the data correctly, we’ve taken something as beautiful as a rose and as vibrant as a daffodil and turned it into something (almost) as beautiful: a Seaborn visualization.
