Dataclasses: Your Secret Weapon for Productive Data Science

Data classes are a powerful tool that can help you save time, write more concise code, and make your data more reusable.

Ravish Kumar
Python in Plain English


Dataclasses are a special type of class in Python that provides a number of benefits for data scientists and data engineers. These benefits include:

  • Automatic generation of the __init__, __repr__, and __eq__ methods
  • Type hints
  • Optional immutability (with frozen=True)

In this article, I will explain what data classes are, why they are useful, and how they can be used in data science and data engineering. I will also show examples that use real data: the Avocado Prices dataset from Kaggle.

What are data classes?

Data classes are a feature introduced in Python 3.7 that allows you to create classes that are mainly used to store data. They are similar to named tuples but with more flexibility and functionality.

To create a data class, you need to use the @dataclass decorator from the dataclasses module. This decorator will automatically generate some special methods for your class, such as __init__, __repr__, and __eq__, based on the fields you define.

For example, let’s create a simple data class to store information about a person:

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    address: str

This will create a class called Person with three fields: name, age, and address. The type annotations are required, because @dataclass uses them to discover the fields, but Python does not enforce them at runtime. They document the expected types and let tools like mypy or PyCharm check your code.

Now, let’s create an instance of this class:

person = Person(name="Alice", age=25, address="123 Main Street")

This will call the __init__ method that was automatically generated by the @dataclass decorator. This method assigns the values of the arguments to the corresponding fields of the instance.

We can also print the instance using the __repr__ method generated by the decorator. This method returns a string representation of the instance that shows the class name and the field values.

print(person)
# Output: Person(name='Alice', age=25, address='123 Main Street')

We can also compare two instances of the same class using the __eq__ method that was generated by the decorator. This method returns True if all the fields of the instances are equal, and False otherwise.

person1 = Person(name="Alice", age=25, address="123 Main Street")
person2 = Person(name="Bob", age=30, address="456 Main Street")

print(person1 == person2)
# Output: False
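
The comparison above returns False because the fields differ. As a minimal sketch (FrozenPerson is a hypothetical name, not part of the original example), two instances with identical fields compare equal, and passing frozen=True to the decorator makes instances immutable, which is how data classes provide the immutability mentioned in the introduction:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class FrozenPerson:  # hypothetical immutable variant of Person
    name: str
    age: int
    address: str

alice = FrozenPerson(name="Alice", age=25, address="123 Main Street")
alice_copy = FrozenPerson(name="Alice", age=25, address="123 Main Street")

# __eq__ compares field by field, so identical data means equal instances
print(alice == alice_copy)
# Output: True

# Assigning to a field of a frozen instance raises FrozenInstanceError
try:
    alice.age = 26
except FrozenInstanceError:
    print("cannot modify a frozen instance")
# Output: cannot modify a frozen instance
```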

Why use data classes?

Data classes are useful for data scientists and data engineers because they can help you to:

  • Store data in a structured format
  • Serialize and deserialize data
  • Test data
  • Document data

Let’s see how these benefits can be applied in practice.

Storing data

Data classes can be used to store data in a structured format that is easy to access and manipulate. This can be useful for storing data that will be used in machine learning models, for example.

For instance, let’s say we want to store information about avocados, such as their date, average price, total volume, type, and region. We can use a data class to create a custom type for avocados:

from dataclasses import dataclass
from datetime import date

@dataclass
class Avocado:
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

Now, we can create instances of this class using real data from Kaggle. The Avocado Prices dataset contains historical data on avocado prices and sales volume in multiple US markets.

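We can use pandas to read the CSV file and convert it to a list of Avocado objects. Passing parse_dates makes pandas parse the Date column, so each value can be converted to a date object:

import pandas as pd

df = pd.read_csv("avocado.csv", parse_dates=["Date"])

avocados = [
    Avocado(
        date=row["Date"].date(),  # pandas Timestamp -> datetime.date
        average_price=row["AveragePrice"],
        total_volume=row["Total Volume"],
        type=row["type"],
        region=row["region"],
    )
    for _, row in df.iterrows()
]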
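
If pandas is not otherwise needed, the same conversion can be sketched with the standard-library csv module. This assumes the column names used in the pandas code (Date, AveragePrice, Total Volume, type, region). The in-memory CSV text stands in for avocado.csv, and load_avocados is a hypothetical helper name:

```python
import csv
import io
from dataclasses import dataclass
from datetime import date

@dataclass
class Avocado:
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

def load_avocados(csv_file):
    """Build a list of Avocado objects from an open CSV file (hypothetical helper)."""
    return [
        Avocado(
            date=date.fromisoformat(row["Date"]),  # "2015-12-27" -> date(2015, 12, 27)
            average_price=float(row["AveragePrice"]),
            total_volume=float(row["Total Volume"]),
            type=row["type"],
            region=row["region"],
        )
        for row in csv.DictReader(csv_file)
    ]

# Two sample rows; with the real file, use open("avocado.csv", newline="") instead
sample_csv = io.StringIO(
    "Date,AveragePrice,Total Volume,type,region\n"
    "2015-12-27,1.33,64236.62,conventional,Albany\n"
    "2015-12-20,1.35,54876.98,conventional,Albany\n"
)
sample_avocados = load_avocados(sample_csv)
print(sample_avocados[0])
# Output: Avocado(date=datetime.date(2015, 12, 27), average_price=1.33, total_volume=64236.62, type='conventional', region='Albany')
```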

Now we have a list of avocados that we can use for further analysis or modelling. For example, we can filter the list by type or region using list comprehensions:

organic_avocados = [avocado for avocado in avocados if avocado.type == "organic"]
california_avocados = [avocado for avocado in avocados if avocado.region == "California"]

We can also access the fields of each avocado using dot notation:

first_avocado = avocados[0]
print(first_avocado.date)
# Output: 2015-12-27
print(first_avocado.average_price)
# Output: 1.33
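
Since type hints are not checked when an instance is created, bad rows in a CSV file can slip through unnoticed. One option, sketched here under the assumption that prices and volumes should never be negative (a rule chosen for illustration, not taken from the dataset), is a __post_init__ method, which the generated __init__ calls after setting the fields. CheckedAvocado is a hypothetical name:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CheckedAvocado:  # hypothetical validated variant of Avocado
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

    def __post_init__(self):
        # Runs right after the generated __init__ has set the fields
        if self.average_price < 0 or self.total_volume < 0:
            raise ValueError("average_price and total_volume must be non-negative")

try:
    CheckedAvocado(
        date=date(2015, 12, 27),
        average_price=-1.0,  # deliberately invalid
        total_volume=64236.62,
        type="conventional",
        region="Albany",
    )
except ValueError as error:
    print(error)
# Output: average_price and total_volume must be non-negative
```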

Serializing and deserializing data

Dataclasses can be easily serialized and deserialized, which makes them ideal for working with data in files or databases. Serialization is the process of converting an object into a format that can be stored or transmitted, such as JSON or pickle. Deserialization is the reverse process of converting a serialized format back into an object.

For example, let’s say we want to save our list of avocados to a file using pickle, a standard-library module that implements binary protocols for serializing and deserializing Python objects.

We can use the pickle.dumps function to serialize our list of avocados into a bytes object:

import pickle

serialized_avocados = pickle.dumps(avocados)

We can then use the pickle.loads function to deserialize the bytes object back into a list of avocados:

deserialized_avocados = pickle.loads(serialized_avocados)

assert deserialized_avocados == avocados
# No output: the assertion passes silently
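
The calls above keep the serialized bytes in memory. To write them to a file, pickle.dump and pickle.load work with an open binary file. A minimal sketch, using a temporary directory and two sample records in place of the full dataset (the file name avocados.pkl is an arbitrary choice):

```python
import pickle
import tempfile
from dataclasses import dataclass
from datetime import date
from pathlib import Path

@dataclass
class Avocado:
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

sample_avocados = [
    Avocado(date(2015, 12, 27), 1.33, 64236.62, "conventional", "Albany"),
    Avocado(date(2015, 12, 20), 1.35, 54876.98, "conventional", "Albany"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "avocados.pkl"  # arbitrary file name

    # Write the list to disk in pickle's binary format
    with path.open("wb") as f:
        pickle.dump(sample_avocados, f)

    # Read it back into a new list of Avocado objects
    with path.open("rb") as f:
        loaded_avocados = pickle.load(f)

print(loaded_avocados == sample_avocados)
# Output: True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.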

We can also use other formats for serialization and deserialization, such as JSON. However, we need to use some custom functions to handle the conversion of date objects, since JSON does not support date types natively.

We can use the datetime module to convert date objects to strings in the ISO format, and vice versa:

from datetime import datetime

def date_to_str(date_obj):
    return date_obj.isoformat()

def str_to_date(date_str):
    return datetime.fromisoformat(date_str).date()

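We can then use the json module to serialize and deserialize our list of avocados. Since json cannot encode dataclass instances directly, we first convert each avocado to a dictionary with dataclasses.asdict. We then pass the custom functions to json.dumps and json.loads, using the default and object_hook parameters respectively:

import json
from dataclasses import asdict

serialized_avocados = json.dumps([asdict(avocado) for avocado in avocados], default=date_to_str)
deserialized_avocados = json.loads(
    serialized_avocados,
    # Replace the ISO string under "date" with a date object before building the Avocado
    object_hook=lambda d: Avocado(**{**d, "date": str_to_date(d["date"])}),
)

assert deserialized_avocados == avocados
# No output: the assertion passes silently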
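
To make the intermediate format concrete, here is a small self-contained sketch of what one serialized record looks like, using dataclasses.asdict and an ISO date string. The sample values are the Albany row used in the testing section of this article:

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

@dataclass
class Avocado:
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

avocado = Avocado(
    date=date(2015, 12, 27),
    average_price=1.33,
    total_volume=64236.62,
    type="conventional",
    region="Albany",
)

# asdict turns the instance into a plain dict; default converts the date field
record = json.dumps(asdict(avocado), default=lambda d: d.isoformat())
print(record)
# Output: {"date": "2015-12-27", "average_price": 1.33, "total_volume": 64236.62, "type": "conventional", "region": "Albany"}

# Rebuild the instance by converting the ISO string back to a date
data = json.loads(record)
restored = Avocado(**{**data, "date": date.fromisoformat(data["date"])})
print(restored == avocado)
# Output: True
```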

Testing data

Dataclasses can be used to create test data that is easy to generate and verify. Testing is an integral part of data science and data engineering, as it can help you to ensure the quality and reliability of your data and code.

For example, let’s say we want to test a function that calculates the average price of avocados by type. We can use a data class to create a test case that contains the input data, the expected output, and a name for the test:

from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    input_data: list
    expected_output: dict

We can then create instances of this class using some sample data:

test_case_1 = TestCase(
    name="Test case 1",
    input_data=[
        Avocado(date=date(2015, 12, 27), average_price=1.33, total_volume=64236.62, type="conventional", region="Albany"),
        Avocado(date=date(2015, 12, 20), average_price=1.35, total_volume=54876.98, type="conventional", region="Albany"),
        Avocado(date=date(2015, 12, 27), average_price=1.49, total_volume=118220.22, type="organic", region="Albany"),
        Avocado(date=date(2015, 12, 20), average_price=1.53, total_volume=126497.42, type="organic", region="Albany"),
    ],
    expected_output={"conventional": 1.34, "organic": 1.51},
)

test_case_2 = TestCase(
    name="Test case 2",
    input_data=[
        Avocado(date=date(2015, 12, 27), average_price=0.93, total_volume=55979.78, type="conventional", region="PhoenixTucson"),
        Avocado(date=date(2015, 12, 20), average_price=0.95, total_volume=59776.87, type="conventional", region="PhoenixTucson"),
        Avocado(date=date(2015, 12, 27), average_price=1.19, total_volume=76111.27, type="organic", region="PhoenixTucson"),
        Avocado(date=date(2015, 12, 20), average_price=1.21, total_volume=98593.26, type="organic", region="PhoenixTucson"),
    ],
    expected_output={"conventional": 0.94, "organic": 1.2},
)
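
The calculate_average_price function under test is not shown in this article. A minimal sketch of what it might look like, assuming it groups avocados by type and rounds each mean to two decimal places so that the result can be compared exactly with the expected values:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Avocado:
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

def calculate_average_price(avocados):
    """Return {type: mean average_price}, rounded to 2 decimals (assumed behaviour)."""
    prices_by_type = defaultdict(list)
    for avocado in avocados:
        prices_by_type[avocado.type].append(avocado.average_price)
    return {
        avocado_type: round(sum(prices) / len(prices), 2)
        for avocado_type, prices in prices_by_type.items()
    }

albany = [
    Avocado(date(2015, 12, 27), 1.33, 64236.62, "conventional", "Albany"),
    Avocado(date(2015, 12, 20), 1.35, 54876.98, "conventional", "Albany"),
    Avocado(date(2015, 12, 27), 1.49, 118220.22, "organic", "Albany"),
    Avocado(date(2015, 12, 20), 1.53, 126497.42, "organic", "Albany"),
]
print(calculate_average_price(albany))
# Output: {'conventional': 1.34, 'organic': 1.51}
```

Rounding inside the function is a design choice made for this sketch. Without it, floating-point arithmetic can produce values that differ from the expected ones in the last digits, and assertAlmostEqual on each value would be the safer comparison.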

We can then use the unittest module to create a test class that inherits from unittest.TestCase and defines a test method. The method uses assertEqual to check that the output of our function matches the expected output of each test case:

import unittest

class TestCalculateAveragePrice(unittest.TestCase):

    def test_calculate_average_price(self):
        for test_case in [test_case_1, test_case_2]:
            with self.subTest(test_case.name):
                output = calculate_average_price(test_case.input_data)
                self.assertEqual(output, test_case.expected_output)

We can then run the test using the unittest.main function:

if __name__ == "__main__":
    unittest.main()

This will run the test and report the results. Because both test cases run as subtests of a single test method, unittest counts them as one test:

.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK
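
In a notebook or interactive session, unittest.main() is awkward because by default it parses the command-line arguments and exits the interpreter when it finishes. One alternative, sketched here with a small stand-in function (mean_price and TestMeanPrice are hypothetical names), is to load the test class and run it with a TextTestRunner. The result object also shows that subtests are grouped under a single test:

```python
import io
import unittest

def mean_price(prices):
    """Stand-in for the function under test (hypothetical)."""
    return round(sum(prices) / len(prices), 2)

class TestMeanPrice(unittest.TestCase):

    def test_mean_price(self):
        cases = {
            "conventional": ([1.33, 1.35], 1.34),
            "organic": ([1.49, 1.53], 1.51),
        }
        for name, (prices, expected) in cases.items():
            with self.subTest(name):
                self.assertEqual(mean_price(prices), expected)

# Load the tests from the class and run them without exiting the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMeanPrice)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)

# Both subtests belong to one test method, so testsRun is 1
print(result.testsRun, result.wasSuccessful())
# Output: 1 True
```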

Documenting data

Dataclasses can be used to generate documentation for your data, which can help to make your code more understandable and maintainable. Documentation is a crucial part of data science and data engineering, as it can help you and others to understand the purpose, structure, and meaning of your data and code.

One way to document your data classes is to use docstrings, which are special strings that describe the functionality of your classes, methods, or functions. Docstrings can be written in various formats, such as reStructuredText or Google style, and they can be parsed by tools like Sphinx or PyDoc to generate HTML or PDF documents.

For example, let’s add a docstring to our Avocado class using the Google style format:

from dataclasses import dataclass
from datetime import date

@dataclass
class Avocado:
    """A data class representing an avocado.

    Attributes:
        date: The date of observation.
        average_price: The average price of a single avocado.
        total_volume: The total number of avocados sold.
        type: The type of avocado, either conventional or organic.
        region: The region where the avocados were sold.
    """
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

This docstring explains what the class is, what attributes it has, and what they mean. We can use the help function to see the docstring in the interactive shell:

help(Avocado)
# Output:
Help on class Avocado in module __main__:

class Avocado(builtins.object)
| Avocado(date: datetime.date, average_price: float, total_volume: float, type: str, region: str)
|
| A data class representing an avocado.
|
| Attributes:
| date: The date of observation.
| average_price: The average price of a single avocado.
| total_volume: The total number of avocados sold.
| type: The type of avocado, either conventional or organic.
| region: The region where the avocados were sold.
|
| Methods defined here:
|
| __eq__(self, other)
| Return self==value.
|
| __init__(self, date: datetime.date, average_price: float, total_volume: float, type: str, region: str) -> None
| Initialize self. See help(type(self)) for accurate signature.
|
| __repr__(self)
| Return repr(self).
...

We can also use tools like Sphinx or PyDoc to generate HTML or PDF documentation from our docstrings. For example, we can use PyDoc to generate an HTML file that contains the documentation for our Avocado class:

import pydoc

pydoc.writedoc(Avocado)
# Output:
wrote Avocado.html

This will create a file called Avocado.html that contains the documentation for our class.
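
Docstrings have to be kept in sync with the fields by hand. As a complementary sketch, dataclasses.fields can list the field names and annotated types directly from the class, which is handy for a quick schema summary (describe_fields is a hypothetical helper name):

```python
from dataclasses import dataclass, fields
from datetime import date

@dataclass
class Avocado:
    """A data class representing an avocado."""
    date: date
    average_price: float
    total_volume: float
    type: str
    region: str

def describe_fields(cls):
    """Return one 'name: type' line per dataclass field (hypothetical helper)."""
    # getattr falls back to the raw annotation when it is a string without __name__
    return [f"{f.name}: {getattr(f.type, '__name__', f.type)}" for f in fields(cls)]

print("\n".join(describe_fields(Avocado)))
# Output:
# date: date
# average_price: float
# total_volume: float
# type: str
# region: str
```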

Conclusion

Data classes are a powerful tool that can simplify the work of data scientists and data engineers. They can help you store data in a structured format and serialize, deserialize, test, and document it. If you are working with data in Python, I encourage you to learn more about data classes and how they can be used to improve your code.

I hope you found this article helpful. If you have any feedback or questions, please let me know. 😊

