Python’s dataclasses module, added in 3.7, is a great way to create classes designed to hold data. Although data classes don’t do anything that a regular class couldn’t do, they remove a lot of boilerplate code and let you focus on the data.
If you aren’t already familiar with dataclasses, check out the docs. There are also plenty of great tutorials covering their features.
In this tutorial, we’re going to look at a way to write tools that extend dataclasses.
Let’s start with a simple dataclass that holds a user’s UUID, username, and email address.
    from dataclasses import dataclass, field
    import uuid


    @dataclass
    class UserData:
        username: str
        email: str
        _id: uuid.UUID = field(default_factory=uuid.uuid4)


    if __name__ == "__main__":
        username = input("Enter username: ")
        email = input("Enter your email address: ")
        data = UserData(username, email)
        print(data)
This is pretty simple. Ask the user for a username and an email address, then show them the shiny new data class instance that we made using their information. The class will, by default, generate a unique id for every user.
But what if we have sneaky users who might try giving an invalid email address, just to break things?
It’s simple enough to extend data classes to support field validation. dataclass is just a decorator that takes a class and adds various methods and attributes to it, so let’s make our own decorator that wraps it and adds behavior of our own.
    def validated_dataclass(cls):
        cls.__post_init__ = lambda self: print("Initializing!")
        cls = dataclass(cls)
        return cls


    @validated_dataclass
    class UserData:
        ...
Here, we add a simple __post_init__ method to the class, which the generated __init__ will call every time we instantiate the class. But how can we use this power to validate an email address?
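To see the hook fire, here is a minimal sketch of that decorator applied to a small class. Point is a hypothetical class used only for illustration. Note that the hook is attached before dataclass() runs, because dataclass decides whether the generated __init__ should call __post_init__ at decoration time.

```python
from dataclasses import dataclass


def validated_dataclass(cls):
    # Attach the hook first: dataclass() checks for __post_init__
    # when it generates __init__.
    cls.__post_init__ = lambda self: print("Initializing!")
    return dataclass(cls)


@validated_dataclass
class Point:  # hypothetical example class
    x: int
    y: int


p = Point(1, 2)  # the generated __init__ calls __post_init__ here
print(p)
```

Running this prints the message from the hook once per instance, followed by the usual data class repr.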
This is where the metadata argument of a field comes in. Basically, it’s a dict that we can set when defining a field in the data class. It’s completely ignored by the regular dataclass implementation, so we can use it to include information about the field for our own purposes.
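As a quick illustration of how that information can be read back, the sketch below attaches metadata to one field and then inspects every field with fields(). The Measurement class and the "unit" key are made up for this example; dataclasses itself attaches no meaning to them.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Measurement:  # hypothetical example class
    # "unit" is an arbitrary key chosen for this sketch.
    value: float = field(metadata={"unit": "metres"})
    label: str = "unnamed"


for f in fields(Measurement):
    # metadata is exposed as a read-only mapping; it is empty
    # for fields that were declared without any.
    print(f.name, dict(f.metadata))
```

The metadata lives on the Field objects rather than on instances, so it can be inspected from either the class or an instance.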
Here’s how UserData looks after adding a validator for the email field.
    from dataclasses import dataclass, field
    import uuid


    def validate_email(value):
        if "@" not in value:
            raise ValueError("There must be an '@' in your email!")
        return value


    @validated_dataclass
    class UserData:
        username: str
        email: str = field(metadata={"validator": validate_email})
        _id: uuid.UUID = field(default_factory=uuid.uuid4)
Now the email field of the data class will carry around that validator function, so that anyone can access it. Let’s update the decorator to make use of it.
    from dataclasses import dataclass, field, fields


    def validated_dataclass(cls):
        cls = dataclass(cls)

        def _set_attribute(self, attr, value):
            # Look for a validator in the metadata of the matching field.
            # The loop variable is named f so it doesn't shadow the imported field().
            for f in fields(self):
                if f.name == attr and "validator" in f.metadata:
                    value = f.metadata["validator"](value)
                    break
            object.__setattr__(self, attr, value)

        cls.__setattr__ = _set_attribute
        return cls
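The new decorator replaces the regular __setattr__ with a function that first looks at the metadata of the fields. If there is a validator function associated with the attribute, it calls the function and uses its return value as the value to set. Because the __init__ generated for a regular (non-frozen) data class sets each field through ordinary attribute assignment, the validator also runs when an instance is created.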
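Putting the pieces together, here is a self-contained sketch of the decorator and the validated UserData in use. The sample usernames and addresses are placeholders made up for this example.

```python
from dataclasses import dataclass, field, fields
import uuid


def validated_dataclass(cls):
    cls = dataclass(cls)

    def _set_attribute(self, attr, value):
        # Run the field's validator, if it has one, before storing the value.
        for f in fields(self):
            if f.name == attr and "validator" in f.metadata:
                value = f.metadata["validator"](value)
                break
        object.__setattr__(self, attr, value)

    cls.__setattr__ = _set_attribute
    return cls


def validate_email(value):
    if "@" not in value:
        raise ValueError("There must be an '@' in your email!")
    return value


@validated_dataclass
class UserData:
    username: str
    email: str = field(metadata={"validator": validate_email})
    _id: uuid.UUID = field(default_factory=uuid.uuid4)


user = UserData("alice", "alice@example.com")  # passes validation

try:
    user.email = "not-an-email"  # rejected on assignment
except ValueError as exc:
    print(exc)

try:
    UserData("bob", "bob.example.com")  # rejected during construction
except ValueError as exc:
    print(exc)
```

Both the bad assignment and the bad constructor call raise the validator’s ValueError, and the existing user keeps its original email.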
The power of this approach is that now anybody can validate fields on their data classes by importing this decorator and defining a validator function in the metadata of their field. It’s a drop-in replacement to extend any data class.
One downside to this is the performance cost. Even attributes that don’t need validation will run through the list of fields every time they’re set. In another article, I’ll look at how much of a cost this actually is, and explore some optimizations we can make to reduce the overhead.
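As one possible direction (a sketch only, with no claim about how much time it saves), the decorator could collect the validators into a dict once, when the class is decorated, so that each assignment does a single lookup instead of a loop over the fields. The Order class and validate_positive function are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field, fields


def validated_dataclass(cls):
    cls = dataclass(cls)
    # Build the name -> validator lookup once, at decoration time.
    validators = {
        f.name: f.metadata["validator"]
        for f in fields(cls)
        if "validator" in f.metadata
    }

    def _set_attribute(self, attr, value):
        validator = validators.get(attr)
        if validator is not None:
            value = validator(value)
        object.__setattr__(self, attr, value)

    cls.__setattr__ = _set_attribute
    return cls


def validate_positive(value):  # hypothetical validator
    if value <= 0:
        raise ValueError("Value must be positive!")
    return value


@validated_dataclass
class Order:  # hypothetical example class
    item: str
    quantity: int = field(metadata={"validator": validate_positive})


order = Order("widget", 3)
order.item = "gadget"  # no validator registered: one dict lookup, then set
```

The trade-off is that the lookup table is fixed when the decorator runs, so a subclass that adds validated fields would need to be decorated with validated_dataclass itself.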
Another downside is that setting metadata on every field can hurt readability. If that becomes a problem, you could try defining the metadata dict elsewhere, so the field would look like email: str = field(metadata=email_metadata).
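Here is a small sketch of that idea. The recovery_email field is a hypothetical addition, included only to show one metadata dict being shared by two fields. A plain dataclass is used to keep the example short, so nothing is validated here; the point is only where the metadata is defined.

```python
from dataclasses import dataclass, field, fields


def validate_email(value):
    if "@" not in value:
        raise ValueError("There must be an '@' in your email!")
    return value


# Defined once, away from the class body, and reused below.
email_metadata = {"validator": validate_email}


@dataclass
class UserData:
    username: str
    email: str = field(metadata=email_metadata)
    recovery_email: str = field(metadata=email_metadata)  # hypothetical field


for f in fields(UserData):
    print(f.name, "validator" in f.metadata)
```

Swapping dataclass for the validating decorator from earlier in the article would then apply the same validator to both fields.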
The possible uses of metadata are limitless! Combined with custom decorators that use dataclass behind the scenes, we can add all sorts of functionality to data classes.
For serious validation needs, it’s still likely to be better to use something like Pydantic or Marshmallow than to make your own. Each of them either supports data classes out of the box or has other packages available that add that support.
If you have any ideas for extending data classes, let me know in the comments!