Python 3.7: Introducing Data Classes
Python 3.7 is set to be released this summer, so let’s have a sneak peek at some of the new features! If you’d like to play along at home with PyCharm, make sure you get PyCharm 2018.1 (or later if you’re reading this from the future).
There are many new things in Python 3.7: various character set improvements, postponed evaluation of annotations, and more. One of the most exciting new features is support for the dataclass decorator.
What is a Data Class?
Most Python developers will have written many classes that look like this:
    class MyClass:
        def __init__(self, var_a, var_b):
            self.var_a = var_a
            self.var_b = var_b
Data classes help you by automatically generating dunder methods for simple cases, for example an __init__ that accepts those arguments and assigns each to self. The small example above could be rewritten like this:
    from dataclasses import dataclass

    @dataclass
    class MyClass:
        var_a: str
        var_b: str
A key difference is that type hints are actually required for data classes. If you’ve never used a type hint before: they allow you to mark what type a certain variable should be. At runtime, these types are not checked, but you can use PyCharm or a command-line tool like mypy to check your code statically.
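To make this concrete, here is a minimal sketch of what the decorator generates for the small class above. The sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MyClass:
    var_a: str
    var_b: str


# The generated __init__ accepts the fields in declaration order,
# either positionally or by keyword.
first = MyClass('hello', 'world')
second = MyClass(var_a='hello', var_b='world')

# The generated __repr__ lists every field, and __eq__ compares field by field.
print(first)            # MyClass(var_a='hello', var_b='world')
print(first == second)  # True

# Type hints are not enforced at runtime: this runs without an error,
# although a static checker such as mypy would flag it.
third = MyClass(1, 2)
print(third)            # MyClass(var_a=1, var_b=2)
```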
So let’s have a look at how we can use this!
The Star Wars API
You know a movie’s fanbase is passionate when a fan creates a REST API with the movie’s data in it. One Star Wars fan has done exactly that, and created the Star Wars API. He’s actually gone even further, and created a Python wrapper library for it.
Let’s forget for a second that there’s already a wrapper out there, and see how we could write our own.
We can use the requests library to get a resource from the Star Wars API:
response = requests.get('https://swapi.co/api/films/1/')
This endpoint (like all swapi endpoints) responds with a JSON message. Requests makes our life easier by offering JSON parsing:
dictionary = response.json()
And at this point we have our data in a dictionary. Let’s have a look at it (shortened):
    {
        'characters': ['https://swapi.co/api/people/1/', … ],
        'created': '2014-12-10T14:23:31.880000Z',
        'director': 'George Lucas',
        'edited': '2015-04-11T09:46:52.774897Z',
        'episode_id': 4,
        'opening_crawl': 'It is a period of civil war.\r\n … ',
        'planets': ['https://swapi.co/api/planets/2/', ...],
        'producer': 'Gary Kurtz, Rick McCallum',
        'release_date': '1977-05-25',
        'species': ['https://swapi.co/api/species/5/', ...],
        'starships': ['https://swapi.co/api/starships/2/', ...],
        'title': 'A New Hope',
        'url': 'https://swapi.co/api/films/1/',
        'vehicles': ['https://swapi.co/api/vehicles/4/', ...]
    }
Wrapping the API
To properly wrap an API, we should create objects that our wrapper’s user can use in their application. So let’s define an object in Python 3.6 to contain the responses of requests to the /films/ endpoint:
    from datetime import datetime
    from typing import List

    import dateutil.parser


    class StarWarsMovie:

        def __init__(self,
                     title: str,
                     episode_id: int,
                     opening_crawl: str,
                     director: str,
                     producer: str,
                     release_date: datetime,
                     characters: List[str],
                     planets: List[str],
                     starships: List[str],
                     vehicles: List[str],
                     species: List[str],
                     created: datetime,
                     edited: datetime,
                     url: str
                     ):
            self.title = title
            self.episode_id = episode_id
            self.opening_crawl = opening_crawl
            self.director = director
            self.producer = producer
            self.release_date = release_date
            self.characters = characters
            self.planets = planets
            self.starships = starships
            self.vehicles = vehicles
            self.species = species
            self.created = created
            self.edited = edited
            self.url = url

            # The API delivers dates as strings, so parse them when needed.
            if type(self.release_date) is str:
                self.release_date = dateutil.parser.parse(self.release_date)

            if type(self.created) is str:
                self.created = dateutil.parser.parse(self.created)

            if type(self.edited) is str:
                self.edited = dateutil.parser.parse(self.edited)
Careful readers may have noticed a little bit of duplicated code here. Not so careful readers may want to have a look at the complete Python 3.6 implementation: it’s not short.
This is a classic case of where the data class decorator can help you out. We’re creating a class that mostly holds data, and only does a little validation. So let’s have a look at what we need to change.
Firstly, data classes automatically generate several dunder methods. If we don’t specify any options to the dataclass decorator, the generated methods are: __init__, __eq__, and __repr__. Python by default (not just for data classes) will implement __str__ to return the output of __repr__ if you’ve defined __repr__ but not __str__. Therefore, you get four dunder methods implemented just by changing the code to:
    @dataclass
    class StarWarsMovie:
        title: str
        episode_id: int
        opening_crawl: str
        director: str
        producer: str
        release_date: datetime
        characters: List[str]
        planets: List[str]
        starships: List[str]
        vehicles: List[str]
        species: List[str]
        created: datetime
        edited: datetime
        url: str
We removed the __init__ method here to make sure the data class decorator can add the one it generates. Unfortunately, we lost a bit of functionality in the process. Our Python 3.6 constructor didn’t just define all values, but it also attempted to parse dates. How can we do that with a data class?
If we were to override __init__, we’d lose the benefit of the data class. Therefore a new dunder method was defined for any additional processing: __post_init__. Let’s see what a __post_init__ method would look like for our wrapper class:
    def __post_init__(self):
        if type(self.release_date) is str:
            self.release_date = dateutil.parser.parse(self.release_date)

        if type(self.created) is str:
            self.created = dateutil.parser.parse(self.created)

        if type(self.edited) is str:
            self.edited = dateutil.parser.parse(self.edited)
And that’s it! With the data class decorator, we could implement our class in under a third of the lines it took without it.
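Putting the pieces together, the following is a self-contained sketch rather than the wrapper’s actual code. To keep it runnable without third-party packages, it makes a few assumptions: a cut-down, hypothetical Film class with only three fields, a hand-written sample dictionary instead of a live API response, and the standard library’s datetime.fromisoformat in place of dateutil.parser.parse. The Union type hint is also an assumption, chosen because the field can arrive as a string.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union


@dataclass
class Film:
    # Hypothetical, shortened stand-in for StarWarsMovie.
    title: str
    episode_id: int
    # The field may arrive as a string, so the hint allows both types.
    release_date: Union[datetime, str]

    def __post_init__(self):
        # Runs right after the generated __init__ has assigned the fields.
        if isinstance(self.release_date, str):
            self.release_date = datetime.fromisoformat(self.release_date)


# Sample data shaped like part of the /films/ response shown earlier.
data = {'title': 'A New Hope', 'episode_id': 4, 'release_date': '1977-05-25'}

# Unpacking works because the dictionary keys match the field names.
film = Film(**data)
print(film.release_date)  # 1977-05-25 00:00:00
```

Unpacking with ** only works when every key in the dictionary is also a field of the class, so a full response would need either all fields declared or the extra keys filtered out first.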
More goodies
By using options with the decorator, you can tailor data classes further for your use case. The default options are:
@dataclass(init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False)
- init determines whether to generate the __init__ dunder method.
- repr determines whether to generate the __repr__ dunder method.
- eq does the same for the __eq__ dunder method, which determines the behavior for equality checks (your_class_instance == another_instance).
- order actually creates four dunder methods, which determine the behavior for all less-than and greater-than checks. If you set this to True, you can sort a list of your objects.
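As a small illustration of the order option, here is a sketch with a hypothetical Episode class. The generated comparison methods compare instances as if they were tuples of their fields, in declaration order.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Episode:
    # Hypothetical class: instances compare like the tuple (episode_id, title).
    episode_id: int
    title: str


episodes = [
    Episode(5, 'The Empire Strikes Back'),
    Episode(4, 'A New Hope'),
    Episode(6, 'Return of the Jedi'),
]

# sorted() relies on the generated __lt__ method.
ordered = sorted(episodes)
print([episode.episode_id for episode in ordered])  # [4, 5, 6]
print(Episode(4, 'A New Hope') < Episode(5, 'The Empire Strikes Back'))  # True
```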
The last two options determine whether or not your object can be hashed. This is necessary (for example) if you want to use your class’ objects as dictionary keys. An object’s hash should remain constant for its lifetime, otherwise the dictionary will not be able to find your objects anymore. The default implementation of a data class’ __hash__ function returns a hash over all fields in the data class. Therefore it’s only generated by default if you also make your objects read-only (by specifying frozen=True).
If you set frozen=True, any write to your object will raise an error. If you think this is too draconian, but you know the object will never change, you could specify unsafe_hash=True instead. The authors of the data class decorator recommend that you don’t, though.
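The sketch below shows both behaviors side by side. Planet and MutablePlanet are hypothetical classes invented for this example.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Planet:
    # Hypothetical read-only class used only for this illustration.
    name: str
    climate: str


tatooine = Planet('Tatooine', 'arid')

# Frozen instances are hashable, so they can serve as dictionary keys.
residents = {tatooine: ['Luke Skywalker']}
print(residents[Planet('Tatooine', 'arid')])  # ['Luke Skywalker']

# Any attempt to assign to a field raises FrozenInstanceError.
try:
    tatooine.climate = 'temperate'
except FrozenInstanceError:
    print('cannot modify a frozen instance')


@dataclass
class MutablePlanet:
    name: str


# With the defaults (eq=True, frozen=False), __hash__ is set to None,
# so instances are unhashable.
try:
    hash(MutablePlanet('Hoth'))
except TypeError:
    print('MutablePlanet instances are unhashable')
```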
If you want to learn more about data classes, you can read the PEP or just get started and play with them yourself! Let us know in the comments what you’re using data classes for!
Varun Ramesh says:
April 18, 2018
It seems to me that the ‘StarWarsMovie’ dataclass will fail a static type check if a string is passed in as an argument for ‘release_date’, ‘created’, or ‘edited’. Since type annotations support unions, I think that ‘Union[datetime, str]’ might be the right annotation.
Peter Norvig says:
April 18, 2018
`post_init` could be

    for attr in ['release_date', 'created', 'edited']:
        if isinstance(getattr(self, attr), str):
            setattr(self, attr, dateutil.parser.parse(getattr(self, attr)))
Wiliam says:
April 18, 2018
I wonder if this hinders readability or understanding. Does it? When do we start to worry about these small details?
victor n. says:
April 18, 2018
oh wow.
Wiliam, i don’t know about readability/understanding but what Peter wrote above is what is more maintainable. it’s preferable (imo) to the original. all you need to do now is add attributes to that list above and everything automagically works.
Kevin says:
April 18, 2018
Not really, it might look weird to someone with little coding experience in Python (<2 years) but with a few years of proficiency this sort of thing becomes commonplace. Although I probably wouldn’t write it exactly how the guy above did, or I’d at least surround all that getattr/setattr stuff with a comment explaining why this is done in a loop.
The loop takes about 3 lines so if I have only 3 attributes to do this on I might not turn it into a loop/dynamic thing.
MrObvious says:
April 19, 2018
Dude, that’s Peter ‘f*cking’ Norvig .. “That guy above” jeez.. lol 😉
Anentropic says:
April 19, 2018
it’s better in every way
Kevin Galkov says:
April 18, 2018
I wonder if this hinders readability or understanding. Does it? When do we start to worry about these small details?
Chris Adams says:
April 18, 2018
Kevin: I generally prefer this style because it makes it immediately obvious that all of the listed fields have intentionally identical behaviour. That might be a minor thing now but it tends to avoid bugs later when maintenance work means someone either has to confirm that intention or, worse, misses an instance and the behaviour is subtly no longer consistent.
Jeremy Kun says:
April 18, 2018
Peter has spoken.
Quaint Alien says:
April 18, 2018
Love your code golfing tricks!
tm says:
April 18, 2018
Reviewing the non-dataclass class, if your constructor can take a `str` or `datetime` argument for the date objects, shouldn’t the __init__ arguments for the date objects be `Union[str, datetime]`?
Also, mypy doesn’t like the way that the parse function is called with a typed `datetime` argument: `Argument 1 to “parse” has incompatible type “datetime”; expected “Union[bytes, str, IO[str], IO[Any]]”` Not sure how you rectified that.
Darren says:
April 18, 2018
That moves python closer to the scala case class
– https://docs.scala-lang.org/tour/case-classes.html
Given python and scala are commonly used in big-data (Spark), some kind of python/scala convergence is not too surprising. The key difference, however, is that scala will catch type errors at compile-time.
Anentropic says:
April 19, 2018
mypy will catch Python type errors at “build time” i.e. whenever you choose to run mypy, perhaps as part of your CI tests
Brian Bruggeman says:
April 18, 2018
on dataclasses, I’m not sure types are needed anymore: https://twitter.com/raymondh/status/959153776484470784?lang=en
I think this is super awesome; Python should never require types.
Wagner Macedo says:
April 20, 2018
And we lose readability. Like it or not, we programmers have to deal with data types all the time. If there is a standard way to document the types (this was the main reason behind type hinting), why not use it?
bc says:
April 19, 2018
Better to use a library like Traits (https://pypi.org/project/traits/) or Atom (https://pypi.org/project/atom/).
Eric Frederich says:
April 19, 2018
What if the json-object returns a key which is a reserved word or otherwise not a valid Python variable name?
I suppose you could define a @classmethod called from_json_response or something which would then return something like cls(a=data['a'], b=data['b'], …etc) where a mapping of json names to python names could be enumerated. Unfortunately this seems to repeat a lot of code.
I think golang lets you decorate structs saying what the JSON keys should be when serializing/deserializing.