Python 3.7: Introducing Data Classes

Ernst Haagsman

Python 3.7 is set to be released this summer, so let’s have a sneak peek at some of the new features! If you’d like to play along at home with PyCharm, make sure you get PyCharm 2018.1 (or later if you’re reading this from the future).

There are many new things in Python 3.7: various character set improvements, postponed evaluation of annotations, and more. One of the most exciting new features is support for the dataclass decorator.

What is a Data Class?

Most Python developers will have written many classes that look like this:

class MyClass:
    def __init__(self, var_a, var_b):
        self.var_a = var_a
        self.var_b = var_b

Data classes help you by automatically generating dunder methods for simple cases, for example an __init__ that accepts those arguments and assigns each one to self. The small example above could be rewritten like this:

from dataclasses import dataclass

@dataclass
class MyClass:
    var_a: str
    var_b: str

A key difference is that type hints are actually required for data classes: the decorator only treats annotated class attributes as fields. If you’ve never used a type hint before: they allow you to mark what type a certain variable should be. At runtime, these types are not checked, but you can use PyCharm or a command-line tool like mypy to check your code statically.
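As a minimal sketch of what the decorator gives you, the snippet below reuses the MyClass example and exercises the generated __init__, __repr__, and __eq__ methods. The variable names first, second, and third are only illustrative. The last lines show that the type hints are not enforced when the code runs.

```python
from dataclasses import dataclass


@dataclass
class MyClass:
    var_a: str
    var_b: str


# The generated __init__ accepts the fields in declaration order,
# positionally or by keyword.
first = MyClass("spam", "eggs")
second = MyClass(var_a="spam", var_b="eggs")

# The generated __repr__ and __eq__ work on the field values.
print(first)            # MyClass(var_a='spam', var_b='eggs')
print(first == second)  # True

# Type hints are not enforced at runtime: this runs without error,
# although a static checker such as mypy would flag it.
third = MyClass(1, 2)
print(third.var_a)      # 1
```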

So let’s have a look at how we can use this!

The Star Wars API

You know a movie’s fanbase is passionate when a fan creates a REST API with the movie’s data in it. One Star Wars fan has done exactly that, and created the Star Wars API. He’s actually gone even further, and created a Python wrapper library for it.

Let’s forget for a second that there’s already a wrapper out there, and see how we could write our own.

We can use the requests library to get a resource from the Star Wars API:

import requests
response = requests.get('https://swapi.co/api/films/1/')

This endpoint (like all swapi endpoints) responds with a JSON message. Requests makes our life easier by offering JSON parsing:

dictionary = response.json()

And at this point we have our data in a dictionary. Let’s have a look at it (shortened):

{
 'characters': ['https://swapi.co/api/people/1/',
                ...],
 'created': '2014-12-10T14:23:31.880000Z',
 'director': 'George Lucas',
 'edited': '2015-04-11T09:46:52.774897Z',
 'episode_id': 4,
 'opening_crawl': 'It is a period of civil war.\r\n ... ',
 'planets': ['https://swapi.co/api/planets/2/',
             ...],
 'producer': 'Gary Kurtz, Rick McCallum',
 'release_date': '1977-05-25',
 'species': ['https://swapi.co/api/species/5/',
             ...],
 'starships': ['https://swapi.co/api/starships/2/',
               ...],
 'title': 'A New Hope',
 'url': 'https://swapi.co/api/films/1/',
 'vehicles': ['https://swapi.co/api/vehicles/4/',
              ...]
}

Wrapping the API

To properly wrap an API, we should create objects that our wrapper’s user can use in their application. So let’s define an object in Python 3.6 to contain the responses of requests to the /films/ endpoint:

from datetime import datetime
from typing import List

import dateutil.parser

class StarWarsMovie:

   def __init__(self,
                title: str,
                episode_id: int,
                opening_crawl: str,
                director: str,
                producer: str,
                release_date: datetime,
                characters: List[str],
                planets: List[str],
                starships: List[str],
                vehicles: List[str],
                species: List[str],
                created: datetime,
                edited: datetime,
                url: str
                ):

       self.title = title
       self.episode_id = episode_id
       self.opening_crawl = opening_crawl
       self.director = director
       self.producer = producer
       self.release_date = release_date
       self.characters = characters
       self.planets = planets
       self.starships = starships
       self.vehicles = vehicles
       self.species = species
       self.created = created
       self.edited = edited
       self.url = url

       if type(self.release_date) is str:
           self.release_date = dateutil.parser.parse(self.release_date)

       if type(self.created) is str:
           self.created = dateutil.parser.parse(self.created)

       if type(self.edited) is str:
           self.edited = dateutil.parser.parse(self.edited)

Careful readers may have noticed a little bit of duplicated code here. Not so careful readers may want to have a look at the complete Python 3.6 implementation: it’s not short.

This is a classic case of where the data class decorator can help you out. We’re creating a class that mostly holds data, and only does a little validation. So let’s have a look at what we need to change.

Firstly, data classes automatically generate several dunder methods. If we don’t specify any options to the dataclass decorator, the generated methods are: __init__, __eq__, and __repr__. Python by default (not just for data classes) will implement __str__ to return the output of __repr__ if you’ve defined __repr__ but not __str__. Therefore, you get four dunder methods implemented just by changing the code to:

@dataclass
class StarWarsMovie:
   title: str
   episode_id: int
   opening_crawl: str
   director: str
   producer: str
   release_date: datetime
   characters: List[str]
   planets: List[str]
   starships: List[str]
   vehicles: List[str]
   species: List[str]
   created: datetime
   edited: datetime
   url: str

We removed the __init__ method here to make sure the data class decorator can add the one it generates. Unfortunately, we lost a bit of functionality in the process. Our Python 3.6 constructor didn’t just define all values, but it also attempted to parse dates. How can we do that with a data class?

If we were to override __init__, we’d lose the benefit of the data class. Therefore a new dunder method was defined for any additional processing: __post_init__. Let’s see what a __post_init__ method would look like for our wrapper class:

def __post_init__(self):
   if type(self.release_date) is str:
       self.release_date = dateutil.parser.parse(self.release_date)

   if type(self.created) is str:
       self.created = dateutil.parser.parse(self.created)

   if type(self.edited) is str:
       self.edited = dateutil.parser.parse(self.edited)

And that’s it! With the data class decorator, our class takes under a third of the lines that the version without it needed.
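To see the pieces working together, here is a small self-contained sketch. It makes several assumptions that are not part of the wrapper above: Film is a hypothetical, trimmed-down stand-in for StarWarsMovie; the date field is annotated as Union[datetime, str] so that a static checker accepts both forms; and the standard library's datetime.fromisoformat replaces dateutil.parser.parse so the snippet runs without third-party packages. The data dictionary is a hand-copied subset of the response shown earlier.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Union


@dataclass
class Film:
    # Hypothetical, trimmed-down stand-in for StarWarsMovie.
    title: str
    episode_id: int
    release_date: Union[datetime, str]

    def __post_init__(self):
        # Called by the generated __init__ after it has assigned the fields.
        if isinstance(self.release_date, str):
            # Stdlib parser, used here instead of dateutil.parser.parse.
            self.release_date = datetime.fromisoformat(self.release_date)


# A subset of the API response shown earlier.
data = {
    "title": "A New Hope",
    "episode_id": 4,
    "release_date": "1977-05-25",
}

# The keys must match the field names; an unexpected key raises TypeError.
film = Film(**data)
print(film.release_date.year)  # 1977
```

Unpacking the dictionary with ** only works when every key corresponds to a field, so a real wrapper would either declare all fields or filter the response first.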

More goodies

By using options with the decorator, you can tailor data classes further for your use case. The default options are:

@dataclass(init=True, repr=True, eq=True, order=False, unsafe_hash=False, frozen=False)

  • init determines whether to generate the __init__ dunder method.
  • repr determines whether to generate the __repr__ dunder method.
  • eq does the same for the __eq__ dunder method, which determines the behavior for equality checks (your_class_instance == another_instance).
  • order actually creates four dunder methods (__lt__, __le__, __gt__, and __ge__), which determine the behavior of all less-than and greater-than comparisons. If you set this to True, you can sort a list of your objects.
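As a brief illustration of the eq and order options, the sketch below uses a hypothetical Episode class (not part of the wrapper). With order=True, instances compare as if they were tuples of their fields in declaration order.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Episode:
    # Hypothetical class; comparisons use the fields in declaration order,
    # as if each instance were the tuple (episode_id, title).
    episode_id: int
    title: str


episodes = [
    Episode(5, "The Empire Strikes Back"),
    Episode(4, "A New Hope"),
    Episode(6, "Return of the Jedi"),
]

# sorted() works because order=True generated the comparison methods.
in_order = sorted(episodes)
print([e.episode_id for e in in_order])  # [4, 5, 6]

# eq=True (the default) compares instances field by field.
print(Episode(4, "A New Hope") == Episode(4, "A New Hope"))  # True
```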

The last two options determine whether or not your object can be hashed. This is necessary (for example) if you want to use your class’s objects as dictionary keys. An object’s hash should remain constant for its lifetime, otherwise the dictionary will not be able to find your objects anymore. The default implementation of a data class’s __hash__ function returns a hash over all fields in the data class. Therefore it’s only generated by default if you also make your objects read-only (by specifying frozen=True).

If you set frozen=True, any write to your object’s fields will raise an error (dataclasses.FrozenInstanceError). If you think this is too draconian, but you still know the object will never change, you could specify unsafe_hash=True instead. The authors of the data class decorator recommend that you don’t, though.
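The hashing behavior can be checked with a short sketch. FrozenFilm and MutableFilm are hypothetical classes used only for illustration, and the ratings dictionary is made-up example data.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenFilm:
    title: str
    episode_id: int


film = FrozenFilm("A New Hope", 4)

# eq=True plus frozen=True generates __hash__, so instances can be dict keys.
ratings = {film: 5}
print(ratings[FrozenFilm("A New Hope", 4)])  # 5

# Assigning to a field of a frozen instance raises FrozenInstanceError.
try:
    film.title = "Changed"
except FrozenInstanceError:
    print("cannot modify a frozen instance")


@dataclass
class MutableFilm:
    title: str
    episode_id: int


# With the defaults (eq=True, frozen=False), __hash__ is set to None.
try:
    hash(MutableFilm("A New Hope", 4))
except TypeError:
    print("mutable data class instances are unhashable")
```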

If you want to learn more about data classes, you can read the PEP or just get started and play with them yourself! Let us know in the comments what you’re using data classes for!


18 Responses to Python 3.7: Introducing Data Classes

  1. Varun Ramesh says:

    April 18, 2018

    It seems to me that the ‘StarWarsMovie’ dataclass will fail a static type check if a string is passed in as an argument for ‘release_date’, ‘created’, or ‘edited’. Since type annotations support unions, I think that ‘Union[datetime, str]’ might be the right annotation.

  2. Peter Norvig says:

    April 18, 2018

    `post_init` could be

    for attr in ['release_date', 'created', 'edited']:
        if isinstance(getattr(self, attr), str):
            setattr(self, attr, dateutil.parser.parse(getattr(self, attr)))

    • Wiliam says:

      April 18, 2018

      I wonder if this hinders readability or understanding. Does it? When do we start to worry about these small details?

      • victor n. says:

        April 18, 2018

        oh wow.

        Wiliam, i don’t know about readability/understanding but what Peter wrote above is what is more maintainable. it’s preferable (imo) to the original. all you need to do now is add attributes to that list above and everything automagically works.

      • Kevin says:

        April 18, 2018

        Not really, it might look weird to someone with little coding experience in Python (<2 years) but with a few years of proficiency this sort of thing becomes commonplace. Although I probably wouldn’t write it exactly how the guy above did, or I’d at least surround all that getattr/setattr stuff with a comment explaining why this is done in a loop.

        The loop takes about 3 lines so if I have only 3 attributes to do this on I might not turn it into a loop/dynamic thing.

        • MrObvious says:

          April 19, 2018

          Dude, that’s Peter ‘f*cking Norvig ” .. That guy above” jeez.. lol 😉

      • Anentropic says:

        April 19, 2018

        it’s better in every way

    • Kevin Galkov says:

      April 18, 2018

      I wonder if this hinders readability or understanding. Does it? When do we start to worry about these small details?

      • Chris Adams says:

        April 18, 2018

        Kevin: I generally prefer this style because it makes it immediately obvious that all of the listed fields have intentionally identical behaviour. That might be a minor thing now but it tends to avoid bugs later when maintenance work means someone either has to confirm that intention or, worse, misses an instance and the behaviour is subtly no longer consistent.

    • Jeremy Kun says:

      April 18, 2018

      Peter has spoken.

    • Quaint Alien says:

      April 18, 2018

      Love your code golfing tricks!

  3. tm says:

    April 18, 2018

    Reviewing the non-dataclass class, if your constructor can take a `str` or `datetime` argument for the date objects, shouldn’t the __init__ arguments for the date objects be `Union[str, datetime]`?

    Also, mypy doesn’t like the way that the parse function is called with a typed `datetime` argument: `Argument 1 to “parse” has incompatible type “datetime”; expected “Union[bytes, str, IO[str], IO[Any]]”` Not sure how you rectified that.

  4. Darren says:

    April 18, 2018

    That moves python closer to the scala case class
    https://docs.scala-lang.org/tour/case-classes.html

    Given python and scala are commonly used in big-data (Spark), some kind of python/scala convergence is not too surprising. The key difference, however, is that scala will catch type errors at compile-time.

    • Anentropic says:

      April 19, 2018

      mypy will catch Python type errors at “build time” i.e. whenever you choose to run mypy, perhaps as part of your CI tests

  5. Brian Bruggeman says:

    April 18, 2018

    on dataclasses, I’m not sure types are needed anymore: https://twitter.com/raymondh/status/959153776484470784?lang=en

    I think this is super awesome; Python should never require types.

    • Wagner Macedo says:

      April 20, 2018

      And we lose readability. Like it or not, we programmers have to deal with data types every time, if there is a standard way to document the types (this was the main reason behind type hinting), why not to use?

  6. bc says:

    April 19, 2018

    Better to use a library like Traits (https://pypi.org/project/traits/) or Atom (https://pypi.org/project/atom/).

  7. Eric Frederich says:

    April 19, 2018

    What if the json-object returns a key which is a reserved word or otherwise not a valid Python variable name?
    I suppose you could define a @classmethod called from_json_response or something which would then return something like cls(a=data['a'], b=data['b'], …etc) where a mapping of json names to python names could be enumerated. Unfortunately this seems to repeat a lot of code.

    I think golang lets you decorate structs saying what the JSON keys should be when serializing/deserializing.
