Python Bug Analysis: dataclasses and namedtuples
tl;dr: I found a bug in Python 3.7.0. Skip ahead to The Bug to try it out yourself.
Background
In Python 3.7, the dataclasses module was added to make a few programming use cases easier to manage.
Dataclasses eliminate boilerplate code one would write in Python <3.7.
# Python 3.6
class Example:
    def __init__(self, val1: str, val2: str, val3: str):
        self.val1 = val1
        self.val2 = val2
        self.val3 = val3

example = Example("here's", "an", "example")
This code can be rewritten, like so:
# Python 3.7
from dataclasses import dataclass

@dataclass
class Example:
    val1: str
    val2: str
    val3: str

example = Example("here's", "an", "example")
Dataclasses provide us with automatically generated comparison dunder methods (__eq__ by default, ordering methods with order=True), the ability to make our objects mutable or immutable, and the ability to decompose them into a dictionary of type Dict[str, Any].
Let’s see that in action:
from dataclasses import dataclass, asdict  # asdict is needed for the call below

@dataclass
class Example:
    val1: str
    val2: str
    val3: str

example = Example("here's", "an", "example")
print(asdict(example))
>>> {'val1': "here's", 'val2': 'an', 'val3': 'example'}
Awesome! I’m sure you can find a few situations where this would be useful.
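The other two features mentioned above, comparison methods and immutability, can be sketched the same way. This is only a minimal illustration: the Point class and its fields are hypothetical names chosen for the example, not something from the bug report.

```python
from dataclasses import dataclass, FrozenInstanceError


# Point is a hypothetical class used only for illustration.
# frozen=True makes instances immutable; order=True adds <, <=, >, >=.
@dataclass(frozen=True, order=True)
class Point:
    x: int
    y: int


a = Point(1, 2)
b = Point(1, 2)
c = Point(1, 3)

print(a == b)  # True: __eq__ compares field by field
print(a < c)   # True: ordering compares fields as a tuple (1, 2) < (1, 3)

try:
    a.x = 10  # assignment on a frozen dataclass is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

print(mutated)  # False
```

Leaving out frozen=True gives the default, mutable behavior.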
The Bug
What happens when you compose a dataclass with a namedtuple?
from dataclasses import dataclass, asdict
from typing import NamedTuple

class NamedTupleAttribute(NamedTuple):
    example: bool

@dataclass
class Data:
    attr: NamedTupleAttribute

data = Data(NamedTupleAttribute(example=True))
data_dict = asdict(data)
namedtuple_attr = data_dict['attr']
print(namedtuple_attr.example)
>>> <generator object _asdict_inner.<locals>.<genexpr> at 0x107f45408>
data.attr.example is a bool, so shouldn’t namedtuple_attr.example be one too? Why does it evaluate to a generator object?
To answer those questions, we’ll need to look at two things: first, how the tuple and namedtuple factories differ, and then asdict()’s implementation.
tuple() takes an iterable as its only argument and exhausts it while building a new object. A namedtuple class, however, takes one argument per field and stores each argument as-is, so a generator supplied as an argument is not exhausted.
print(tuple(x for x in range(5)))
>>> (0, 1, 2, 3, 4)
print(NamedTupleAttribute(x for x in range(5)))
>>> NamedTupleAttribute(example=<generator object <genexpr> at 0x107f45318>)
Where does this fit in with asdict()? We’ll need to look at its implementation to understand.
def asdict(obj, *, dict_factory=dict):
    if not _is_dataclass_instance(obj):
        raise TypeError("asdict() should be called on dataclass instances")
    return _asdict_inner(obj, dict_factory)

def _asdict_inner(obj, dict_factory):
    if _is_dataclass_instance(obj):
        result = []
        for f in fields(obj):
            value = _asdict_inner(getattr(obj, f.name), dict_factory)
            result.append((f.name, value))
        return dict_factory(result)
    elif isinstance(obj, (list, tuple)):
        return type(obj)(_asdict_inner(v, dict_factory) for v in obj)  # right here
    elif isinstance(obj, dict):
        return type(obj)((_asdict_inner(k, dict_factory), _asdict_inner(v, dict_factory))
                         for k, v in obj.items())
    else:
        return copy.deepcopy(obj)
_asdict_inner() passes a generator to the constructor of any object that is an instance of tuple, expecting the tuple factory to consume it. Classes created with typing.NamedTuple and collections.namedtuple() are subclasses of tuple, but they override its __new__() functionality.
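A quick check makes the subclass relationship concrete. The Flag class below is a hypothetical stand-in for any namedtuple; the point is only that it passes the isinstance() test while still being a distinct type that carries a _fields attribute.

```python
from typing import NamedTuple


class Flag(NamedTuple):  # hypothetical example class
    example: bool


flag = Flag(True)

# A namedtuple instance satisfies the isinstance() check in _asdict_inner() ...
print(isinstance(flag, tuple))      # True
# ... even though its class is a subclass of tuple, not tuple itself.
print(type(flag) is tuple)          # False
# Namedtuple classes expose their field names via _fields; plain tuples do not.
print(Flag._fields)                 # ('example',)
print(hasattr((True,), "_fields"))  # False
```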
Here’s what happens when asdict() is called on a dataclass that has a namedtuple field:
- _asdict_inner() recurses on the dataclass object’s fields.
- When it reaches a field whose value is an instance of tuple, it calls the object type’s constructor with a generator expression of fields.
  - If it’s a plain tuple, the generator expression is exhausted and a tuple with the generator’s values is produced.
  - If it’s a NamedTuple, the anonymous generator object is not iterated over and is assigned as a field on the NamedTuple.
- asdict() returns a new Dict[str, Any] with malformed NamedTuples.
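The two outcomes in the steps above can be reproduced without dataclasses at all. The sketch below isolates the one expression that matters; rebuild() and Flag are hypothetical names, and rebuild() only mimics the tuple branch of _asdict_inner() without the recursion.

```python
from typing import NamedTuple


class Flag(NamedTuple):  # hypothetical example class
    example: bool


def rebuild(obj):
    # Same shape as the tuple branch in Python 3.7.0: hand one generator
    # to the constructor of whatever type obj happens to be.
    return type(obj)(v for v in obj)


plain = rebuild((True,))
named = rebuild(Flag(True))

print(plain)                         # (True,)
print(type(named.example).__name__)  # generator
```

With a plain tuple the generator is consumed; with the namedtuple it is stored, untouched, as the value of the first field.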
Proposed Solutions
Both Eric V. Smith and Ivan Levkivskyi have quickly proposed solutions to this issue.
Ivan Levkivskyi has suggested that _asdict_inner apply the generator expression only to the standard library types list and tuple, letting NamedTuples fall through to the branch that deep-copies the object.
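As I understand that suggestion, it amounts to an exact type check instead of isinstance(). The following is my own rough sketch of the idea, not code from the proposal; convert() and Flag are hypothetical names.

```python
import copy
from typing import NamedTuple


class Flag(NamedTuple):  # hypothetical example class
    example: bool


def convert(obj):
    # Hypothetical stand-in for the container branches of _asdict_inner().
    # An exact type match excludes subclasses such as namedtuples.
    if type(obj) in (list, tuple):
        return type(obj)(convert(v) for v in obj)
    # Everything else, namedtuples included, is deep-copied.
    return copy.deepcopy(obj)


print(convert((1, [2, 3])))  # (1, [2, 3])
print(convert(Flag(True)))   # Flag(example=True)
```

One consequence, as far as I can tell, is that a dataclass instance nested inside a namedtuple would be copied as-is rather than converted to a dictionary.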
Eric Smith proposed a solution in which the generator expression is expanded with star-notation as it is passed to the tuple factory method.
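Here is a small sketch of what star-notation does in this situation. Pair is a hypothetical namedtuple, and the _make() call is a related classmethod on namedtuple classes that I am including for comparison; it is not part of either proposal.

```python
from collections import namedtuple

Pair = namedtuple("Pair", ["left", "right"])  # hypothetical example class

values = [1, 2]

# Star-notation expands the generator into one positional argument per field.
unpacked = Pair(*(v for v in values))
# _make() is the namedtuple classmethod that accepts a single iterable.
made = Pair._make(v for v in values)

print(unpacked)          # Pair(left=1, right=2)
print(made == unpacked)  # True

# A plain tuple still needs the iterable passed whole, so star-notation
# cannot simply replace the existing call for every tuple type.
try:
    tuple(*(v for v in values))
    star_works_for_tuple = True
except TypeError:
    star_works_for_tuple = False

print(star_works_for_tuple)  # False
```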
In my (very humble) opinion, NamedTuple is a special case of tuple in the standard library, so one solution might be to give namedtuples their own branch with special behavior in _asdict_inner().
def _asdict_inner(obj, dict_factory):
    if _is_dataclass_instance(obj):
        result = []
        for f in fields(obj):
            value = _asdict_inner(getattr(obj, f.name), dict_factory)
            result.append((f.name, value))
        return dict_factory(result)
    elif isinstance(obj, tuple) and hasattr(obj, '_fields'):
        # isinstance() against typing.NamedTuple does not identify namedtuple
        # instances, so look for a tuple that carries a _fields attribute.
        return type(obj)(*(_asdict_inner(v, dict_factory) for v in obj))  # right here
    elif isinstance(obj, (list, tuple)):
        return type(obj)(_asdict_inner(v, dict_factory) for v in obj)
    elif isinstance(obj, dict):
        return type(obj)((_asdict_inner(k, dict_factory), _asdict_inner(v, dict_factory))
                         for k, v in obj.items())
    else:
        return copy.deepcopy(obj)
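To try the idea outside the standard library, here is a standalone sketch that can be run as-is. It is not the stdlib code: asdict_sketch() and _is_namedtuple() are hypothetical names, the public is_dataclass() and fields() helpers stand in for the private ones, and namedtuples are detected by the presence of a _fields attribute, which is an assumption on my part.

```python
import copy
from dataclasses import dataclass, fields, is_dataclass
from typing import NamedTuple


def _is_namedtuple(obj):
    # Assumption: a tuple instance with _fields is treated as a namedtuple.
    return isinstance(obj, tuple) and hasattr(obj, "_fields")


def asdict_sketch(obj, dict_factory=dict):
    if is_dataclass(obj) and not isinstance(obj, type):
        return dict_factory(
            [(f.name, asdict_sketch(getattr(obj, f.name), dict_factory))
             for f in fields(obj)]
        )
    elif _is_namedtuple(obj):
        # Star-notation: one positional argument per namedtuple field.
        return type(obj)(*(asdict_sketch(v, dict_factory) for v in obj))
    elif isinstance(obj, (list, tuple)):
        return type(obj)(asdict_sketch(v, dict_factory) for v in obj)
    elif isinstance(obj, dict):
        return type(obj)(
            (asdict_sketch(k, dict_factory), asdict_sketch(v, dict_factory))
            for k, v in obj.items()
        )
    return copy.deepcopy(obj)


class NamedTupleAttribute(NamedTuple):
    example: bool


@dataclass
class Data:
    attr: NamedTupleAttribute


result = asdict_sketch(Data(NamedTupleAttribute(example=True)))
print(result["attr"].example)  # True
```

With the extra branch in place, the field keeps its bool value instead of holding a generator.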
Conclusion
Python 3.7 introduces new features that will make development even faster and more fun.
Though I ran into a bug, it is an edge case. Response to my bug report was very quick, polite and professional. A couple of very intelligent people jumped into bugfix mode almost instantly after it was reported.
The rate at which the Python community responds to developer needs and concerns is impressive. Thanks to the contributors who make this project a success!