A common problem that you can face when working with Python dictionaries is trying to access or modify keys that don’t exist in the dictionary. This raises a KeyError and breaks your code’s execution. To handle these kinds of situations, the standard library provides the Python defaultdict type, a dictionary-like class that’s available in collections.
The Python defaultdict type behaves almost exactly like a regular Python dictionary, but if you try to access or modify a missing key, then defaultdict will automatically create the key and generate a default value for it. This makes defaultdict a valuable option for handling missing keys in dictionaries.
In this tutorial, you’ll learn:
- How to use the Python defaultdict type for handling missing keys in a dictionary
- When and why to use a Python defaultdict rather than a regular dict
- How to use a defaultdict for grouping, counting, and accumulating operations
With this knowledge under your belt, you’ll be better equipped to use the Python defaultdict type effectively in your day-to-day programming challenges.
To get the most out of this tutorial, you should have some previous understanding of what Python dictionaries are and how to work with them.
Handling Missing Keys in Dictionaries
A common issue that you can face when working with Python dictionaries is how to handle missing keys. If your code is heavily based on dictionaries, or if you’re creating dictionaries on the fly all the time, then you’ll soon notice that dealing with frequent KeyError exceptions can be quite annoying and can add extra complexity to your code. With Python dictionaries, you have at least four available ways to handle missing keys:
- Use .setdefault()
- Use .get()
- Use the key in dict idiom
- Use a try and except block
The Python docs explain .setdefault() and .get() as follows:
setdefault(key[, default])
If key is in the dictionary, return its value. If not, insert key with a value of default and return default. default defaults to None.
get(key[, default])
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.
(Source)
Here’s an example of how you can use .setdefault() to handle missing keys in a dictionary:
    >>> a_dict = {}
    >>> a_dict['missing_key']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
        a_dict['missing_key']
    KeyError: 'missing_key'
    >>> a_dict.setdefault('missing_key', 'default value')
    'default value'
    >>> a_dict['missing_key']
    'default value'
    >>> a_dict.setdefault('missing_key', 'another default value')
    'default value'
    >>> a_dict
    {'missing_key': 'default value'}
In the above code, you use .setdefault() to generate a default value for missing_key. Notice that your dictionary, a_dict, now has a new key called missing_key whose value is 'default value'. This key didn’t exist before you called .setdefault(). Finally, if you call .setdefault() on an existing key, then the call won’t have any effect on the dictionary. Your key will hold the original value instead of the new default value.
Note: In the above code example, you get an exception, and Python shows you a traceback message, which tells you that you’re trying to access a missing key in a_dict. If you want to dive deeper into how to decipher and understand a Python traceback, then check out Understanding the Python Traceback.
On the other hand, if you use .get(), then you can code something like this:
    >>> a_dict = {}
    >>> a_dict.get('missing_key', 'default value')
    'default value'
    >>> a_dict
    {}
Here, you use .get() to generate a default value for missing_key, but this time, your dictionary stays empty. This is because .get() returns the default value, but this value isn’t added to the underlying dictionary. For example, if you have a dictionary called D, then you can assume that .get() works something like this:
    D.get(key, default) -> D[key] if key in D, else default
With this pseudo-code, you can understand how .get() works internally. If the key exists, then .get() returns the value mapped to that key. Otherwise, the default value is returned. Your code never creates or assigns a value to key. If you don’t supply default, then it defaults to None.
You can also use conditional statements to handle missing keys in dictionaries. Take a look at the following example, which uses the key in dict idiom:
    >>> a_dict = {}
    >>> if 'key' in a_dict:
    ...     # Do something with 'key'...
    ...     a_dict['key']
    ... else:
    ...     a_dict['key'] = 'default value'
    ...
    >>> a_dict
    {'key': 'default value'}
In this code, you use an if statement along with the in operator to check if key is present in a_dict. If so, then you can perform any action with key or with its value. Otherwise, you create the new key, key, and assign it a 'default value'. Note that the above code works similarly to .setdefault() but takes four lines of code, while .setdefault() would only take one line (in addition to being more readable).
You can also work around the KeyError by using a try and except block to handle the exception. Consider the following piece of code:
    >>> a_dict = {}
    >>> try:
    ...     # Do something with 'key'...
    ...     a_dict['key']
    ... except KeyError:
    ...     a_dict['key'] = 'default value'
    ...
    >>> a_dict
    {'key': 'default value'}
The try and except block in the above example catches the KeyError whenever you try to get access to a missing key. In the except clause, you create the key and assign it a 'default value'.
Note: If missing keys are uncommon in your code, then you might prefer to use a try and except block (EAFP coding style) to catch the KeyError exception. This is because the code doesn’t check the existence of every key and only handles a few exceptions, if any.
On the other hand, if missing keys are quite common in your code, then the conditional statement (LBYL coding style) can be a better choice because checking for keys can be less costly than handling frequent exceptions.
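To compare the four approaches side by side, here is a minimal sketch that wraps each one in a small function. The function names are hypothetical and exist only for this illustration. Note that only the .get() variant leaves the dictionary untouched.

```python
def with_setdefault(d, key, default):
    # Inserts default when key is missing, then returns the stored value.
    return d.setdefault(key, default)


def with_get(d, key, default):
    # Returns default for a missing key without modifying d.
    return d.get(key, default)


def with_membership_test(d, key, default):
    # LBYL style: check for the key before using it.
    if key not in d:
        d[key] = default
    return d[key]


def with_try_except(d, key, default):
    # EAFP style: try the lookup and handle the failure.
    try:
        return d[key]
    except KeyError:
        d[key] = default
        return default


for handler in (with_setdefault, with_get, with_membership_test, with_try_except):
    a_dict = {}
    value = handler(a_dict, 'missing_key', 'default value')
    print(f'{handler.__name__}: returned {value!r}, dictionary is now {a_dict}')
```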
So far, you’ve learned how to handle missing keys using the tools that dict and Python offer you. However, the examples you saw here are quite verbose and hard to read. They might not be as straightforward as you might want. That’s why the Python standard library provides a more elegant, Pythonic, and efficient solution. That solution is collections.defaultdict, and that’s what you’ll be covering from now on.
Understanding the Python defaultdict Type
The Python standard library provides collections, which is a module that implements specialized container types. One of those is the Python defaultdict type, which is an alternative to dict that’s specifically designed to help you out with missing keys. defaultdict is a Python type that inherits from dict:
    >>> from collections import defaultdict
    >>> issubclass(defaultdict, dict)
    True
The above code shows that the Python defaultdict type is a subclass of dict. This means that defaultdict inherits most of the behavior of dict. So, you can say that defaultdict is much like an ordinary dictionary.
The main difference between defaultdict and dict is that when you try to access or modify a key that’s not present in the dictionary, a default value is automatically given to that key. In order to provide this functionality, the Python defaultdict type does two things:
- It overrides .__missing__().
- It adds .default_factory, a writable instance variable that needs to be provided at the time of instantiation.
The instance variable .default_factory will hold the first argument passed into defaultdict.__init__(). This argument can take a valid Python callable or None. If a callable is provided, then it’ll automatically be called by defaultdict whenever you try to access or modify the value associated with a missing key.
Note: All the remaining arguments to the class initializer are treated as if they were passed to the initializer of regular dict, including the keyword arguments.
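As a quick illustration of the note above, the following sketch passes a mapping and a keyword argument after the factory, just as you could with dict(). The key names are made up for the example.

```python
from collections import defaultdict

# The first argument becomes .default_factory. Everything after it is
# handled the way dict() would handle it: a mapping plus keyword arguments.
dd = defaultdict(list, {'letters': ['a', 'b']}, numbers=[1, 2])

print(dd['letters'])  # Existing key that came from the mapping
print(dd['numbers'])  # Existing key that came from the keyword argument
print(dd['missing'])  # Missing key, so .default_factory creates an empty list
print(dd)
```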
Take a look at how you can create and properly initialize a defaultdict:
    >>> # Correct instantiation
    >>> def_dict = defaultdict(list)  # Pass list to .default_factory
    >>> def_dict['one'] = 1  # Add a key-value pair
    >>> def_dict['missing']  # Access a missing key returns an empty list
    []
    >>> def_dict['another_missing'].append(4)  # Modify a missing key
    >>> def_dict
    defaultdict(<class 'list'>, {'one': 1, 'missing': [], 'another_missing': [4]})
Here, you pass list to .default_factory when you create the dictionary. Then, you use def_dict just like a regular dictionary. Note that when you try to access or modify the value mapped to a non-existent key, the dictionary assigns it the default value that results from calling list().
Keep in mind that you must pass a valid Python callable object to .default_factory, so remember not to call it using the parentheses at initialization time. This can be a common issue when you start using the Python defaultdict type. Take a look at the following code:
    >>> # Wrong instantiation
    >>> def_dict = defaultdict(list())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
        def_dict = defaultdict(list())
    TypeError: first argument must be callable or None
Here, you try to create a defaultdict by passing list() to .default_factory. Because list() returns an empty list, which isn’t callable, defaultdict raises a TypeError telling you that the first argument must be callable or None.
With this introduction to the Python defaultdict type, you can start coding with practical examples. The next few sections will walk you through some common use cases where you can rely on a defaultdict to provide an elegant, efficient, and Pythonic solution.
Using the Python defaultdict Type
Sometimes, you’ll use a mutable built-in collection (a list, dict, or set) as values in your Python dictionaries. In these cases, you’ll need to initialize the keys before first use, or you’ll get a KeyError. You can either do this process manually or automate it using a Python defaultdict. In this section, you’ll learn how to use the Python defaultdict type for solving some common programming problems:
- Grouping the items in a collection
- Counting the items in a collection
- Accumulating the values in a collection
You’ll be covering some examples that use list, set, int, and float to perform grouping, counting, and accumulating operations in a user-friendly and efficient way.
Grouping Items
A typical use of the Python defaultdict type is to set .default_factory to list and then build a dictionary that maps keys to lists of values. With this defaultdict, if you try to get access to any missing key, then the dictionary runs the following steps:
- Call list() to create a new empty list
- Insert the empty list into the dictionary using the missing key as key
- Return a reference to that list
This allows you to write code like this:
    >>> from collections import defaultdict
    >>> dd = defaultdict(list)
    >>> dd['key'].append(1)
    >>> dd
    defaultdict(<class 'list'>, {'key': [1]})
    >>> dd['key'].append(2)
    >>> dd
    defaultdict(<class 'list'>, {'key': [1, 2]})
    >>> dd['key'].append(3)
    >>> dd
    defaultdict(<class 'list'>, {'key': [1, 2, 3]})
Here, you create a Python defaultdict called dd and pass list to .default_factory. Notice that even though key isn’t defined yet, you can append values to it without getting a KeyError. That’s because dd automatically calls .default_factory to generate a default value for the missing key.
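The three steps that a defaultdict runs for a missing key can be mimicked with a plain dict. The sketch below uses a hypothetical helper, lookup_like_defaultdict(), purely to make those steps visible. It is not how defaultdict is implemented internally.

```python
from collections import defaultdict


def lookup_like_defaultdict(d, key, factory=list):
    """Hypothetical helper that mirrors the three steps on a plain dict."""
    if key not in d:
        value = factory()  # Step 1: call the factory to create a new object
        d[key] = value     # Step 2: insert it under the missing key
    return d[key]          # Step 3: return a reference to the stored object


plain = {}
lookup_like_defaultdict(plain, 'key').append(1)
lookup_like_defaultdict(plain, 'key').append(2)

dd = defaultdict(list)
dd['key'].append(1)
dd['key'].append(2)

print(plain)
print(dict(dd))
```

Because both versions return a reference to the stored list, calling .append() on the result changes the list that lives inside the dictionary.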
You can use defaultdict along with list to group the items in a sequence or a collection. Suppose that you’ve retrieved the following data from your company’s database:
Department | Employee Name |
---|---|
Sales | John Doe |
Sales | Martin Smith |
Accounting | Jane Doe |
Marketing | Elizabeth Smith |
Marketing | Adam Doe |
… | … |
With this data, you create an initial list of tuple objects like the following:
    dep = [('Sales', 'John Doe'),
           ('Sales', 'Martin Smith'),
           ('Accounting', 'Jane Doe'),
           ('Marketing', 'Elizabeth Smith'),
           ('Marketing', 'Adam Doe')]
Now, you need to create a dictionary that groups the employees by department. To do this, you can use a defaultdict as follows:
    from collections import defaultdict

    dep_dd = defaultdict(list)
    for department, employee in dep:
        dep_dd[department].append(employee)
Here, you create a defaultdict called dep_dd and use a for loop to iterate through your dep list. The statement dep_dd[department].append(employee) creates the keys for the departments, initializes them to an empty list, and then appends the employees to each department. Once you run this code, your dep_dd will look something like this:
    >>> dep_dd
    defaultdict(<class 'list'>, {'Sales': ['John Doe', 'Martin Smith'], 'Accounting': ['Jane Doe'], 'Marketing': ['Elizabeth Smith', 'Adam Doe']})
In this example, you group the employees by their department using a defaultdict with .default_factory set to list. To do this with a regular dictionary, you can use dict.setdefault() as follows:
    dep_d = dict()
    for department, employee in dep:
        dep_d.setdefault(department, []).append(employee)
This code is straightforward, and you’ll find similar code quite often in your work as a Python coder. However, the defaultdict version is arguably more readable, and for large datasets, it can also be faster, because .setdefault() builds a new empty list on every call, even when the key already exists. So, if speed is a concern for you, then you should consider using a defaultdict instead of a standard dict.
Grouping Unique Items
Continue working with the data of departments and employees from the previous section. After some processing, you realize that a few employees have been duplicated in the database by mistake. You need to clean up the data and remove the duplicated employees from your dep_dd dictionary. To do this, you can use a set as the .default_factory and rewrite your code as follows:
    dep = [('Sales', 'John Doe'),
           ('Sales', 'Martin Smith'),
           ('Accounting', 'Jane Doe'),
           ('Marketing', 'Elizabeth Smith'),
           ('Marketing', 'Elizabeth Smith'),
           ('Marketing', 'Adam Doe'),
           ('Marketing', 'Adam Doe'),
           ('Marketing', 'Adam Doe')]

    dep_dd = defaultdict(set)
    for department, employee in dep:
        dep_dd[department].add(employee)
In this example, you set .default_factory to set. Sets are collections of unique objects, which means that a set never holds repeated items. This property guarantees that you won’t have repeated employees in your final dictionary.
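To see the effect of using set as the factory, here is a self-contained sketch that runs the grouping on data with duplicates and prints the result. Sets are unordered, so the sketch sorts the names before printing to keep the output stable.

```python
from collections import defaultdict

dep = [('Sales', 'John Doe'),
       ('Sales', 'Martin Smith'),
       ('Accounting', 'Jane Doe'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Elizabeth Smith'),
       ('Marketing', 'Adam Doe'),
       ('Marketing', 'Adam Doe')]

dep_dd = defaultdict(set)
for department, employee in dep:
    # Adding a name that is already in the set has no effect
    dep_dd[department].add(employee)

for department in sorted(dep_dd):
    print(department, sorted(dep_dd[department]))
```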
Counting Items
If you set .default_factory to int, then your defaultdict will be useful for counting the items in a sequence or collection. When you call int() with no arguments, the function returns 0, which is the typical value you’d use to initialize a counter.
To continue with the example of the company database, suppose you want to build a dictionary that counts the number of employees per department. In this case, you can code something like this:
    >>> from collections import defaultdict
    >>> dep = [('Sales', 'John Doe'),
    ...        ('Sales', 'Martin Smith'),
    ...        ('Accounting', 'Jane Doe'),
    ...        ('Marketing', 'Elizabeth Smith'),
    ...        ('Marketing', 'Adam Doe')]
    >>> dd = defaultdict(int)
    >>> for department, _ in dep:
    ...     dd[department] += 1
    ...
    >>> dd
    defaultdict(<class 'int'>, {'Sales': 2, 'Accounting': 1, 'Marketing': 2})
Here, you set .default_factory to int, so every new department starts with a count of 0. You can use this default value to start counting the employees that work in each department. For this code to work correctly, you need a clean dataset with no repeated data. Otherwise, you’ll need to filter out the repeated employees first.
Another example of counting items is the mississippi example, where you count the number of times each letter in a word is repeated. Take a look at the following code:
    >>> from collections import defaultdict
    >>> s = 'mississippi'
    >>> dd = defaultdict(int)
    >>> for letter in s:
    ...     dd[letter] += 1
    ...
    >>> dd
    defaultdict(<class 'int'>, {'m': 1, 'i': 4, 's': 4, 'p': 2})
In the above code, you create a defaultdict with .default_factory set to int. This sets the default value for any given key to 0. Then, you use a for loop to traverse the string s and use an augmented assignment operation to add 1 to the counter in every iteration. The keys of dd will be the letters in mississippi.
Note: Python’s augmented assignment operators are a handy shortcut to common operations.
Take a look at the following examples:
- var += 1 is equivalent to var = var + 1
- var -= 1 is equivalent to var = var - 1
- var *= 1 is equivalent to var = var * 1
This is just a sample of how the augmented assignment operators work. You can take a look at the official documentation to learn more about this feature.
As counting is a relatively common task in programming, the Python dictionary-like class collections.Counter is specially designed for counting items in a sequence. With Counter, you can write the mississippi example as follows:
    >>> from collections import Counter
    >>> counter = Counter('mississippi')
    >>> counter
    Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
In this case, Counter does all the work for you! You only need to pass in a sequence, and the dictionary will count its items, storing them as keys and the counts as values. Note that this example works because Python strings are also a sequence type.
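The following sketch checks that both approaches produce the same counts for mississippi, and it shows two Counter conveniences: .most_common() and the fact that looking up a missing item returns 0 without adding a key. The comparison converts both containers to plain dictionaries to keep the check simple.

```python
from collections import Counter, defaultdict

word = 'mississippi'

# Manual counting with defaultdict(int)
dd = defaultdict(int)
for letter in word:
    dd[letter] += 1

# The same counting delegated to Counter
counter = Counter(word)

print(dict(dd) == dict(counter))  # Same letters and counts in both
print(counter.most_common(2))     # The two most frequent letters

# Unlike defaultdict, Counter doesn't insert a key on a missing lookup
print(counter['z'])
print('z' in counter)
```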
Accumulating Values
Sometimes you’ll need to calculate the total sum of the values in a sequence or collection. Let’s say you have the following Excel sheet with data about the sales of your Python website:
Products | July | August | September |
---|---|---|---|
Books | 1250.00 | 1300.00 | 1420.00 |
Tutorials | 560.00 | 630.00 | 750.00 |
Courses | 2500.00 | 2430.00 | 2750.00 |
Next, you process the data using Python and get the following list of tuple objects:
    incomes = [('Books', 1250.00),
               ('Books', 1300.00),
               ('Books', 1420.00),
               ('Tutorials', 560.00),
               ('Tutorials', 630.00),
               ('Tutorials', 750.00),
               ('Courses', 2500.00),
               ('Courses', 2430.00),
               ('Courses', 2750.00),]
With this data, you want to calculate the total income per product. To do that, you can use a Python defaultdict with float as .default_factory and then code something like this:
    1  from collections import defaultdict
    2
    3  dd = defaultdict(float)
    4  for product, income in incomes:
    5      dd[product] += income
    6
    7  for product, income in dd.items():
    8      print(f'Total income for {product}: ${income:,.2f}')
Here’s what this code does:
- In line 1, you import the Python defaultdict type.
- In line 3, you create a defaultdict object with .default_factory set to float.
- In line 4, you define a for loop to iterate through the items of incomes.
- In line 5, you use an augmented assignment operation (+=) to accumulate the incomes per product in the dictionary.
The second loop iterates through the items of dd and prints the incomes to your screen.
If you put all this code into a file called incomes.py and run it from your command line, then you’ll get the following output:
    $ python3 incomes.py
    Total income for Books: $3,970.00
    Total income for Tutorials: $1,940.00
    Total income for Courses: $7,680.00
You now have a summary of incomes per product, so you can make decisions on which strategy to follow for increasing the total income of your site.
Diving Deeper Into defaultdict
So far, you’ve learned how to use the Python defaultdict type by coding some practical examples. At this point, you can dive deeper into the type’s implementation and other working details. That’s what you’ll be covering in the next few sections.
defaultdict vs dict
For you to better understand the Python defaultdict type, a good exercise would be to compare it with its superclass, dict. If you want to know the methods and attributes that are specific to the Python defaultdict type, then you can run the following line of code:
    >>> set(dir(defaultdict)) - set(dir(dict))
    {'__copy__', 'default_factory', '__missing__'}
In the above code, you use dir() to get the list of valid attributes for dict and defaultdict. Then, you use a set difference to get the set of methods and attributes that you can only find in defaultdict. As you can see, the differences between these two classes are two methods and one instance attribute. The following table shows what the methods and the attribute are for:
Method or Attribute | Description |
---|---|
.__copy__() | Provides support for copy.copy() |
.default_factory | Holds the callable invoked by .__missing__() to automatically provide default values for missing keys |
.__missing__(key) | Gets called when .__getitem__() can’t find key |
In the above table, you can see the methods and the attribute that make a defaultdict different from a regular dict. The rest of the methods are the same in both classes.
Note: If you initialize a defaultdict using a valid callable, then you won’t get a KeyError when you try to get access to a missing key. Any key that doesn’t exist gets the value returned by .default_factory.
Additionally, you might notice that a defaultdict is equal to a dict with the same items:
    >>> std_dict = dict(numbers=[1, 2, 3], letters=['a', 'b', 'c'])
    >>> std_dict
    {'numbers': [1, 2, 3], 'letters': ['a', 'b', 'c']}
    >>> def_dict = defaultdict(list, numbers=[1, 2, 3], letters=['a', 'b', 'c'])
    >>> def_dict
    defaultdict(<class 'list'>, {'numbers': [1, 2, 3], 'letters': ['a', 'b', 'c']})
    >>> std_dict == def_dict
    True
Here, you create a regular dictionary std_dict with some arbitrary items. Then, you create a defaultdict with the same items. If you test both dictionaries for content equality, then you’ll see that they’re equal.
defaultdict.default_factory
The first argument to the Python defaultdict type must be a callable that takes no arguments and returns a value. This argument is assigned to the instance attribute, .default_factory. For this, you can use any callable, including functions, methods, classes, type objects, or any other valid callable. The default value of .default_factory is None.
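Since any zero-argument callable works, you aren’t limited to built-in types. The sketch below uses a lambda for a constant default, a regular function for a richer default, and dict for nested mappings. The function new_profile() and all the key names are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def new_profile():
    """Hypothetical factory: build a fresh profile for an unseen user."""
    return {'visits': 0, 'tags': []}


placeholders = defaultdict(lambda: 'N/A')  # Constant default value
profiles = defaultdict(new_profile)        # A new dict on every call
nested = defaultdict(dict)                 # One inner dict per missing key

print(placeholders['phone'])

profiles['alice']['visits'] += 1
print(dict(profiles))

nested['config']['debug'] = True
print(dict(nested))
```

Each call to the factory returns a new object, so the defaults for different keys never share state.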
If you instantiate defaultdict without passing a value to .default_factory, then the dictionary will behave like a regular dict and the usual KeyError will be raised for missing key lookup or modification attempts:
    >>> from collections import defaultdict
    >>> dd = defaultdict()
    >>> dd['missing_key']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
        dd['missing_key']
    KeyError: 'missing_key'
Here, you instantiate the Python defaultdict type with no arguments. In this case, the instance behaves like a standard dictionary. So, if you try to access or modify a missing key, then you’ll get the usual KeyError. From this point on, you can use dd as a normal Python dictionary and, unless you assign a new callable to .default_factory, you won’t be able to use the ability of defaultdict to handle missing keys automatically.
If you pass None to the first argument of defaultdict, then the instance will behave the same way you saw in the above example. That’s because .default_factory defaults to None, so both initializations are equivalent. On the other hand, if you pass a valid callable object to .default_factory, then you can use it to handle missing keys in a user-friendly way. Here’s an example where you pass list to .default_factory:
    >>> dd = defaultdict(list, letters=['a', 'b', 'c'])
    >>> dd.default_factory
    <class 'list'>
    >>> dd
    defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c']})
    >>> dd['numbers']
    []
    >>> dd
    defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c'], 'numbers': []})
    >>> dd['numbers'].append(1)
    >>> dd
    defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c'], 'numbers': [1]})
    >>> dd['numbers'] += [2, 3]
    >>> dd
    defaultdict(<class 'list'>, {'letters': ['a', 'b', 'c'], 'numbers': [1, 2, 3]})
In this example, you create a Python defaultdict called dd and pass list as its first argument. The second argument is a keyword argument called letters that holds a list of letters. You can see that .default_factory now holds the list class, which will be called when you need to supply a default value for any missing key.
Notice that when you try to access numbers, dd tests if numbers is in the dictionary. If it’s not, then it calls .default_factory(). Since .default_factory holds the list class, the returned value is an empty list ([]).
Now that dd['numbers'] is initialized with an empty list, you can use .append() to add elements to the list. You can also use an augmented assignment operator (+=) to extend the existing list [1] with [2, 3]. This way, you can handle missing keys in a more Pythonic and more efficient way.
On the other hand, if you pass a non-callable object to the initializer of the Python defaultdict type, then you’ll get a TypeError like in the following code:
    >>> defaultdict(0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
        defaultdict(0)
    TypeError: first argument must be callable or None
Here, you pass 0 to .default_factory. Since 0 is not a callable object, you get a TypeError telling you that the first argument must be callable or None. Any other kind of object is rejected.
Keep in mind that .default_factory is only called from .__getitem__(), by way of .__missing__(), and not from other methods. This means that if dd is a defaultdict and key is a missing key, then dd[key] will call .default_factory to provide a default value, but dd.get(key) still returns None instead of the value that .default_factory would provide. That’s because .get() doesn’t call .__getitem__() to retrieve the key.
Take a look at the following code:
    >>> dd = defaultdict(list)
    >>> # Calls dd.__getitem__('missing')
    >>> dd['missing']
    []
    >>> # Doesn't call dd.__getitem__('another_missing')
    >>> print(dd.get('another_missing'))
    None
    >>> dd
    defaultdict(<class 'list'>, {'missing': []})
In this code fragment, you can see that dd.get() returns None rather than the default value that .default_factory would provide. That’s because .default_factory is only called from .__missing__(), which is not called by .get().
Notice that you can also add arbitrary values to a Python defaultdict. This means that you’re not limited to values with the same type as the values generated by .default_factory. Here’s an example:
    >>> dd = defaultdict(list)
    >>> dd
    defaultdict(<class 'list'>, {})
    >>> dd['string'] = 'some string'
    >>> dd
    defaultdict(<class 'list'>, {'string': 'some string'})
    >>> dd['list']
    []
    >>> dd
    defaultdict(<class 'list'>, {'string': 'some string', 'list': []})
Here, you create a defaultdict and pass list to .default_factory. This sets your default values to be empty lists. However, you can freely add a new key that holds values of a different type. That’s the case with the key string, which holds a str object instead of a list object.
Finally, you can always change or update the callable you initially assign to .default_factory in the same way you would do with any instance attribute:
    >>> dd.default_factory = str
    >>> dd['missing_key']
    ''
In the above code, you change .default_factory from list to str. Now, whenever you try to get access to a missing key, your default value will be an empty string ('').
Depending on your use cases for the Python defaultdict type, you might need to freeze the dictionary once you finish creating it so that lookups no longer add keys. To do this, you can set .default_factory to None after you finish populating the dictionary. From then on, accessing a missing key raises a KeyError, as it would with a standard dict, and no more default values are generated automatically. Note that this doesn’t make the dictionary read-only, because you can still assign and delete keys explicitly.
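Here is a minimal sketch of that idea. After .default_factory is set to None, a lookup with a mistyped key raises a KeyError instead of silently creating an entry, while explicit assignment keeps working. The key names are arbitrary.

```python
from collections import defaultdict

dd = defaultdict(list)
dd['numbers'].append(1)    # Populating phase: missing keys are created

dd.default_factory = None  # Stop generating default values

try:
    dd['typo']             # Missing key now behaves as in a regular dict
except KeyError as error:
    print(f'Missing key: {error}')

dd['explicit'] = [2]       # Explicit assignment is still allowed
print(dd)
```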
defaultdict vs dict.setdefault()
As you saw before, dict provides .setdefault(), which will allow you to assign values to missing keys on the fly. In contrast, with a defaultdict you can specify the default value up front when you initialize the container. You can use .setdefault() to assign default values as follows:
    >>> d = dict()
    >>> d.setdefault('missing_key', [])
    []
    >>> d
    {'missing_key': []}
In this code, you create a regular dictionary and then use .setdefault() to assign a value ([]) to the key missing_key, which wasn’t defined yet.
Note: You can assign any type of Python object using .setdefault(). This is an important difference compared to defaultdict if you consider that defaultdict only accepts a callable or None.
On the other hand, if you use a defaultdict to accomplish the same task, then the default value is generated on demand whenever you try to access or modify a missing key. Notice that, with defaultdict, the default value is generated by the callable you pass upfront to the initializer of the class. Here’s how it works:
    >>> from collections import defaultdict
    >>> dd = defaultdict(list)
    >>> dd['missing_key']
    []
    >>> dd
    defaultdict(<class 'list'>, {'missing_key': []})
Here, you first import the Python defaultdict type from collections. Then, you create a defaultdict and pass list to .default_factory. When you try to get access to a missing key, defaultdict internally calls .default_factory(), which holds a reference to list, and assigns the resulting value (an empty list) to missing_key.
The code in the above two examples does the same work, but the defaultdict version is arguably more readable, user-friendly, Pythonic, and straightforward.
Note: A call to a built-in type like list, set, dict, str, int, or float with no arguments will return an empty object, or zero for numeric types.
Take a look at the following code examples:
    >>> list()
    []
    >>> set()
    set()
    >>> dict()
    {}
    >>> str()
    ''
    >>> float()
    0.0
    >>> int()
    0
In this code, you call some built-in types with no arguments and get an empty object or zero for the numeric types.
Finally, using a defaultdict to handle missing keys can be faster than using dict.setdefault(). Take a look at the following example:
    # Filename: exec_time.py

    from collections import defaultdict
    from timeit import timeit

    animals = [('cat', 1), ('rabbit', 2), ('cat', 3), ('dog', 4), ('dog', 1)]
    std_dict = dict()
    def_dict = defaultdict(list)

    def group_with_dict():
        for animal, count in animals:
            std_dict.setdefault(animal, []).append(count)
        return std_dict

    def group_with_defaultdict():
        for animal, count in animals:
            def_dict[animal].append(count)
        return def_dict

    print(f'dict.setdefault() takes {timeit(group_with_dict)} seconds.')
    print(f'defaultdict takes {timeit(group_with_defaultdict)} seconds.')
If you run the script from your system’s command line, then you’ll get something like this:
    $ python3 exec_time.py
    dict.setdefault() takes 1.0281260240008123 seconds.
    defaultdict takes 0.6704721650003194 seconds.
Here, you use timeit.timeit() to measure the execution time of group_with_dict() and group_with_defaultdict(). These functions perform equivalent actions, but the first uses dict.setdefault(), and the second uses a defaultdict. The measured times will depend on your hardware, but you can see here that defaultdict is faster than dict.setdefault(). This difference can become more important as the dataset gets larger.
Additionally, you need to consider that creating a regular dict can be faster than creating a defaultdict. Take a look at this code:
    >>> from timeit import timeit
    >>> from collections import defaultdict
    >>> print(f'dict() takes {timeit(dict)} seconds.')
    dict() takes 0.08921320698573254 seconds.
    >>> print(f'defaultdict() takes {timeit(defaultdict)} seconds.')
    defaultdict() takes 0.14101867799763568 seconds.
This time, you use timeit.timeit() to measure the execution time of dict and defaultdict instantiation. Notice that creating a dict takes roughly two-thirds of the time of creating a defaultdict in this run. This might not be a problem if you consider that, in real-world code, you normally instantiate defaultdict only once.
Also notice that, by default, timeit.timeit() will run your code a million times. That’s the reason for defining std_dict and def_dict out of the scope of group_with_dict() and group_with_defaultdict() in exec_time.py. Otherwise, the time measure will be affected by the instantiation time of dict and defaultdict.
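A side effect of sharing std_dict and def_dict across a million runs is that their lists keep growing. If you would rather measure grouping into a fresh container each time, accepting that instantiation time is then included, a variant along these lines is one option. This is only a sketch: the number and repeat values are arbitrary choices, and the timings you get will depend on your machine.

```python
from collections import defaultdict
from timeit import repeat

animals = [('cat', 1), ('rabbit', 2), ('cat', 3), ('dog', 4), ('dog', 1)]


def group_with_dict():
    grouped = {}  # Fresh container on every call
    for animal, count in animals:
        grouped.setdefault(animal, []).append(count)
    return grouped


def group_with_defaultdict():
    grouped = defaultdict(list)  # Fresh container on every call
    for animal, count in animals:
        grouped[animal].append(count)
    return grouped


for func in (group_with_dict, group_with_defaultdict):
    # Take the best of several runs to reduce noise from other processes
    best = min(repeat(func, number=10_000, repeat=3))
    print(f'{func.__name__}: {best:.4f} seconds for 10,000 calls')
```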
At this point, you may have an idea of when to use a defaultdict rather than a regular dict. Here are three things to take into account:
If your code is heavily based on dictionaries and you’re dealing with missing keys all the time, then you should consider using a defaultdict rather than a regular dict.
If your dictionary items need to be initialized with a constant default value, then you should consider using a defaultdict instead of a dict.
If your code relies on dictionaries for aggregating, accumulating, counting, or grouping values, and performance is a concern, then you should consider using a defaultdict.
You can consider the above guidelines when deciding whether to use a dict or a defaultdict.
defaultdict.__missing__()
Behind the scenes, the Python defaultdict
type works by calling .default_factory
to supply default values to missing keys. The mechanism that makes this possible is .__missing__()
, a special method that dict.__getitem__() calls when a key is missing. A plain dict doesn’t define it, but dict subclasses such as defaultdict can.
Note: .__missing__() is automatically called by .__getitem__() to handle missing keys, and .__getitem__() is in turn called automatically by Python for subscription operations like d[key].
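You can observe this call chain with a small dict subclass. LoudDict is a hypothetical class used only for illustration, and its .__missing__() returns a fallback without storing it, which is a deliberate simplification:

```python
class LoudDict(dict):
    """A hypothetical dict subclass that reports when __missing__() runs."""

    def __missing__(self, key):
        print(f"__missing__({key!r}) called")
        return "fallback"


d = LoudDict(present=1)
present_value = d["present"]  # Key exists, so __missing__() is not involved
missing_value = d["absent"]   # Key is missing, so __getitem__() delegates to __missing__()
```

Only the second lookup prints a message, because .__getitem__() calls .__missing__() only after failing to find the key.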
So, how does .__missing__()
work? If you set .default_factory
to None
, then .__missing__()
raises a KeyError
with the key
as an argument. Otherwise, .default_factory
is called without arguments to provide a default value
for the given key
. This value
is inserted into the dictionary and finally returned. If calling .default_factory
raises an exception, then the exception is propagated unchanged.
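The three outcomes described above can be checked with a short sketch. The name broken_factory is a hypothetical helper introduced only to trigger an exception:

```python
from collections import defaultdict

# With .default_factory set to None, a missing key raises KeyError.
no_factory = defaultdict(None)
try:
    no_factory["missing"]
except KeyError as error:
    print(f"KeyError: {error}")

# With a callable factory, the default is inserted and then returned.
with_factory = defaultdict(list)
value = with_factory["missing"]
print(value, dict(with_factory))


# An exception raised inside the factory propagates unchanged.
def broken_factory():
    raise RuntimeError("factory failed")


broken = defaultdict(broken_factory)
try:
    broken["missing"]
except RuntimeError as error:
    print(f"RuntimeError: {error}")
```

Note that when the factory raises, no key is added to the dictionary.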
The following code shows a viable Python implementation for .__missing__()
:
1 def __missing__(self, key):
2     if self.default_factory is None:
3         raise KeyError(key)
4     if key not in self:
5         self[key] = self.default_factory()
6     return self[key]
Here’s what this code does:
- In line 1, you define the method and its signature.
- In lines 2 and 3, you test to see if .default_factory is None. If so, then you raise a KeyError with the key as an argument.
- In lines 4 and 5, you check if the key is not in the dictionary. If it’s not, then you call .default_factory and assign its return value to the key.
- In line 6, you return the value stored under the key, as expected.
Keep in mind that the presence of .__missing__()
in a mapping has no effect on the behavior of other methods that look up keys, such as .get()
or .__contains__()
, which implements the in
operator. That’s because .__missing__()
is only called by .__getitem__()
when the requested key
is not found in the dictionary. Whatever .__missing__()
returns or raises is then returned or raised by .__getitem__()
.
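A quick way to confirm this is to compare .get() and the in operator with a subscription on an empty defaultdict. This is a minimal sketch with illustrative variable names:

```python
from collections import defaultdict

dd = defaultdict(list)

# Neither .get() nor the in operator triggers __missing__(),
# so no key is created and no default is generated.
got = dd.get("missing")
found = "missing" in dd
size_before = len(dd)

# Subscription goes through __getitem__(), which does call __missing__().
subscripted = dd["missing"]
size_after = len(dd)

print(got, found, size_before)   # None False 0
print(subscripted, size_after)   # [] 1
```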
Now that you’ve covered an alternative Python implementation for .__missing__()
, it would be a good exercise to try to emulate defaultdict
with some Python code. That’s what you’ll be doing in the next section.
Emulating the Python defaultdict Type
In this section, you’ll be coding a Python class that will behave much like a defaultdict
. To do that, you’ll subclass collections.UserDict
and then add .__missing__()
. Also, you need to add an instance attribute called .default_factory
, which will hold the callable for generating default values on demand. Here’s a piece of code that emulates most of the behavior of the Python defaultdict
type:
 1 import collections
 2
 3 class my_defaultdict(collections.UserDict):
 4     def __init__(self, default_factory=None, *args, **kwargs):
 5         super().__init__(*args, **kwargs)
 6         if not callable(default_factory) and default_factory is not None:
 7             raise TypeError('first argument must be callable or None')
 8         self.default_factory = default_factory
 9
10     def __missing__(self, key):
11         if self.default_factory is None:
12             raise KeyError(key)
13         if key not in self:
14             self[key] = self.default_factory()
15         return self[key]
Here’s how this code works:
In line 1, you import collections
to get access to UserDict
.
In line 3, you create a class that subclasses UserDict
.
In line 4, you define the class initializer .__init__()
. This method takes an argument called default_factory
to hold the callable that you’ll use to generate the default values. Notice that default_factory
defaults to None
, just like in a defaultdict
. You also need the *args
and **kwargs
for emulating the normal behavior of a regular dict
.
In line 5, you call the superclass .__init__()
. This means that you’re calling UserDict.__init__()
and passing *args
and **kwargs
to it.
In line 6, you first check if default_factory
is a valid callable object. In this case, you use callable(object)
, which is a built-in function that returns True
if object
appears to be a callable and otherwise returns False
. This check ensures that you can call .default_factory()
if you need to generate a default value
for any missing key
. Then, you check if .default_factory
is not None
.
In line 7, you raise a TypeError, just like a real defaultdict does when default_factory is neither a callable nor None.
In line 8, you initialize .default_factory
.
In line 10, you define .__missing__()
, which is implemented as you saw before. Recall that .__missing__()
is automatically called by .__getitem__()
when a given key
is not in a dictionary.
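The type check in lines 6 and 7 mirrors what the built-in class does. As a rough check, you can pass a non-callable object to the real defaultdict and inspect the error. The exact wording of the message is a CPython detail and may differ in other implementations:

```python
from collections import defaultdict

# A first argument that is neither callable nor None is rejected at creation time.
try:
    defaultdict(42)
except TypeError as error:
    message = str(error)
    print(message)

# None is accepted and simply disables default generation.
plain = defaultdict(None)
print(plain.default_factory)  # None
```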
If you feel in the mood to read some C code, then you can take a look at the full code for the Python defaultdict
type in the CPython source code.
Now that you’ve finished coding this class, you can test it by putting the code into a Python script called my_dd.py
and importing it from an interactive session. Here’s an example:
>>> from my_dd import my_defaultdict
>>> dd_one = my_defaultdict(list)
>>> dd_one
{}
>>> dd_one['missing']
[]
>>> dd_one
{'missing': []}
>>> dd_one.default_factory = int
>>> dd_one['another_missing']
0
>>> dd_one
{'missing': [], 'another_missing': 0}
>>> dd_two = my_defaultdict(None)
>>> dd_two['missing']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    dd_two['missing']
  File "/home/user/my_dd.py", line 10, in __missing__
    raise KeyError(key)
KeyError: 'missing'
Here, you first import my_defaultdict
from my_dd
. Then, you create an instance of my_defaultdict
and pass list
to .default_factory
. If you try to get access to a key with a subscription operation, like dd_one['missing']
, then .__getitem__()
is automatically called by Python. If the key is not in the dictionary, then .__missing__()
is called, which generates a default value by calling .default_factory()
.
You can also change the callable assigned to .default_factory
using a normal assignment operation like in dd_one.default_factory = int
. Finally, if you pass None
to .default_factory
, then you’ll get a KeyError
when trying to retrieve a missing key.
Note: The behavior of a defaultdict
is essentially the same as this Python equivalent. However, you’ll soon note that your Python implementation doesn’t print as a real defaultdict
but as a standard dict
. You can modify this detail by overriding .__str__()
and .__repr__()
.
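One possible way to do that is sketched below. It repeats the class so that the block runs on its own, and the output format is an assumption modeled on how the built-in defaultdict prints:

```python
import collections


class my_defaultdict(collections.UserDict):
    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not callable(default_factory) and default_factory is not None:
            raise TypeError('first argument must be callable or None')
        self.default_factory = default_factory

    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        if key not in self:
            self[key] = self.default_factory()
        return self[key]

    def __repr__(self):
        # Mimic the built-in layout: class name, factory, then contents.
        # .data is the internal dict that UserDict uses for storage.
        return f"{type(self).__name__}({self.default_factory!r}, {self.data!r})"


dd = my_defaultdict(list)
dd['missing'].append(1)
print(dd)  # my_defaultdict(<class 'list'>, {'missing': [1]})
```

Overriding .__repr__() alone is enough here, because print() falls back to .__repr__() when a class doesn’t define its own .__str__().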
You may be wondering why you subclass collections.UserDict
instead of a regular dict
for this example. The main reason for this is that subclassing built-in types can be error-prone because the C code of the built-ins doesn’t seem to consistently call special methods overridden by the user.
Here’s an example that shows some issues that you can face when subclassing dict
:
>>> class MyDict(dict):
...     def __setitem__(self, key, value):
...         super().__setitem__(key, None)
...
>>> my_dict = MyDict(first=1)
>>> my_dict
{'first': 1}
>>> my_dict['second'] = 2
>>> my_dict
{'first': 1, 'second': None}
>>> my_dict.setdefault('third', 3)
3
>>> my_dict
{'first': 1, 'second': None, 'third': 3}
In this example, you create MyDict
, which is a class that subclasses dict
. Your implementation of .__setitem__()
always sets values to None
. If you create an instance of MyDict
and pass a keyword argument to its initializer, then you’ll notice the class is not calling your .__setitem__()
to handle the assignment. You know that because the key first
wasn’t assigned None
.
By contrast, if you run a subscription operation like my_dict['second'] = 2
, then you’ll notice that second
is set to None
rather than to 2
. So, this time you can say that subscription operations call your custom .__setitem__()
. Finally, notice that .setdefault()
doesn’t call .__setitem__()
either, because your third
key ends up with a value of 3
.
UserDict
doesn’t inherit from dict
but simulates the behavior of a standard dictionary. The class has an internal dict
instance called .data
, which is used to store the content of the dictionary. UserDict
is a more reliable class when it comes to creating custom mappings. If you use UserDict
, then you’ll be avoiding the issues you saw before. To prove this, go back to the code for my_defaultdict
and add the following method:
class my_defaultdict(collections.UserDict):
    # Snip
    def __setitem__(self, key, value):
        print('__setitem__() gets called')
        super().__setitem__(key, None)
Here, you add a custom .__setitem__() that prints a message and then calls the superclass .__setitem__() with None as the value, so every key ends up set to None
. Update this code in your script my_dd.py
and import it from an interactive session as follows:
>>> from my_dd import my_defaultdict
>>> my_dict = my_defaultdict(list, first=1)
__setitem__() gets called
>>> my_dict
{'first': None}
>>> my_dict['second'] = 2
__setitem__() gets called
>>> my_dict
{'first': None, 'second': None}
In this case, when you instantiate my_defaultdict
and pass first
to the class initializer, your custom __setitem__()
gets called. Also, when you assign a value to the key second
, __setitem__()
gets called as well. You now have a my_defaultdict
that consistently calls your custom special methods. Notice that all the values in the dictionary are equal to None
now.
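The same consistency extends to .setdefault() and .update(), the methods that bypassed your override in the dict subclass. The following sketch uses a hypothetical NoneValues class, separate from my_defaultdict, to keep the example small:

```python
from collections import UserDict


class NoneValues(UserDict):
    """Hypothetical mapping that stores None for every key."""

    def __setitem__(self, key, value):
        super().__setitem__(key, None)


d = NoneValues(first=1)
d["second"] = 2
returned = d.setdefault("third", 3)  # Returns the default it was given...
d.update(fourth=4)

print(returned)  # 3
print(d.data)    # ...but every stored value went through __setitem__()
```

Here, .data is the plain internal dict, and every value in it is None because each write path ends up calling your .__setitem__().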
Passing Arguments to .default_factory
As you saw earlier, .default_factory
must be set to a callable object that takes no argument and returns a value. This value will be used to supply a default value for any missing key in the dictionary. Even when .default_factory
shouldn’t take arguments, Python offers some tricks that you can use if you need to supply arguments to it. In this section, you’ll cover two Python tools that can serve this purpose:
lambda
functools.partial()
With these two tools, you can add extra flexibility to the Python defaultdict
type. For example, you can initialize a defaultdict
with a callable that takes an argument and, after some processing, you can update the callable with a new argument to change the default value for the keys you’ll create from this point on.
Using lambda
A flexible way to pass arguments to .default_factory
is to use lambda
. Suppose you want to create a function to generate default values in a defaultdict
. The function does some processing and returns a value, but you need to pass an argument for the function to work correctly. Here’s an example:
>>> def factory(arg):
...     # Do some processing here...
...     result = arg.upper()
...     return result
...
>>> def_dict = defaultdict(lambda: factory('default value'))
>>> def_dict['missing']
'DEFAULT VALUE'
In the above code, you create a function called factory()
. The function takes an argument, does some processing, and returns the final result. Then, you create a defaultdict
and use lambda
to pass the string 'default value'
to factory()
. When you try to get access to a missing key, the following steps are run:
- The dictionary def_dict calls its .default_factory, which holds a reference to a lambda function.
- The lambda function gets called and returns the value that results from calling factory() with 'default value' as an argument.
If you’re working with def_dict
and suddenly need to change the argument to factory()
, then you can do something like this:
>>> def_dict.default_factory = lambda: factory('another default value')
>>> def_dict['another_missing']
'ANOTHER DEFAULT VALUE'
This time, factory()
takes a new string argument ('another default value'
). From now on, if you try to access or modify a missing key, then you’ll get a new default value, which is the string 'ANOTHER DEFAULT VALUE'
.
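One detail worth checking is that rebinding .default_factory only affects keys created afterwards. The sketch below repeats factory() so that it runs on its own, and the variable names first, second, and unchanged are illustrative:

```python
from collections import defaultdict


def factory(arg):
    return arg.upper()


def_dict = defaultdict(lambda: factory('default value'))
first = def_dict['missing']

# The new factory must itself be a callable, so it's wrapped in a lambda.
def_dict.default_factory = lambda: factory('another default value')
second = def_dict['another_missing']

# Keys that already exist keep the value they were created with.
unchanged = def_dict['missing']

print(first)      # DEFAULT VALUE
print(second)     # ANOTHER DEFAULT VALUE
print(unchanged)  # DEFAULT VALUE
```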
Finally, you might face a situation where you need a default value that’s different from 0
or []
. In this case, you can also use lambda
to generate a different default value. For example, suppose you have a list
of integer numbers, and you need to calculate the cumulative product of each number. Then, you can use a defaultdict
along with lambda
as follows:
>>> from collections import defaultdict
>>> lst = [1, 1, 2, 1, 2, 2, 3, 4, 3, 3, 4, 4]
>>> def_dict = defaultdict(lambda: 1)
>>> for number in lst:
...     def_dict[number] *= number
...
>>> def_dict
defaultdict(<function <lambda> at 0x...70>, {1: 1, 2: 8, 3: 27, 4: 64})
Here, you use lambda
to supply a default value of 1
. With this initial value, you can calculate the cumulative product of each number in lst
. Notice that you can’t get the same result using int
because the default value returned by int
is always 0
, which is not a good initial value for the multiplication operations you need to perform here.
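To make the difference concrete, the following sketch runs the same loop with both factories side by side. The names with_int and with_lambda are illustrative:

```python
from collections import defaultdict

lst = [1, 1, 2, 1, 2, 2, 3, 4, 3, 3, 4, 4]

with_int = defaultdict(int)           # Missing keys start at 0
with_lambda = defaultdict(lambda: 1)  # Missing keys start at 1

for number in lst:
    with_int[number] *= number
    with_lambda[number] *= number

print(dict(with_int))     # {1: 0, 2: 0, 3: 0, 4: 0}
print(dict(with_lambda))  # {1: 1, 2: 8, 3: 27, 4: 64}
```

With int, every product starts at 0 and stays there, while the lambda version starts each product at the multiplicative identity.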
Using functools.partial()
functools.partial(func, *args, **keywords) is a function that returns a partial object. When you call this object, it behaves as if you had called func() with the positional arguments (args) and keyword arguments (keywords) that you passed to partial()
and use it to pass arguments to .default_factory
in a Python defaultdict
. Here’s an example:
>>> def factory(arg):
...     # Do some processing here...
...     result = arg.upper()
...     return result
...
>>> from functools import partial
>>> def_dict = defaultdict(partial(factory, 'default value'))
>>> def_dict['missing']
'DEFAULT VALUE'
>>> def_dict.default_factory = partial(factory, 'another default value')
>>> def_dict['another_missing']
'ANOTHER DEFAULT VALUE'
Here, you create a Python defaultdict
and use partial()
to supply an argument to .default_factory
. Notice that you can also update .default_factory
to use another argument for the callable factory()
. This kind of behavior can add a lot of flexibility to your defaultdict
objects.
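Since partial() also accepts keyword arguments, you can use it to pre-configure a factory that builds richer default values. The function make_record() and the job names below are hypothetical and exist only for this sketch:

```python
from collections import defaultdict
from functools import partial


def make_record(status, retries=0):
    """Hypothetical factory that builds a fresh record on every call."""
    return {"status": status, "retries": retries}


# Freeze one positional and one keyword argument into the factory.
jobs = defaultdict(partial(make_record, "pending", retries=3))

jobs["job-1"]["retries"] -= 1
jobs["job-2"]["status"] = "done"

print(dict(jobs))
# {'job-1': {'status': 'pending', 'retries': 2}, 'job-2': {'status': 'done', 'retries': 3}}
```

Each missing key gets its own new dictionary, so changing one record doesn’t affect the others.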
Conclusion
The Python defaultdict
type is a dictionary-like data structure provided by the Python standard library in a module called collections
. The class inherits from dict
, and its main added functionality is to supply default values for missing keys. In this tutorial, you’ve learned how to use the Python defaultdict
type for handling the missing keys in a dictionary.
You’re now able to:
- Create and use a Python
defaultdict
to handle missing keys - Solve real-world problems related to grouping, counting, and accumulating operations
- Know the implementation differences between
defaultdict
and dict
- Decide when and why to use a Python
defaultdict
rather than a standard dict
The Python defaultdict
type is a convenient and efficient data structure that’s designed to help you out when you’re dealing with missing keys in a dictionary. Give it a try and make your code faster, more readable, and more Pythonic!